
The Santander Customer Competition

Essay by   •  July 22, 2016  •  Research Paper  •  1,459 Words (6 Pages)  •  1,139 Views


Santander Customer Satisfaction

March 28, 2016

1 Table of Contents

• Intro
• Results
• Appendix
  - Data
  - Principal Component Analysis
  - Feature selection
  - Fitting the model
  - NearestNeighbour
  - LogisticClassifier
  - XGBoost
  - RandomForestClassifier
  - ExtraTreesClassifier
• Evan's suggestions
• Model tuning
• Kaggle submission
• Resources
• Authors


2 Intro

The Santander Customer Competition on Kaggle provides us with a synthetic data set with 370 numerical variables. Using those variables, the task is to predict whether a customer is satisfied or not. The evaluation metric is the area under the ROC curve (ROC AUC).
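To make the metric concrete, here is a minimal sketch of ROC AUC computed from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the competition code.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    This O(n_pos * n_neg) version is meant for illustration, not for speed.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy example with made-up values: 1 marks the positive class.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why differences in the third decimal place matter on the leaderboard.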

3 Results

The data set contains synthetic data, i.e. it is anonymized. It included duplicate columns and columns with zero variance (a standard deviation of zero), which we removed. This step reduced the number of independent variables from 370 to 308.
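The cleaning step described above can be sketched as follows. This is one possible way to do it with pandas, not necessarily the code used for the report; the helper name `drop_constant_and_duplicate_columns` and the toy column names are hypothetical.

```python
import pandas as pd


def drop_constant_and_duplicate_columns(df):
    """Return a copy of df without zero-variance and duplicated columns."""
    # A column with a single distinct value has a standard deviation of zero.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    reduced = df.drop(columns=constant)
    # Transpose so that duplicated() compares whole columns; keep the first copy.
    duplicate_mask = reduced.T.duplicated(keep="first")
    return reduced.loc[:, ~duplicate_mask.values]


# Toy frame with hypothetical column names.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [0, 0, 0],   # zero variance
    "c": [1, 2, 3],   # duplicate of "a"
    "d": [3, 1, 2],
})
cleaned = drop_constant_and_duplicate_columns(toy)
print(list(cleaned.columns))  # ['a', 'd']
```

Transposing a wide frame can be slow, so for a few hundred columns a pairwise comparison of columns may be preferable. Whatever is dropped from the training set must also be dropped from the test set.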

Using principal component analysis, we can see in the plot (see appendix) that the two clusters of customers, satisfied and not, overlap considerably, which makes it more difficult for the classifiers to perform well. Five principal components explain 96% of the variance in the data (see appendix). At this point we train the classifiers on the original data after performing feature selection.
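As an illustration of how a figure such as "five components explain 96%" is obtained, the sketch below computes the explained-variance ratio of each principal component from the singular values of the centered data. The data here is randomly generated stand-in data, and the function name `explained_variance_ratio` is an assumption; the report does not state whether the features were standardized first, so this sketch leaves them unscaled.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of the total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances / variances.sum()


# Synthetic stand-in data: 5 features with very different scales.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
ratios = explained_variance_ratio(toy_X)
cumulative = np.cumsum(ratios)
print(cumulative)  # the k-th entry is the share explained by the first k components
```

On unscaled data the components are dominated by the features with the largest numeric range, so a high cumulative share after a few components does not by itself mean the remaining features are uninformative.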

For feature selection we opted for a randomized trees classifier, also known as extra trees, which computes an importance coefficient for each feature; we use these coefficients to select features. We end up with 36 important features, which reduces the number of independent variables from 308 to 36. We use these 36 features to train our classifiers.
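A minimal sketch of importance-based selection with scikit-learn is shown below. It uses randomly generated stand-in data, and the mean-importance threshold is an assumption: the report does not say how the cut-off that yields 36 features was chosen.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(500, 10))
toy_y = (toy_X[:, 0] + toy_X[:, 1] > 0).astype(int)

# Fit the randomized trees and read off one importance score per feature.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(toy_X, toy_y)
importances = forest.feature_importances_

# Keep the features whose importance is at least the mean importance
# (assumed threshold, chosen here only for illustration).
selector = SelectFromModel(forest, threshold="mean", prefit=True)
toy_X_selected = selector.transform(toy_X)
print(importances.round(3))
print(toy_X_selected.shape)
```

The same fitted selector should be applied to the test set so that both sets keep identical columns.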

We tried different models to see which performs best, with the intent to concentrate on the promising candidates and improve them by tuning parameters through statistical analysis. We expected gradient tree boosting, as implemented in the xgboost Python module, to be the best model since it is an ensemble method.
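One way to run such a comparison is to score every candidate with cross-validated ROC AUC, as sketched below. The data is generated with `make_classification` as a stand-in for the selected features, the hyperparameters are arbitrary, and the cross-validation setup is an assumption rather than the procedure used in the report. XGBoost is left out to keep the sketch dependent on scikit-learn only; its scikit-learn-compatible `XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected features (not the competition data).
toy_X, toy_y = make_classification(
    n_samples=600, n_features=10, n_informative=4, random_state=0,
)

candidates = {
    "nearest neighbours": KNeighborsClassifier(n_neighbors=15),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Mean ROC AUC over 5 stratified folds for each candidate.
mean_auc = {
    name: cross_val_score(model, toy_X, toy_y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, score in sorted(mean_auc.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Scoring with `scoring="roc_auc"` keeps the comparison aligned with the competition metric instead of plain accuracy.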

In the end, the xgboost classifier performed best, with an area under the ROC curve of 0.838. For comparison, the best score on the Kaggle leaderboard was 0.842 as of 3/14/2016. Unfortunately, our score puts us only in the top 800, so there is room for improvement.

After tuning our XGBoost model and running it on three datasets with randomly sampled variables, we were able to improve our score on Kaggle to 0.83905, which put us in the top 400 as of 3/10/2016. The competition is open until May 2nd and we are working on improving our model. At this point we present our best solution so far. Going forward, we will use more sophisticated sampling methods and a mix of classifiers in addition to XGBoost, such as AdaBoost and ExtraTreesClassifier, to make sure there is no overfitting on the training set. We are also exploring the use of neural nets; however, to achieve significant improvements in the ROC AUC score we believe we need to focus on feature engineering.

4 Reproducibility

All code used for this project is provided in the appendix. Additionally, you can find the IPython notebook version of this write-up at https://goo.gl/654D3o. The online version has helpful links.


5 Appendix

5.1 Data

First, we read in the data.

In [ ]: import pandas as pd

        # Load the Kaggle training and test sets from the working directory.
        train = pd.read_csv("train.csv")
        test = pd.read_csv("test.csv")

In [ ]: train.iloc[:,0:5].head()

In [3]: train.iloc[:,0:5].describe()

Out[3]:
                  ID            var3         var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1
count   76020.000000    76020.000000  76020.000000        76020.000000             76020.000000
mean    75964.050723    -1523.199277     33.212865           86.208265                72.363067
std     43781.947379    39033.462364     12.956486         1614.757313               339.315831
min         1.000000  -999999.000000      5.000000            0.000000                 0.000000
25%     38104.750000        2.000000     23.000000            0.000000                 0.000000
50%     76043.000000        2.000000     28.000000            0.000000                 0.000000
75%    113748.750000        2.000000     40.000000            0.000000                 0.000000
max    151838.000000      238.000000    105.000000       210000.000000             12888.030000

In [121]: # Number of rows and columns, including the dependent variable TARGET
          train.shape

Out[121]:

...

...
