The Santander Customer Competition

Essay by vvghelu • July 22, 2016 • Research Paper • 1,459 Words (6 Pages) • 1,404 Views

Santander Customer Satisfaction

March 28, 2016

1 Table of Contents

Intro
Results
Appendix
  - Data
  - Principal Component Analysis
  - Feature selection
  - Fitting the model
  - NearestNeighbour
  - LogisticClassifier
  - XGBoost
  - RandomForestClassifier
  - ExtraTreesClassifier
Evan's suggestions
Model tuning
Kaggle submission
Resources
Authors

2 Intro

The Santander Customer Competition on Kaggle provides an anonymized data set with 370 numerical variables. The task is to use those variables to predict whether a customer is satisfied or not. The evaluation metric is ROC AUC.
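To make the metric concrete, here is a minimal sketch of ROC AUC computed directly from its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
def roc_auc(labels, scores):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example (made-up values): 1 marks the positive class
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The double loop is quadratic in the number of rows, so for a data set of this size a library routine such as scikit-learn's `roc_auc_score` would be the practical choice. The sketch only shows what the number means: 0.5 is random ranking and 1.0 is perfect ranking.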

3 Results

The data set is anonymized, so the variables carry no descriptive names. It contained duplicate columns and columns with zero variance (a standard deviation of zero), which we removed. This step reduced the number of independent variables from 370 to 308.
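The cleaning step can be sketched with pandas as follows. This is one possible approach under stated assumptions, not necessarily the code the authors ran: the helper name `drop_uninformative_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def drop_uninformative_columns(df):
    """Drop constant columns, then columns that exactly repeat an earlier one."""
    # A column with a single distinct value has zero variance
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    df = df.drop(columns=constant)
    # Transposing turns columns into rows, so duplicated() flags repeated columns
    dup_mask = df.T.duplicated()
    duplicated = dup_mask[dup_mask].index.tolist()
    return df.drop(columns=duplicated), constant, duplicated


# Toy frame (made-up values) with one duplicate and one constant column
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],   # duplicate of "a"
    "c": [0, 0, 0],   # zero variance
    "d": [3, 1, 2],
})
cleaned, constant, duplicated = drop_uninformative_columns(toy)
print(list(cleaned.columns))  # ['a', 'd']
```

Any columns dropped from the training set would need to be dropped from the test set as well, so that both keep the same features.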

Using principal component analysis, we can see in the plot (see appendix) that the two clusters of customers, satisfied and not, overlap considerably, which makes it more difficult for the classifiers to perform well. Five principal components explain 96% of the variance in the data (see appendix). At this point we train the classifiers on the original data, after performing feature selection, rather than on the principal components.
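As a rough illustration of what "explained variance" means here, the sketch below computes the share of variance captured by each principal component from the singular values of the centered data. The random toy matrix and the function name are assumptions for illustration. The report does not say whether the variables were standardized before PCA, so this sketch only centers them.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)          # PCA operates on centered data
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2       # proportional to component variances
    return variances / variances.sum()


rng = np.random.default_rng(0)
# Toy data (made up): three columns with very different spreads
toy = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])
ratios = explained_variance_ratio(toy)
cumulative = np.cumsum(ratios)
print(cumulative)  # cumulative share explained by the first 1, 2, 3 components
```

Reading off the first index where the cumulative share passes a threshold gives the number of components needed, which is the kind of figure quoted above. Without standardization, columns with large numeric ranges dominate the leading components.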

For feature selection we opted for an extremely randomized trees classifier (extra trees), which computes an importance coefficient for each feature; we keep the features with the highest importance. This step reduced the number of independent variables from 308 to 36. We use these 36 features to train our classifiers.
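A minimal sketch of importance-based selection with scikit-learn's `ExtraTreesClassifier` is shown below. The synthetic data, the number of trees, and the "above mean importance" cutoff are all assumptions made for illustration. The report does not state which threshold produced its 36 features.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for the training data: only the first two features matter
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
# Assumed cutoff: keep features whose importance exceeds the mean importance
kept = np.flatnonzero(importances > importances.mean())
X_selected = X[:, kept]
print(kept, X_selected.shape)
```

scikit-learn's `SelectFromModel` wraps the same idea and applies a mean-importance threshold by default for tree ensembles. The manual version above just makes the cutoff explicit.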

We tried different models to see which performs best, with the intent of concentrating on the promising candidates and improving them by tuning their parameters through statistical analysis. We expected gradient tree boosting, as implemented in the xgboost Python module, to be the best model since it is an ensemble method. In the end the xgboost classifier did perform best, with an area under the ROC curve of 0.838. For comparison, the best score on the Kaggle leaderboard was 0.842 as of 3/14/2016. Unfortunately, our score puts us only in the top 800, so there is room for improvement.

After tuning our XGBoost model and running it on three datasets with randomly sampled variables, we were able to improve our score on Kaggle to 0.83905, which put us in the top 400 as of 3/10/2016. The competition is open until May 2nd and we are still working on improving our model; here we present our best solution so far. Going forward we will use more sophisticated sampling methods and a mix of classifiers in addition to XGBoost, such as AdaBoost and ExtraTreesClassifier, to make sure there is no overfitting on the training set. We are also exploring the use of neural nets. However, we believe that significant improvements in the ROC AUC score will require a focus on feature engineering.

4 Reproducibility

All code used for this project is provided in the appendix. Additionally, you can find the IPython notebook version of this write-up at https://goo.gl/654D3o . The online version has helpful links.

5 Appendix

5.1 Data

First, we read in data.

In [ ]: import pandas as pd

train = pd.read_csv("train.csv")

test = pd.read_csv("test.csv")

In [ ]: train.iloc[:,0:5].head()

In [3]: train.iloc[:,0:5].describe()

Out[3]:
                   ID            var3         var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1
count    76020.000000    76020.000000  76020.000000        76020.000000             76020.000000
mean     75964.050723    -1523.199277     33.212865           86.208265                72.363067
std      43781.947379    39033.462364     12.956486         1614.757313               339.315831
min          1.000000  -999999.000000      5.000000            0.000000                 0.000000
25%      38104.750000        2.000000     23.000000            0.000000                 0.000000
50%      76043.000000        2.000000     28.000000            0.000000                 0.000000
75%     113748.750000        2.000000     40.000000            0.000000                 0.000000
max     151838.000000      238.000000    105.000000       210000.000000             12888.030000

In [121]: # Number of rows and columns including dependent variable TARGET

train.shape

Out[121]:

...

...
