Arvato Financial Solutions has been helping other companies navigate the complexity of credit management since 1961. Today, around 7,000 experts deliver credit management solutions in about 15 countries.
For the Capstone Project of the Udacity Data Science Nanodegree, Arvato Financial Solutions kindly made data from a real problem available so that students could apply the theory they learned to a practical case.
In this Capstone Project, the goal was to analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population.
Four main parts have to be performed to accomplish this goal:
- Get to know the data (data wrangling and some exploratory data analysis)
- Customer Segmentation Report (using unsupervised learning techniques)
- Build a supervised learning model
- Kaggle Competition
There were four data files associated with this project:
Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).
Below we can see the first five rows of the Udacity_AZDIAS and Udacity_CUSTOMERS data.
The azdias and customers dfs have a similar structure, both with more than 360 columns. The customers df has 3 extra columns, ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’ and ‘PRODUCT_GROUP’, which provide broad information about the customers.
There are 891 221 rows in the general population data frame and 191 652 in the customers data frame, so there are 699 569 potential new clients! Another thing we can see is that there are a lot of NaN values in both data frames.
As expected in a real-life problem, for both dfs there are a lot of columns containing NaN values. For the Customers df, 75% of the columns have less than 27% of the values NaN and for the General df, 75% have less than 12%.
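The per-column NaN share described above can be computed in a couple of lines. This is only an illustrative sketch on a toy data frame (the column names and values are made up), not the exact project code:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data frames (values are made up for illustration).
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0],
    "B": [np.nan, np.nan, np.nan, 1.0],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing values in each column.
nan_fraction = df.isna().mean()

# Third quartile of those fractions: 75% of the columns are at or below it.
q75 = nan_fraction.quantile(0.75)

print(nan_fraction)
print(q75)
```

Running the same two lines on each data frame gives the kind of "75% of the columns have less than x% NaN" statement used here.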
The histogram shows that only a few columns have more than 40% NaN values: ALTER_KIND4, ALTER_KIND3, ALTER_KIND2, ALTER_KIND1, KK_KUNDENTYP and EXTSEL992. Of those, only KK_KUNDENTYP has a description, so the others were dropped.
Since I did not have a robust environment to run more complex techniques, I simply replaced the remaining NaN values with -1, which was the default value for unknown values in the data frame.
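A minimal sketch of this cleaning step could look like the following. The toy values, the `SOME_FEATURE` column and the `keep_anyway` set are assumptions made for illustration; only the 40% threshold, the -1 fill value and the decision to keep KK_KUNDENTYP come from the text:

```python
import numpy as np
import pandas as pd

# Toy data frame; SOME_FEATURE is a hypothetical column name.
df = pd.DataFrame({
    "ALTER_KIND1": [np.nan, np.nan, np.nan, 5.0],   # 75% NaN, no description
    "KK_KUNDENTYP": [np.nan, np.nan, 1.0, 2.0],     # 50% NaN, but documented
    "SOME_FEATURE": [1.0, 2.0, np.nan, 1.0],        # 25% NaN
})

threshold = 0.40                 # drop columns with more than 40% NaN...
keep_anyway = {"KK_KUNDENTYP"}   # ...unless the column has a description

nan_fraction = df.isna().mean()
to_drop = [
    col for col in df.columns
    if nan_fraction[col] > threshold and col not in keep_anyway
]

# Drop the sparse undocumented columns, then mark remaining NaN as unknown (-1).
cleaned = df.drop(columns=to_drop).fillna(-1)
print(to_drop)
print(cleaned)
```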
Comparing the General Population against the Customers
How does the general population differ from the customers?
The customers are mostly men with high income and older than 60, whereas the general population is mostly women with average to very low income and aged 46 to 60.
Scaling and PCA
Since some columns are of type object, I used LabelEncoder to transform all the objects to int. LabelEncoder maps each unique object to an int ranging from 0 to n - 1, where n is the number of unique objects.
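As a small illustration of how the encoding behaves, here is a sketch on a made-up list of string values (not taken from the project data):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical object-typed column with two distinct values.
values = ["W", "O", "W", "O", "W"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the distinct values in sorted order; each value is
# replaced by its position in that array.
print(encoder.classes_)
print(codes)
```

Note that the integer codes follow the sorted order of the values, so they carry no meaning beyond being distinct identifiers.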
With both data frames fully numeric, it’s time to scale the data. I chose MinMaxScaler, which rescales each feature to values between zero and one.
This step is important to make the algorithm converge faster and to prevent features with larger scales from implicitly carrying more weight.
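The effect of min-max scaling is easy to see on a tiny made-up matrix whose two columns live on very different scales (a sketch, not the project data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

# Each column is mapped to [0, 1] via (x - min) / (max - min).
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```

After scaling, both columns span the same range, so neither dominates a distance-based algorithm such as K-means.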
Since the data has a lot of features (366), I used Principal Component Analysis (PCA) to reduce its dimensionality, replacing the original features with a smaller set of components that capture most of the variance.
I fit the PCA with its n_components parameter set to 0.9, so that the resulting components explain 90% of the variance in the data.
PCA reduced the number of features by 72%, from 366 to 103.
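The behaviour of a fractional `n_components` can be sketched on synthetic data. Everything about the data below (20 features driven by 3 hidden factors) is an assumption chosen only to make the reduction visible:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 20 observed features driven by 3 hidden factors plus noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

# A float in (0, 1) asks PCA to keep just enough components to explain
# at least that fraction of the variance.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X.shape[1], "->", X_reduced.shape[1], "components")
print(explained)
```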
The Unsupervised Learning algorithm chosen was K-means.
The most important hyperparameter of K-means is the value of K. The two most common approaches to choosing it are the elbow method and the silhouette score.
I used the elbow method, computing the within-cluster sum of squares (WSS) for k ranging from 1 to 10, as shown in the plot below.
There is an elbow at k = 2, but looking at the individual WSS values I saw that the WSS still decreases by almost 35% from k = 2 to k = 10, so I set K = 10 and trained the K-means model.
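The elbow computation can be sketched as follows. The three synthetic blobs stand in for the PCA-reduced data, and `n_init` and `random_state` are arbitrary choices for this example; scikit-learn exposes the WSS of a fitted model as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in for the reduced data: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Within-cluster sum of squares (WSS) for k = 1..10.
wss = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = model.inertia_

# Relative WSS decrease between k = 2 and k = 10, the quantity
# used in the text to justify a larger K.
drop_2_to_10 = 1 - wss[10] / wss[2]
print(wss)
print(drop_2_to_10)
```

Plotting `wss` against k gives the elbow plot; the relative drop makes the "how much is still gained after the elbow" argument explicit.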
The result of the K means model was that a cluster ranging from 0 to 9 was assigned to each row. Having this information, I plotted the proportion of people in each cluster for the general and customers data to find parts of the general population that are more likely to be part of the mail-order company’s main customer base, and which parts of the general population are less so.
Looking at this plot, we can conclude that the parts of the general population closest to the customers are those in clusters 7, 2, and 4, and those less likely to become clients are the ones in clusters 0, 6, and 2.
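The comparison behind such a plot amounts to computing the share of each cluster in both groups. Here is a sketch with made-up cluster labels; the `ratio` column is an extra convenience added for this example and is not part of the original analysis:

```python
import pandas as pd

# Hypothetical cluster assignments (three clusters for brevity).
general_labels = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
customer_labels = pd.Series([0, 1, 1, 2, 2, 2, 2, 2])

# Share of each cluster within the general population and the customers.
proportions = pd.DataFrame({
    "general": general_labels.value_counts(normalize=True),
    "customers": customer_labels.value_counts(normalize=True),
}).fillna(0).sort_index()

# Ratio > 1: the cluster is over-represented among customers.
proportions["ratio"] = proportions["customers"] / proportions["general"]
print(proportions)
```

Clusters with a ratio above 1 are over-represented among customers and are the natural targets for a campaign.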
So I took all the people in cluster 4 and plotted the same 3 graphs as above to see whether the distributions change.
Looking at the plots, we can conclude that most people in cluster 4 are men with average income and older than 60.
The goal of this section is to build a model that uses each individual’s demographic information to predict whether it is worth including that person in the campaign.
Udacity_MAILOUT_052018_TRAIN will be used to train the models and Udacity_MAILOUT_052018_TEST to test them.
The training data has the same columns as the general population data plus the target column RESPONSE, a binary variable equal to 1 if that row is a customer and 0 otherwise.
However, only 532 rows (1.2%) have a RESPONSE of 1, so this is a highly imbalanced problem. Accuracy should therefore not be used to measure the model: a model that always predicts 0 would reach 98.8% accuracy on the training data.
The metric that will be used is the Area Under the ROC Curve (AUC). The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) varying the threshold.
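To make the accuracy argument concrete, here is a sketch on synthetic labels with the same 1.2% positive rate. The scores are random numbers invented for the example, not model outputs:

```python
import numpy as np
from sklearn.metrics import accuracy_score, auc, roc_auc_score, roc_curve

# 1000 synthetic rows with 1.2% positives, mirroring the class balance above.
y_true = np.zeros(1000, dtype=int)
y_true[:12] = 1

# A "model" that always predicts 0 looks excellent on accuracy.
always_zero = np.zeros(1000, dtype=int)
accuracy = accuracy_score(y_true, always_zero)

# Made-up scores that are somewhat higher for the positives.
rng = np.random.default_rng(0)
scores = rng.random(1000) + 0.5 * y_true

# The ROC curve is TPR against FPR over all thresholds; AUC is its area.
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

print(accuracy)
print(area, roc_auc_score(y_true, scores))
```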
Data wrangling and cleaning
The first step of this process is again data wrangling, starting with some columns that had numeric values mixed with strings. The columns CAMEO_DEUG_2015 and CAMEO_INTL_2015 contained the values X and XX, which were replaced with -1.
The distribution of NaN values in the training data is similar to that of the general data frame, as can be seen in the plot below.
Dealing with those NaN values is an important step that can influence the model outcome, so several approaches will be tested:
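A sketch of this fix on a toy data frame (the values are invented; only the column names and the X / XX placeholders come from the data):

```python
import pandas as pd

# Toy rows mixing numbers, numeric strings and the placeholder codes.
df = pd.DataFrame({
    "CAMEO_DEUG_2015": ["1", "X", "8", 3],
    "CAMEO_INTL_2015": ["51", "XX", 24, "13"],
})

for col in ["CAMEO_DEUG_2015", "CAMEO_INTL_2015"]:
    # Map the placeholder codes to -1 (unknown), then convert to numbers.
    df[col] = pd.to_numeric(df[col].replace({"X": -1, "XX": -1}))

print(df)
print(df.dtypes)
```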
- Dropping or not the columns ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’, ‘ALTER_KIND1’ that have more than 40% of NaN values and don’t have descriptions.
- Replacing the NaN values with -1.
- Replacing the NaN values with the statistical mode of each column.
- Using KNN to impute the NaN values.
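The three imputation options listed above can be sketched with scikit-learn's imputers. The matrix below is made up, and `n_neighbors=2` is an arbitrary choice for this example:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Small made-up matrix with two missing entries.
X = np.array([
    [1.0, 2.0, 1.0],
    [1.0, np.nan, 1.0],
    [3.0, 6.0, 3.0],
    [np.nan, 6.0, 3.0],
])

# Option 1: replace NaN with the constant -1 ("unknown").
constant = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(X)

# Option 2: replace NaN with the most frequent value (mode) of each column.
mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

# Option 3: replace NaN with the average of the nearest rows' values.
knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(constant)
print(mode)
print(knn)
```

KNN imputation has to compare rows with each other, which is why it is far more expensive than the other two options on data of this size.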
LabelEncoder was used to transform the object columns into numeric ones, and RobustScaler, which scales features using statistics that are robust to outliers, was used to scale the data.
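To show what "robust to outliers" means in practice, here is a sketch on a single made-up feature with one extreme value. With its default settings, RobustScaler centers on the median and divides by the interquartile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single extreme outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the interquartile range is 4 - 2 = 2, so each value
# becomes (x - 3) / 2; the outlier does not distort the other values.
scaled = RobustScaler().fit_transform(X)
print(scaled.ravel())
```

With min-max scaling, the same outlier would squeeze the four ordinary values into a tiny interval near zero.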
Since this is an imbalanced problem, I chose algorithms that are commonly used in that scenario: XGBoost Regressor, XGBoost Classifier, IsolationForest, and AdaBoostRegressor.
I also tested using PCA in the training process and using sampling techniques to balance the data.
I started with the Classifier version of XGBoost, which had an AUC of only 0.49. That was expected: its predict method returns only hard 0 or 1 labels, so varying the threshold has no effect.
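The difference between scoring hard labels and scoring continuous outputs can be shown without any model at all. The numbers below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Ten made-up individuals, two of them responders.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Continuous scores: low overall, but higher for the responders.
scores = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.30, 0.20, 0.40])

# Hard labels at the usual 0.5 cut-off: every row becomes 0.
hard_labels = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)
auc_from_scores = roc_auc_score(y_true, scores)
print(auc_from_labels)
print(auc_from_scores)
```

The ranking information that AUC rewards is lost once the scores are cut to 0 and 1, which is why continuous outputs (from a regressor, or from a classifier's predicted probabilities) are needed for this metric.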
The second trial used the XGB Regressor, which gave a better AUC of 0.62.
I then tuned its hyperparameters with RandomizedSearchCV, which gave a great result of 0.7904.
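A random search of this kind can be sketched as below. To keep the example dependent on scikit-learn only, `GradientBoostingRegressor` is swapped in for the XGB Regressor; the synthetic data, the parameter grid, `n_iter` and the fold setup are all assumptions for this sketch and not the values used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the MAILOUT training set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9], random_state=0
)

# Hypothetical search space; the real one is not given in the text.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    # Score the regressor's continuous predictions with AUC.
    scoring=make_scorer(roc_auc_score),
    # Stratified folds keep some positives in every validation split.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```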
To try to improve on this value, I added PCA in the hope of reducing overfitting, but the AUC with PCA was only 0.53.
Since the XGB Regressor was already using the best hyperparameters found, I tried to improve the score by changing the data instead.
The first step was to make the data more balanced using sampling techniques, so I tried random oversampling, SMOTE + random under-sampling, and ADASYN + random under-sampling.
But all trials gave an AUC that was lower than 0.7904.
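For reference, the simplest of these techniques, random oversampling, can be sketched with plain scikit-learn utilities on made-up data. SMOTE and ADASYN, which synthesize new minority samples rather than duplicating existing ones, are not reproduced here:

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced data: 17 negatives and 3 positives.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 17 + [1] * 3)

X_majority, X_minority = X[y == 0], X[y == 1]

# Draw minority rows with replacement until both classes have equal size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_up), dtype=int),
])
print(np.bincount(y_balanced))
```

Resampling should only ever be applied to the training portion of the data, never to the rows used for evaluation.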
The second step was to change how I dealt with NaN values. First, I did not drop any column; second, instead of replacing the NaN values with -1, I replaced them with the statistical mode of each column; and third, I tried KNNImputer.
Unfortunately, the Udacity environment was not robust enough to run KNNImputer, but with only the other 2 changes I got the best AUC of 0.79277.
Next I tried the Ada Boost Regressor and a stacked model combining the XGB Regressor with Ada Boost. The Ada Boost Regressor had an AUC of 0.768 and the stacked model an AUC of 0.789.
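One way to stack two regressors is scikit-learn's `StackingRegressor`, sketched below. This is an assumption about how such a stack could be built, not the project's actual setup: `GradientBoostingRegressor` again stands in for the XGB Regressor, the data is synthetic, and the default final estimator is used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the MAILOUT training set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The base models' out-of-fold predictions feed a final estimator.
stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=0)),
        ("ada", AdaBoostRegressor(n_estimators=50, random_state=0)),
    ],
    cv=3,
)
stack.fit(X_train, y_train)

stacked_auc = roc_auc_score(y_test, stack.predict(X_test))
print(stacked_auc)
```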
All the trials scores can be seen in the table below.
This was a really interesting real-life problem that made me apply almost all the concepts I learned in the Nanodegree.
- I had to deal with a data frame containing 366 features, with a lot of NaN values and columns mixing objects and integers, so I had to do a lot of data cleaning and data wrangling.
- In the unsupervised learning part, I had to transform all the object columns into numeric ones so I could scale them and fit a K-means model, using the K with the lowest WSS in the tested range, to find the part of the general population that is closest to the customers. This can probably help the company decrease its customer acquisition cost.
- Finally, the most fun part was trying different supervised learning models to get the best score in the Kaggle competition. The winner was the XGBoost Regressor with the hyperparameters found using random search and without dropping any column of the data.
Next Steps and Considerations
Some actions can be taken to try to improve the results. Other clustering techniques such as DBSCAN can be used and their results compared with the K-means approach. For the supervised learning part, a random search over the parameters of the stacked model could improve the results, and using KNNImputer to deal with the NaN values could also improve the AUC.
To accomplish those results, I’ve read many tutorials, articles and documentation from https://machinelearningmastery.com, https://www.kaggle.com and of course from https://stackoverflow.com to get insights.
All the code used in this project can be found in this repository.