Insurance Quote Conversion: A Binary Classification Problem
Which customers will purchase a quoted insurance plan?
This is a past Kaggle competition in which participants were challenged to predict whether a customer we are about to reach out to will accept or reject a quote from a tele-marketer. Although the dataset in this case was provided by an insurance company, this is a generic problem applicable to many industry domains. If we had a crystal ball and knew exactly who our future customers were going to be, it would be easy to allocate marketing budget and strategy through targeted offers. Unfortunately, the real world is messy and we need to find our way.
This is where machine learning models can help and guide us. In this article I will share my way of handling this problem by training a machine learning model on the given data, and I will try to explain the reasoning behind each decision.
Predicting between two choices is usually called a supervised binary classification problem in the world of machine learning: we need to predict between two outcomes, positive and negative (in this case, customer converted or customer not converted).
THE DATASET AND EVALUATION METRIC
The dataset provided was a huge one: 260K samples, each with 299 features. To add to the challenge, every feature was anonymized, which prevented any meaningful feature engineering. From the problem statement I realized that this model might be deployed in near real time, where predictions are needed quickly, so I decided to build a light model with a minimum number of features and the maximum possible accuracy. The dataset was also unbalanced (people who did not purchase the quote outnumbered people who did), so I chose the ROC-AUC score as my validation metric.
The ROC-AUC is a performance measurement for classification problems across various threshold settings. ROC is a probability curve and AUC represents the degree of separability: it tells how capable the model is of distinguishing between the classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. A random model has a ROC-AUC score of 0.5, and our target should be to push our model's ROC-AUC as close to 1 as possible.
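A minimal sketch of the two reference points mentioned above, using scikit-learn's `roc_auc_score` on synthetic labels (the data here is made up purely for illustration): random scores land near 0.5, while scores that separate the classes push the AUC toward 1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # synthetic binary labels

# Scores with no relation to the label: AUC hovers around 0.5.
random_scores = rng.random(1000)
print(roc_auc_score(y_true, random_scores))

# Scores that fully separate the classes (positives in [0.6, 1.0),
# negatives in [0, 0.4)): AUC reaches 1.0.
informative_scores = y_true * 0.6 + rng.random(1000) * 0.4
print(roc_auc_score(y_true, informative_scores))  # → 1.0
```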
FEATURE SELECTION
This is an important step in developing any machine learning model. Real-world data is messy and noisy, and if we train a model on such noisy data we will most likely not get a very good model, so it is important to filter the noise out of the data. Also, since my aim was a model with quick inference, it was important to extract maximum information from a minimum of features. From my exploratory data analysis I saw that a few columns carried a lot of null values (close to 50% were null), so I decided to drop them. A few columns had just one single value, so I dropped those as well.
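The pruning described above can be sketched with pandas. The toy frame, column names, and the 50% threshold here are illustrative stand-ins, not the competition's actual columns:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the anonymized competition data.
df = pd.DataFrame({
    "mostly_null": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "constant":    [7, 7, 7, 7],                    # single unique value
    "useful":      [1, 2, 3, 4],
    "target":      [0, 1, 0, 1],
})

# Drop columns where more than half the values are missing.
null_heavy = df.columns[df.isna().mean() > 0.5]
df = df.drop(columns=null_heavy)

# Drop columns that carry a single unique value (no information).
single_valued = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
df = df.drop(columns=single_valued)

print(list(df.columns))  # → ['useful', 'target']
```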
After this I proceeded to implement the feature selection strategies below. I decided to keep the 50 best features, plus the target column, from the given 299 features.
- Mutual information: Mutual information (MI) between two random variables is a non-negative value, which measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
- Recursive Feature Elimination: Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained, either through a specific attribute or a callable. Then, the least important features are pruned from the current set of features. That procedure is repeated recursively on the pruned set until the desired number of features is eventually reached.
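Both strategies are available in scikit-learn. A minimal sketch on synthetic data (the real run kept 50 of 299 features; here the sizes are scaled down, and logistic regression stands in as the RFE estimator):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the anonymized data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
K = 5  # number of features to keep (50 in the article)

# Strategy 1: rank features by mutual information with the target.
mi_selector = SelectKBest(mutual_info_classif, k=K).fit(X, y)
mi_features = np.flatnonzero(mi_selector.get_support())

# Strategy 2: recursive feature elimination with a linear estimator.
rfe_selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=K).fit(X, y)
rfe_features = np.flatnonzero(rfe_selector.get_support())

print("MI picks: ", mi_features)
print("RFE picks:", rfe_features)
```

The two rankings need not agree, which is exactly why it is worth training a model on each selection and comparing validation scores, as done below.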
After going through the narrowed-down columns I decided to treat them as categorical, since all values in them were discrete and the number of unique values per column varied from 2 to 26.
Model Building
Given the nature of the problem, I decided to try logistic regression as my baseline model and one of the boosted-tree algorithms to beat the baseline score. Of the three famous boosting implementations, I decided to use LightGBM, as it is faster to train and memory efficient at inference.
For logistic regression I had to convert my 50 features into one-hot encodings, which created a sparse matrix of close to 900 columns that took a long time to train and would also be slow at prediction time. Even with such a simple and inefficient model I was able to get a mean ROC-AUC score of 0.95 using 5-fold cross-validation. Since the dataset was not balanced, all 5 folds were stratified to preserve the class ratio.
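The baseline can be sketched as a one-hot + logistic regression pipeline scored with stratified 5-fold cross-validation. The discrete synthetic data below is a stand-in for the 50 selected columns, so the printed score is illustrative, not the article's 0.95:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Discrete integer codes, mimicking the categorical-looking features.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(2000, 8))
y = (X[:, 0] + rng.integers(0, 4, 2000) > 7).astype(int)  # unbalanced target

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # expands codes into sparse columns
    LogisticRegression(max_iter=1000),
)

# Stratified folds keep the class ratio constant across the 5 splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```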
Gradient boosting methods have a lot of parameters to tune which, if not handled carefully, can overfit the dataset and give us a false picture of the model's generalization. Hence I decided to split the data into train and validation sets. I used Optuna to tune the hyperparameters of my LightGBM model. Optuna is an automatic hyperparameter optimization framework, particularly designed for machine learning. It features an imperative, define-by-run style API: code written with Optuna enjoys high modularity, and the user can dynamically construct the search space for the hyperparameters. It also lets us set criteria to kill unpromising optimization runs after a few epochs, which allows us to search a larger hyperparameter space.
Running and optimizing LightGBM on the features obtained through mutual information gave me a ROC-AUC score of 0.93, while the Recursive Feature Elimination (RFE) features gave me a ROC-AUC score of 0.9641. Since I was able to beat my logistic regression baseline, I decided to use the LightGBM model with the RFE-selected features on the hidden test set.
I implemented a scikit-learn pipeline with the ability to handle feature values that were not encountered in the training set. This prevents the model from breaking in production if an unknown value is encountered. The image above shows the code that splits the data into training and validation sets, passes the training data through the pipeline, and then predicts on the validation hold-out dataset.
I was able to get a private test set score of 0.962, which was very close to my train set score of 0.9641.
Points to improve the model
There was a date column which I decided to drop. In my next iteration I plan to create some additional features out of it and fit the same LightGBM model to see if that increases my score.
The final model implementation: Link to my Notebook
Notebook showing the EDA and optimization for the above code: will be added to this post soon.
Please like the post and upvote the notebook if you found it useful.
Follow me on : LinkedIn