Credit Card Fraud Detection Using Unsupervised Learning

Sumeet Sawant · Published in The Startup · Feb 5, 2021

If machine learning had been an engineering subject during my undergraduate days, unsupervised learning would have been the chapter many of my engineering friends kept as an optional read, focusing most of their attention on its more glamorous brother, supervised learning.

Pick up any general machine learning textbook and you will almost always find unsupervised learning relegated to the last chapter. Dig deeper, though, and unsupervised learning applies to a number of important tasks such as manufacturing defect detection, labelling unlabeled samples, catching outliers in a dataset, and fraud detection in bank transactions, which is what I am going to discuss here.

The aim of this post is to note down my thought process for tackling this problem in an unsupervised way.

Dataset Introduction

Notebook and dataset source: Kaggle.com

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features not transformed with PCA are ‘Time’ and ‘Amount’. ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. ‘Amount’ is the transaction amount, which can be used for example-dependent cost-sensitive learning. ‘Class’ is the response variable and takes value 1 in case of fraud and 0 otherwise.

Shape of the dataset: (284807, 31)

There are no null values in the dataset.

Approaches Available

There are three possible approaches to this problem:

1. Treating it as a supervised learning problem: fitting various classification models on the labeled dataset.

2. Treating it as a semi-supervised learning problem: where you model only the correct data, and any data point outside this domain is an anomaly or fraud.

3. Treating it as an unsupervised learning problem: where you try to isolate the fraud points from proper transactions without using labels.

I am planning to tackle this dataset via the third approach, detecting the fraud points using the two unsupervised learning algorithms described below.

1. Isolation Forest: This method is found in the sklearn.ensemble module of scikit-learn. The algorithm builds a forest of randomly grown decision trees. At each node it picks a feature at random, then picks a random threshold value (between the feature's min and max) to split the dataset in two, so the dataset gradually gets chopped into pieces. By definition an anomaly is a point far away from the normal points in the dataset, so anomalies tend to get isolated closer to the root of a tree.

The expected percentage of anomaly points should be specified before running Isolation Forest, via the contamination parameter.
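As a rough sketch of how this looks in scikit-learn (the toy data and the contamination value here are illustrative assumptions, not the notebook's actual inputs):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(1000, 2)                       # stand-in for the transaction features

iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(X)                  # -1 = anomaly, 1 = normal

print((labels == -1).sum(), "points flagged as anomalies")
```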

2. Gaussian Mixture: This method is found in the sklearn.mixture module of scikit-learn. A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians. Given the number of clusters, the model essentially solves for the means and covariance matrices of those clusters, assuming each is normally distributed.
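A minimal sketch of fitting a Gaussian mixture with scikit-learn, assuming synthetic two-cluster data and an illustrative choice of n_components:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(500, 2), rng.randn(500, 2) + 5])   # two synthetic clusters

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(X)

print(gmm.means_)         # estimated cluster centers
print(gmm.covariances_)   # estimated covariance matrices
```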

I tried reducing the dataset to two and three dimensions using PCA in order to visualize the fraud points, but even in three dimensions there is no clear distinction between the classes. Thus the only way to catch the majority of the fraud transactions is to increase the contamination factor or threshold value that we specify. This results in more false positives (normal transactions getting flagged as fraud), but in this case a false positive is better than a false negative (a fraud transaction not getting identified).

With this in mind I selected the threshold value as 10%, even though only 0.172% of the points in the dataset are fraudulent.

Figure: 2D PCA projection, showing no clear distinction between the classes.
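For reference, the 2-D projection above can be produced roughly as follows; the local file path is an assumption, while the column names follow the Kaggle dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

df = pd.read_csv('creditcard.csv')           # assumed local copy of the Kaggle file
X = df.drop(columns=['Class'])
y = df['Class']

X2 = PCA(n_components=2).fit_transform(X)
plt.scatter(X2[y == 0, 0], X2[y == 0, 1], s=2, label='normal')
plt.scatter(X2[y == 1, 0], X2[y == 1, 1], s=2, label='fraud')
plt.legend()
plt.show()
```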

Model Fitting

The data was normalized using RobustScaler from the sklearn.preprocessing module. RobustScaler was chosen because the dataset has many outlier values, which would have skewed the result with StandardScaler.

MinMaxScaler was also tried, but RobustScaler gave a higher explained_variance_ratio_ along the first two dimensions during PCA, hence I went with it; a rough sketch of that comparison is shown below. The parameters selected for the two models are described in the following sections.
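A rough sketch of that scaler comparison, using heavy-tailed synthetic data as a stand-in for the real features:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 5))            # heavy-tailed stand-in for the real features

for scaler in (RobustScaler(), MinMaxScaler()):
    X_scaled = scaler.fit_transform(X)
    pca = PCA(n_components=2).fit(X_scaled)
    # sum of explained variance across the first two principal components
    print(type(scaler).__name__, pca.explained_variance_ratio_.sum())
```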

Gaussian Mixture Model

For the Gaussian mixture model we need to specify a threshold value. The model computes the log of the probability density function for each instance, and any instance for which this value falls below our threshold is marked as an anomaly.
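A sketch of this thresholding step, assuming synthetic data and using the 10th percentile of the log-density to mirror the 10% contamination choice above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
X = rng.randn(1000, 2)                       # stand-in for the scaled features

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
log_dens = gmm.score_samples(X)              # log probability density per instance

threshold = np.percentile(log_dens, 10)      # bottom 10% treated as anomalies
anomalies = log_dens < threshold
print(anomalies.sum(), "instances marked as anomalies")
```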

Isolation Forest Model

Most parameters are self-explanatory. Different values of n_estimators and max_samples were tried.
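A sketch of that parameter exploration; these particular n_estimators and max_samples values are plausible guesses, not necessarily the ones used in the notebook:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.randn(1000, 2)                       # stand-in for the scaled features

for n_estimators, max_samples in [(100, 256), (200, 512)]:
    iso = IsolationForest(n_estimators=n_estimators, max_samples=max_samples,
                          contamination=0.1, random_state=42).fit(X)
    scores = -iso.decision_function(X)       # higher score = more anomalous
    print(n_estimators, max_samples, scores.max())
```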

Result Discussion

To quantify how my models are doing I selected the ROC-AUC and AUCPR metrics. For an imbalanced dataset we should consider the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (AUCPR). For ROC-AUC the baseline is 0.5, which corresponds to random guessing, and a perfect model scores 1. Similarly, for AUCPR a perfect model scores 1 and the baseline is the relative frequency of the positive class, so for this dataset 0.00172.
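Both metrics are available in scikit-learn (average_precision_score is a common stand-in for AUCPR); the labels and scores below are placeholders for the real ones:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.RandomState(0)
y_true = np.zeros(1000, dtype=int)
y_true[:5] = 1                               # a handful of positives, like fraud labels
scores = rng.rand(1000)                      # stand-in anomaly scores (higher = more anomalous)

print("ROC-AUC:", roc_auc_score(y_true, scores))
print("AUCPR  :", average_precision_score(y_true, scores))
```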

For the Gaussian mixture model, the ROC-AUC was 0.94 and the AUCPR was 0.052.

For Isolation Forest, the ROC-AUC was 0.95 and the AUCPR was 0.23.

Conclusion

I hope this post has shed some light on the anomaly detection problem. My pick of the two would be Isolation Forest, as it performed better and was also faster to train. There are other methods like DBSCAN and one-class SVM as well, if someone is looking to explore more.

If you liked this post, the detailed notebook can be found on Kaggle. Please consider leaving an upvote and a comment if you like the work.

LinkedIn : https://www.linkedin.com/in/sawantsumeet/
