Extreme Gradient Boosting
18 Jun
Predictive analytics provides mechanisms to estimate values that are uncertain or costly to obtain directly. Organizations use it to help solve difficult problems; one of its most common applications is the detection of fraud, which is increasing every day.
Predictive analytics involves applying statistical analysis techniques and machine learning algorithms to data sets in order to build models that estimate the probability of a particular event occurring.
This analytical study uses binary classification models, which predict which of two categories, fraud or non-fraud, a new instance belongs to.
Over the last few years, many machine learning classification algorithms have been developed: Decision Tree, Logistic Regression and K Nearest Neighbors are just a few of those used to detect all types of fraud.
How are predictive models evaluated?
Predictive models are evaluated by measuring their performance. The data are split into two sets: a training set, used to fit the model, and a test set which, as its name indicates, is used to evaluate it.
Once the model is trained, a confusion matrix is generated. It tabulates the results obtained when the model is evaluated on the test set, giving a visual representation of the errors and successes of our model: how good or how bad it is.
| | Fraud (predicted) | No fraud (predicted) |
|---|---|---|
| Fraud (real) | True positives (TP) | False negatives (FN) |
| No fraud (real) | False positives (FP) | True negatives (TN) |

Table 1. Confusion matrix for fraud detection.
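As a minimal sketch, the matrix in Table 1 can be computed with scikit-learn's `confusion_matrix`; the real and predicted labels below are invented example data (1 = fraud, 0 = no fraud), not results from our study.

```python
# Sketch: building the confusion matrix of Table 1 with scikit-learn.
# The label vectors are invented example data (1 = fraud, 0 = no fraud).
from sklearn.metrics import confusion_matrix

y_real = [1, 0, 1, 1, 0, 0, 0, 1]  # actual classes
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]  # classes predicted by the model

# With labels=[1, 0] the matrix follows the layout of Table 1:
# rows = real (fraud, no fraud), columns = predicted (fraud, no fraud).
cm = confusion_matrix(y_real, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(tp, fn, fp, tn)  # → 3 1 1 3
```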
The confusion matrix yields two pairs of metrics: precision and recall, and sensitivity and specificity. The first pair tells us how reliable the fraud predictions are and how close they come to our actual data. The second pair tells us about the capacity of our model to distinguish the positive cases, which it has caught, from the negative cases, which it has correctly ruled out.
If we focus on the first pair, we can combine precision and recall into a single numerical value that measures how good our model is: the harmonic mean of the two, known as the f-score.
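As a small illustration of these metrics, here is how precision, recall and the f-score follow from the confusion-matrix counts; the TP, FP and FN values below are invented numbers, not taken from our study.

```python
# Precision, recall and f-score from confusion-matrix counts.
# The counts below are invented example numbers.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # how many predicted frauds are real frauds
recall = tp / (tp + fn)     # how many real frauds were detected

# The f-score is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))  # → 0.8 0.667 0.727
```

Because it is a harmonic mean, the f-score only approaches 1 when both precision and recall are high.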
The f-score value
One of the uses of the f-score is to compare different classification algorithms on the same data set: the higher the value, the better the performance of the model and, therefore, the greater its predictive capacity.
There are many studies comparing different binary classification algorithms based on the performance of each model. The Extreme Gradient Boosting (XGB) algorithm is recognized as having exceptional predictive capabilities.
At PhishingHunters we have conducted a study on the detection of fraudulent credit card transactions. In this project, several predictive algorithms were used to see how accurate each is at determining whether a transaction is a normal payment or a fraud. Table 2 shows the fraud-class f-score of the seven Machine Learning algorithms used in the project. For Extreme Gradient Boosting, the f-score of the non-fraud class is also very close to 1, which gives it a higher predictive value relative to the rest of the algorithms used.
| Machine Learning Algorithm | F-score (fraud) |
|---|---|
| Decision Tree | 0.77 |
| Logistic Regression | 0.74 |
| K Nearest Neighbors | 0.83 |
| Support Vector Machine | 0.84 |
| Random Forest | 0.85 |
| Extreme Gradient Boosting | 0.86 |
| Artificial Neural Networks | 0.82 |

Table 2. Fraud-class f-score for each of the Machine Learning algorithms.
As you can see in Table 2, the Extreme Gradient Boosting algorithm is the one with the highest f-score value and, therefore, it is the one with the best predictive capacity.
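To give an idea of how such a comparison is run, here is a minimal, self-contained sketch. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in (XGBoost's `XGBClassifier` exposes a very similar `fit`/`predict` API) and a synthetic imbalanced data set rather than real card transactions, so its score will not match Table 2.

```python
# Sketch: training a gradient-boosted classifier and scoring it with the
# fraud-class f-score. Synthetic data, not real card transactions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data set: roughly 5% "fraud" (class 1).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# f-score of the fraud class (pos_label=1), as reported in Table 2.
score = f1_score(y_test, model.predict(X_test), pos_label=1)
print(round(score, 3))
```

The same loop, repeated over several classifiers on the same train/test split, produces a comparison table like Table 2.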
We can set two interesting figures side by side in a table: in one column, the percentage of fraud detected, that is, how many of the real fraud cases were caught; in the adjacent column, the probability that a flagged case is actually fraud, that is, with what certainty we say that a case is fraudulent or not.
| Machine Learning Algorithm | Percentage of fraud detected (%) | Probability of fraud (%) |
|---|---|---|
| Decision Tree | 76.20 | 77.78 |
| Logistic Regression | 66.67 | 83.76 |
| K Nearest Neighbors | 74.15 | 93.16 |
| Support Vector Machine | 74.15 | 96.46 |
| Random Forest | 78.23 | 93.50 |
| Extreme Gradient Boosting | 78.23 | 94.26 |
| Artificial Neural Networks | 76.87 | 88.28 |

Table 3. Percentage of fraud detected and probability of being fraud for each of the algorithms.
Table 3 shows that Random Forest and Extreme Gradient Boosting detect the highest percentage of fraud, but of the two, the frauds flagged by Extreme Gradient Boosting are more likely to actually be fraud. It can also be seen that the cases flagged by the Support Vector Machine are more likely to be fraud than those of any other algorithm.
Comparing the Support Vector Machine with Extreme Gradient Boosting, it should be noted that detecting more fraud matters more to organizations, and the difference between their probabilities is practically insignificant, so Extreme Gradient Boosting remains the better choice.
We’ll keep you posted, so stay tuned…