Credit Card Fraud Detection Modeling with a Keras Deep Learning Autoencoder

Today I planned to fly to Guangzhou, but all flights had been cancelled by the time I arrived at the airport. I went back to school for lunch, caught up with my classmates in the laboratory, and then calmed down to write about my deep learning modeling experience.

I saw a foreign blog post about using a Keras autoencoder as a classification model for bank credit card fraud. It caught my interest because I have built similar models with data mining techniques in the past. Now let's see how a deep autoencoder handles the problem.

Encoder|Transmitter|Soft starter|Power supply|VFD|Light curtain|Servo products|Human-machine interface|Solenoid valve|Cylinder|Actuator|Flow meter|Transformer|Thyristor|Sensor|Deceleration drive gearbox reducer|Linear guide|Circuit breaker|Refrigeration compressorAll industrial products of okplazas are purchased from the original factory or formal channels of agents to ensure the original authenticity, which perfectly solves the problem of industrial product sample procurement and small-batch procurement for the majority of engineers and purchasing personnel.

Bank credit card fraud models have traditionally relied on data mining techniques, generally statistical modeling or machine learning algorithms: multivariate statistical methods such as logistic regression and discriminant analysis, or decision trees, support vector machines (SVM), Bayesian networks, KNN, neural networks, and so on. The neural networks used there are mainly machine learning models with a single hidden layer.

Recently, deep learning has become popular, especially in image recognition, autonomous driving, machine translation, and game playing. In particular, since Google open sourced the TensorFlow library, and with Keras available as an open source high-level framework on top of it, the application value and potential application scenarios of deep learning modeling have become broad and far-reaching. The recent popularity of artificial intelligence (AI) in particular highlights the development of deep learning, a technical iteration powered by data.

Recently, I started to learn the basic algorithms and modeling techniques of deep learning. I have the strong feeling that, in the era of big data, the saying goes: with machine learning, who still uses statistical techniques? And with deep learning, who still uses machine learning? Of course this is a joke, but powerful deep learning may well bring a revolution in algorithms!

Next, I will demonstrate how to use a Keras autoencoder model to build a fraud detection model for bank credit cards.

It is recommended to install the following on your system in advance:

1-Python2.7. Installing the Python environment through Anaconda is recommended, since it automatically installs many dependent packages.

2-Tensorflow package, open sourced by Google and currently the most popular deep learning package.

3-Keras package, which supports Tensorflow and Theano as backends. I always choose Tensorflow.

I prefer to use Jupyter Notebook for interactive programming.

First we load the various Python packages we need:

Using TensorFlow backend.

Deep learning relies heavily on array and matrix operations, which packages such as numpy, pandas, scipy, and sklearn provide.

Load the bank credit card data. I store it at 'data/creditcard.csv' under the startup directory.

Load the data into a pandas DataFrame:

The data comes from Kaggle. Kaggle is the home of data science, with many prize-awarding modeling competitions. Anyone who likes data science should set up an account as soon as possible. There is a large amount of open data, and you can read the source code of modeling experts from around the world. You can even run code directly on the site, mainly in Python and R.

The data source contains 284,807 transaction records over two days, of which 492 transactions are marked as fraud.

Special note: there are 28 numerical independent variables, V1 to V28, that relate to fraud. We cannot see the original data, only the 28 principal components produced by PCA (principal component analysis). This gave me an idea: PCA is also an important data desensitization technique. If data privacy is a concern in the future, the data can be handed to a third party after a PCA transformation.

Two other variables are left untransformed: the transaction time Time and the transaction amount Amount.

Time is the number of seconds elapsed between each transaction and the first transaction in the data set.

Let’s take a brief look at the data structure: 31 columns of variables
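
As a quick sanity check on the column count, the layout described so far can be written out in Python. The names Time, V1 to V28, Amount, and Class come from the text; the exact column order used here is an assumption based on Class being the last column.

```python
# Column layout of creditcard.csv as described in the text.
# Assumption: Time comes first, Amount sits just before Class.
pca_columns = ["V%d" % i for i in range(1, 29)]  # V1 .. V28
columns = ["Time"] + pca_columns + ["Amount", "Class"]

print(len(pca_columns))  # number of PCA components
print(len(columns))      # total number of columns
```

One time column, 28 PCA components, one amount column, and one label column add up to the 31 columns mentioned above.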

There are no missing values either.

The last column is Class: 1-Fraud, 0-Normal

Let's look at the distribution of the target variable class:

It can be seen from the figure that the classes are highly imbalanced: fraud makes up a tiny fraction of the more than 280,000 transactions. Of course, this is normal. If 10% of transactions were fraudulent, the bank would probably go bankrupt, hehe.
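
To put a number on the imbalance, the two counts quoted earlier (284,807 transactions, 492 of them fraud) are enough. This is plain arithmetic on those figures, not a computation on the data file itself.

```python
# Counts taken from the data description in the text.
total_transactions = 284807
fraud_transactions = 492

fraud_ratio = fraud_transactions / float(total_transactions)
normal_per_fraud = (total_transactions - fraud_transactions) / float(fraud_transactions)

print("fraud share: %.3f%%" % (100 * fraud_ratio))
print("normal transactions per fraud: %.0f" % normal_per_fraud)
```

Fraud is roughly 0.17% of all transactions, so a model that always predicts "normal" would already be right more than 99.8% of the time. This is why plain accuracy is a poor yardstick here.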

In traditional data mining modeling, this situation usually calls for techniques such as over-sampling, together with cross-validation to verify the model.

I first run a descriptive statistical analysis to look at the data structure and summary statistics:

Compare the transaction amount across the two classes:

Compare the transaction time across the two classes:

Deep learning modeling with an autoencoder

An autoencoder is a special kind of deep learning model that works as follows:

The autoencoder takes the input data, learns an abstract representation of its structure, and compresses the multi-dimensional or high-dimensional data into a low-dimensional representation. This is the encoder step. The compressed representation is then fed into the decoder, which tries to reconstruct the original input as accurately as possible. Here, how well a transaction can be reconstructed is what helps to separate the two values of Class.

Deep learning algorithms are mainly built on mathematics. It is worth reviewing some basics of linear algebra and calculus, such as differentiation, matrix transformations, mappings, and function transformations.

The visualized autoencoder model is as follows:

It includes an input layer, an encoder network, a decoder network, and an output layer.

Model optimization and parameter adjustment:

An important feature of deep learning is that data passes from the input layer through the hidden layers to the output layer, which is called forward propagation. To optimize the weights, the error is repeatedly propagated backward and the parameters are adjusted toward an optimal solution, that is, one that minimizes the reconstruction error between the input and the output.

The traditional squared error is used here:
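
Written out, with $x$ the input vector of $n$ features and $\hat{x}$ its reconstruction, the squared reconstruction error usually takes the form below. The symbols $f$ (encoder) and $g$ (decoder) are notation introduced here for illustration and are not taken from the original post.

```latex
\hat{x} = g(f(x)), \qquad
L(x, \hat{x}) = \lVert x - \hat{x} \rVert^{2} = \sum_{i=1}^{n} (x_i - \hat{x}_i)^{2}
```

Training adjusts the weights of $f$ and $g$ so that this quantity is small on the training transactions.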

Preparing the data for the autoencoder:

With the deep learning algorithm, we no longer need the Time variable. At the same time, we use sklearn's StandardScaler to standardize the transaction amount to zero mean and unit variance. Note that this does not confine the values to (-1, 1).

It is worth emphasizing that deep learning generally works better with scaled inputs, for example standardized values or values rescaled to (0,1) or (-1,1).
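
A minimal sketch of what StandardScaler-style standardization does to a column, written with the standard library only so the arithmetic is visible. The function name and the toy amounts are made up for illustration; in practice sklearn's StandardScaler would be applied to the Amount column.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy amounts, not taken from the data set: four small values and one outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 250.0]
scaled = standardize(amounts)
print(scaled)
```

The scaled values have mean 0 and standard deviation 1, but the outlier still lands well outside (-1, 1). Standardization fixes the spread, not the range.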

Split the data into train and test data sets: train=0.80, test=0.20

The random seed is set at the beginning so that the modeling can be repeated.

In the split data sets we also drop the Class variable. This shows that the autoencoder is really a special kind of unsupervised, or semi-supervised, algorithm.
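
The split can be sketched in plain Python as follows. The seed value, the function name, and the toy rows are hypothetical. Fitting only on normal transactions is an assumption that matches the semi-supervised reading but is not stated explicitly in the text. With pandas and sklearn, train_test_split would normally do the shuffling.

```python
import random

RANDOM_SEED = 42  # hypothetical value; any fixed seed makes the split repeatable


def split_for_autoencoder(rows, test_size=0.2, seed=RANDOM_SEED):
    """Shuffle rows, hold out a test share, and separate features from Class."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]

    # Assumption: only normal transactions (Class == 0) are used for training.
    train_rows = [r for r in train_rows if r["Class"] == 0]

    def features(row):
        return [value for key, value in row.items() if key != "Class"]

    x_train = [features(r) for r in train_rows]
    x_test = [features(r) for r in test_rows]
    y_test = [r["Class"] for r in test_rows]  # kept only for evaluation
    return x_train, x_test, y_test


# Toy rows standing in for the real data: every tenth row is fraud.
rows = [{"V1": float(i), "Amount": 1.0, "Class": int(i % 10 == 0)} for i in range(100)]
x_train, x_test, y_test = split_for_autoencoder(rows)
print(len(x_train), len(x_test), len(y_test))
```

The labels never reach the model. They are held back in y_test so that the reconstruction errors can be compared against them afterwards.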

Build the Autoencoder model:

The autoencoder consists of 4 fully connected Dense layers with 14, 7, 7, and 29 neurons. The first two layers form the encoder and the last two form the decoder. L1 regularization is used during training.

In deep learning modeling, the neural network architecture is generally built first. The model is then compiled, and the data set is fed in when fitting.

The encoder and decoder layers use "tanh" and "relu" activation functions, respectively.
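
The architecture described above might be sketched as follows. The layer sizes come from the text. Giving both encoder layers tanh and both decoder layers relu is one literal reading of the sentence above. The L1 penalty value and its placement on the first layer are assumptions. Keras is imported inside the function so that the parameter arithmetic can be checked without a deep learning backend installed.

```python
# Layer sizes from the text: 29 inputs -> 14 -> 7 -> 7 -> 29 outputs.
INPUT_DIM = 29
# Assumption: both encoder layers use tanh and both decoder layers use relu.
LAYERS = [(14, "tanh"), (7, "tanh"), (7, "relu"), (INPUT_DIM, "relu")]


def count_parameters(input_dim, layers):
    """Weights plus biases of a stack of fully connected layers."""
    total, fan_in = 0, input_dim
    for units, _activation in layers:
        total += fan_in * units + units
        fan_in = units
    return total


def build_autoencoder(input_dim=INPUT_DIM, layers=LAYERS, l1_penalty=1e-5):
    """Build the Keras model; l1_penalty is a hypothetical value."""
    # Imported lazily so the arithmetic above runs without a Keras backend.
    from keras import regularizers
    from keras.layers import Dense, Input
    from keras.models import Model

    inputs = Input(shape=(input_dim,))
    x = inputs
    for index, (units, activation) in enumerate(layers):
        # Assumption: the L1 penalty is applied to the first layer's activations.
        reg = regularizers.l1(l1_penalty) if index == 0 else None
        x = Dense(units, activation=activation, activity_regularizer=reg)(x)
    return Model(inputs=inputs, outputs=x)


print(count_parameters(INPUT_DIM, LAYERS))
```

The 29 inputs are the 31 columns minus Time and Class. With these sizes the network has fewer than a thousand trainable parameters, which is tiny by deep learning standards.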

The model is trained for 100 epochs with a batch size of 32 samples, and the best-performing model is saved to a file as a checkpoint. The ModelCheckpoint callback provided by Keras is very convenient for this. In addition, the training progress is exported in a format that TensorBoard understands.
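
A hedged sketch of such a training setup. The epoch count and batch size are from the text, and model.h5 is the file name the post reloads later. The optimizer, the loss name, and the log directory are assumptions. Keras is again imported inside the function so the rest of the snippet runs without a backend.

```python
import math

EPOCHS = 100
BATCH_SIZE = 32
CHECKPOINT_PATH = "model.h5"  # file name mentioned in the text
LOG_DIR = "./logs"            # hypothetical TensorBoard directory


def steps_per_epoch(n_samples, batch_size=BATCH_SIZE):
    """Number of weight updates in one pass over the training data."""
    return int(math.ceil(n_samples / float(batch_size)))


def train(autoencoder, x_train, x_test):
    """Fit the autoencoder to reproduce its own input."""
    # Imported lazily so this snippet loads without a Keras backend.
    from keras.callbacks import ModelCheckpoint, TensorBoard

    # Assumption: Adam optimizer with mean squared error as the loss.
    autoencoder.compile(optimizer="adam", loss="mean_squared_error")
    callbacks = [
        ModelCheckpoint(filepath=CHECKPOINT_PATH, save_best_only=True),
        TensorBoard(log_dir=LOG_DIR),
    ]
    # The input is also the target: the model learns to reconstruct x_train.
    return autoencoder.fit(
        x_train, x_train,
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        shuffle=True,
        validation_data=(x_test, x_test),
        callbacks=callbacks,
    )


print(steps_per_epoch(1000))
```

With save_best_only=True the checkpoint file is only overwritten when the monitored validation loss improves, so the file on disk is the best model seen so far rather than the last one.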

The final model is ready after about 50 minutes of training! Then reload the stored model from model.h5.

I have been considering whether to buy a computer with a GPU. A CPU is enough for training on many data sets, but I should first register on AWS and try cloud computing with a GPU.

Evaluate the model:

The model loss stays below about 0.76, and after 100 epochs it has converged well.

Predict on the test data set:

The average reconstruction error on the test set is about 0.73.
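
The per-transaction reconstruction error behind that average can be sketched like this. Taking the mean of the squared differences over the features of each row is an assumption about how the error is aggregated, and the toy numbers are made up.

```python
def reconstruction_errors(inputs, reconstructions):
    """Mean squared error per row between inputs and their reconstructions."""
    errors = []
    for row, rebuilt in zip(inputs, reconstructions):
        squared = [(a - b) ** 2 for a, b in zip(row, rebuilt)]
        errors.append(sum(squared) / float(len(row)))
    return errors


# Toy values, not taken from the data set.
x_test = [[0.0, 1.0], [2.0, 2.0]]
x_pred = [[0.0, 1.0], [0.0, 4.0]]  # stand-in for the autoencoder's output

errors = reconstruction_errors(x_test, x_pred)
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

Each transaction gets one error value. A perfectly reconstructed row scores 0, and rows the model cannot reproduce score high.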

Reconstruction error distribution for records without fraud:

Reconstruction error distribution for records with fraud:

ROC model evaluation chart:

The ROC curve looks very good, indicating that the model is effective: the area under the curve (AUC) is about 0.9583. The ROC curve plots the true positive rate against the false positive rate across all thresholds.
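
The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen fraud gets a higher score than a randomly chosen normal transaction. A small sketch of that definition, using made-up scores in place of reconstruction errors (sklearn's roc_auc_score would be the usual tool on real data):

```python
def roc_auc(labels, scores):
    """AUC as the share of (fraud, normal) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, not taken from the data set.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic, so it is only meant to show the idea. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.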

Precision and recall:

To better understand precision and recall: precision measures how relevant the returned results are, while recall measures how many of the relevant results are returned. Both take values between 0 and 1, and a value of 1 is best.

High recall with low precision generally means that many results are returned, most of which have little or no relevance. High precision with low recall is the opposite: few results are returned, but most of them are relevant. Ideally, both precision and recall are high. (This is similar to reading a confusion matrix.)
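
In terms of counts, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP, and FN are true positives, false positives, and false negatives. A small sketch with made-up labels and predictions:

```python
def precision_recall(labels, predictions):
    """Precision and recall for the positive class (1 = fraud)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 3 frauds, of which 2 are caught, plus 1 false alarm.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall = precision_recall(labels, predictions)
print(precision, recall)
```

Here two of the three flagged transactions are real fraud (precision 2/3), and two of the three real frauds were flagged (recall 2/3).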

Precision at different thresholds:

Recall at different thresholds:

To predict whether a new, unseen credit card transaction is normal or fraudulent, we compute the reconstruction error from the transaction data itself. If the error is greater than a predetermined threshold, we mark the transaction as fraud (we expect the model to have a low error on normal transactions).

Set the threshold to threshold=2.9 and look at the predictions:
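
The decision rule itself is a single comparison. A minimal sketch, using the threshold from the text and made-up error values:

```python
THRESHOLD = 2.9  # value from the text


def flag_fraud(errors, threshold=THRESHOLD):
    """Mark a transaction as fraud (1) when its reconstruction error exceeds the threshold."""
    return [1 if error > threshold else 0 for error in errors]


# Toy reconstruction errors, not taken from the data set.
sample_errors = [0.4, 2.9, 3.1, 15.0]
flags = flag_fraud(sample_errors)
print(flags)
```

Raising the threshold flags fewer transactions, which tends to raise precision and lower recall. Lowering it does the opposite.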

A more intuitive view is the confusion matrix:
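
A confusion matrix simply counts each combination of actual and predicted class. A short sketch with made-up labels, using rows for the actual class and columns for the predicted class (the same layout sklearn's confusion_matrix uses):

```python
def confusion_matrix(labels, predictions):
    """Return [[TN, FP], [FN, TP]] for binary labels (0 = normal, 1 = fraud)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(labels, predictions):
        matrix[actual][predicted] += 1
    return matrix


# Toy labels and predictions, not taken from the data set.
labels = [0, 0, 0, 1, 1]
predictions = [0, 1, 0, 1, 0]
matrix = confusion_matrix(labels, predictions)
print(matrix)
```

The off-diagonal cells are the two kinds of mistakes: normal transactions flagged as fraud (top right) and frauds that slipped through (bottom left).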

<div class=