In the last post, I introduced Intel Distribution of Modin and Intel Extension for Scikit-learn, integral parts of the Intel oneAPI AI Analytics Toolkit, and the overall Intel AI Software suite.

Let’s take a closer look at Intel Distribution of Modin and Intel Extension for Scikit-learn through this tutorial. The objective of this guide is to show how both work as drop-in replacements for the stock Pandas and Scikit-learn libraries. You can download the Jupyter Notebook from GitHub to try this either in Intel DevCloud or on your workstation.

For this tutorial, I provisioned an e2-standard-4 VM on Google Compute Engine with 4 vCPUs and 16GB RAM, based on the Intel Broadwell platform. It comes with Python 3.8 preinstalled, which I used as the runtime for this project.

We will train a model to detect fraudulent transactions based on the Fraud Transaction Detection dataset from Kaggle. It’s a ~500MB CSV file with over 6 million rows of data, making it an ideal candidate for Modin. This gives us a chance to compare the load times of Modin and Pandas. Before starting the project, download the dataset and copy it to the training environment.

The training algorithm is based on Nearest Neighbors, a technique that Scikit-learn offers both as an unsupervised neighbor search and as the basis for supervised classification and regression models. We will train the model twice, once with stock Scikit-learn and once with Intel Extension for Scikit-learn, and compare the training times.

Step 1: Configuring the Environment

Let’s start by installing pip and the required modules.

Now, install Intel Distribution of Modin, Intel Extension for Scikit-learn, and Jupyter.
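Once the installation finishes, a quick sanity check can confirm that the libraries are importable before you open the notebook. This is a minimal sketch using only the standard library; the module names listed (pandas, sklearn, modin, sklearnex) are my assumption about what the notebook imports, so adjust the list to your setup.

```python
import importlib.util

# Import names, not pip package names. This list is an assumption about
# what the notebook needs; edit it to match your environment.
REQUIRED_MODULES = ["pandas", "sklearn", "modin", "sklearnex"]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, available in status.items():
    print(f"{name:10s} {'ok' if available else 'MISSING'}")
```

Anything reported as MISSING needs to be installed before the later steps will run.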

Launch Jupyter Notebook and access it from the browser.

Step 2: Loading the Dataset and Measuring Performance

With the CSV file uploaded to your training environment, let’s load it into Modin and Pandas.

As we load the dataset, we also measure the time taken by adding the %timeit magic function at the beginning of the cell.

In my environment, Pandas took ~12 seconds while Modin loaded the same dataset in ~6 seconds.
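If you want to reproduce the comparison outside a notebook cell, the sketch below times `read_csv` with whichever of the two libraries is installed. It is an illustration under assumptions, not the notebook's code: the helper `time_read_csv`, the tiny stand-in CSV, and its column names are all hypothetical, and a three-row file will not show any speedup. Point `path` at the real dataset to get meaningful numbers.

```python
import csv
import importlib
import importlib.util
import os
import tempfile
import time


def time_read_csv(module_name, path):
    """Load `path` with a pandas-compatible module; return (seconds, row count)."""
    pd_like = importlib.import_module(module_name)
    start = time.perf_counter()
    df = pd_like.read_csv(path)
    rows = len(df)  # read the row count inside the timed region
    return time.perf_counter() - start, rows


# Tiny stand-in file so the sketch runs anywhere.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["amount", "is_fraud"])  # hypothetical column names
    writer.writerows([[100.0, 0], [2500.0, 1], [42.5, 0]])

results = {}
for module_name in ("pandas", "modin.pandas"):
    top_level = module_name.split(".")[0]
    if importlib.util.find_spec(top_level) is None:
        continue  # skip libraries that are not installed
    results[module_name] = time_read_csv(module_name, path)

for module_name, (seconds, rows) in results.items():
    print(f"{module_name}: {rows} rows in {seconds:.3f}s")
```

Because both modules expose the same `read_csv` call, the only thing that changes between the two runs is the import.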

Intel Distribution of Modin loads the dataset about twice as fast as Pandas. With larger datasets, Modin delivers even more significant performance improvements.

Step 3: Preparing and Preprocessing the Dataset

Irrespective of how we loaded the dataset, we need to prepare and preprocess it to make it useful for the training.

First, we will drop the columns that are not relevant or useful.
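As a rough illustration of this step, here is a minimal sketch on a toy frame. The column names are hypothetical stand-ins, not the dataset's actual columns, so substitute the ones you decide to discard.

```python
import pandas as pd

# Hypothetical columns; the real dataset's column names may differ.
df = pd.DataFrame({
    "account_id": ["A1", "A2", "A3"],
    "type": ["PAYMENT", "TRANSFER", "PAYMENT"],
    "amount": [100.0, 2500.0, 42.5],
    "is_fraud": [0, 1, 0],
})

# Drop an identifier-like column that we do not want to feed to the model.
df = df.drop(columns=["account_id"])
print(df.columns.tolist())
```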

The type column in the dataset has five categories.

Let’s encode them into integers.
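One way to do the integer encoding is with an explicit mapping, sketched below. The category labels are placeholders I made up for the example (and only three are shown), so the codes you get on the real data will differ.

```python
import pandas as pd

# Hypothetical labels standing in for the values of the type column.
df = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

# Build a stable label -> integer mapping from the sorted unique values.
type_codes = {label: code for code, label in enumerate(sorted(df["type"].unique()))}
df["type"] = df["type"].map(type_codes)

print(type_codes)
print(df["type"].tolist())
```

Sorting the labels first keeps the mapping reproducible from run to run.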

Finally, we will one-hot encode the type column, which turns each category into its own indicator column. We append the new columns to the dataset and then delete the original column.
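The encode, append, and delete sequence can be sketched with `pd.get_dummies` and `pd.concat`. The toy values and the `type_` prefix are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with an integer-encoded type column (values are made up).
df = pd.DataFrame({"type": [1, 2, 0, 1], "amount": [100.0, 2500.0, 42.5, 7.0]})

# One indicator column per category, named type_0, type_1, ...
one_hot = pd.get_dummies(df["type"], prefix="type")

# Append the indicator columns, then remove the original column.
df = pd.concat([df, one_hot], axis=1).drop(columns=["type"])
print(df.columns.tolist())
```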

Since some of the values in the dataset are null, we will perform data imputation by replacing them with zeros.
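Zero-filling is a single call in Pandas. The sketch below uses a made-up frame with hypothetical column names to show the before and after counts of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of missing values (column names are hypothetical).
df = pd.DataFrame({
    "amount": [100.0, np.nan, 42.5],
    "old_balance": [np.nan, 10.0, 0.0],
})

missing_before = int(df.isna().sum().sum())
df = df.fillna(0)  # replace every null with zero
missing_after = int(df.isna().sum().sum())

print(missing_before, missing_after)
```

Zero is a simple default; whether it is the right fill value depends on what each column means.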

The dataset is now ready for training.

Step 4: Training the Model and Measuring the Performance

Before kicking off the training process, let’s separate the features and labels and then split the data into train and test datasets.

This creates a test dataset with 30% of the data and keeps the remaining 70% for training.
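A minimal sketch of the feature/label separation and the 70/30 split, using Scikit-learn's `train_test_split` on synthetic data. The column names, the label column `is_fraud`, and the `random_state` value are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.uniform(1, 1000, size=100),
    "type_0": rng.integers(0, 2, size=100),
    "is_fraud": rng.integers(0, 2, size=100),  # hypothetical label column
})

X = df.drop(columns=["is_fraud"])  # features
y = df["is_fraud"]                 # labels

# test_size=0.3 holds out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```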

First, let’s train the model with Scikit-learn and measure the performance.

Once it is done, we will repeat the step with Intel Extension for Scikit-learn. Notice that we are explicitly loading the sklearnex module and importing NearestNeighbors.
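The comparison can be sketched as follows. This is not the notebook's code: `time_fit` is a hypothetical helper, the data is random, and the Intel variant only runs when `sklearnex` is installed. On a dataset this small the timings say nothing about real-world speedups.

```python
import importlib.util
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors as StockNearestNeighbors


def time_fit(estimator_cls, X, n_neighbors=5):
    """Fit a nearest-neighbors estimator and return (model, seconds)."""
    model = estimator_cls(n_neighbors=n_neighbors)
    start = time.perf_counter()
    model.fit(X)
    return model, time.perf_counter() - start


# Random stand-in for the training features.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 4))

timings = {}
stock_model, timings["scikit-learn"] = time_fit(StockNearestNeighbors, X_train)

# Same class name, imported from sklearnex instead of sklearn.
if importlib.util.find_spec("sklearnex") is not None:
    from sklearnex.neighbors import NearestNeighbors as IntelNearestNeighbors
    _, timings["sklearnex"] = time_fit(IntelNearestNeighbors, X_train)

# Query the five nearest training points for the first three rows.
distances, indices = stock_model.kneighbors(X_train[:3])

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

The only difference between the two runs is the import line, which is what makes the extension a drop-in replacement.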

In my environment, stock Scikit-learn took 23.8 seconds while Intel Extension for Scikit-learn finished training in only 5.72 seconds, a speedup of over 4X. Though the results may vary on your machine, it is evident that Intel Extension for Scikit-learn is significantly faster than stock Scikit-learn. It accelerates training on general-purpose x86 CPUs without the need for expensive AI accelerators such as GPUs and FPGAs.
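The speedup figures are simply the ratio of the baseline time to the accelerated time. Plugging in the numbers reported in this post (the `speedup` helper is just for illustration):

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is."""
    return baseline_seconds / accelerated_seconds


training_speedup = speedup(23.8, 5.72)  # stock Scikit-learn vs. Intel Extension
loading_speedup = speedup(12, 6)        # Pandas vs. Modin (approximate timings)

print(f"training: {training_speedup:.2f}x, loading: {loading_speedup:.1f}x")
```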
