How to optimise machine learning pipelines with TPOT?

In recent years, we have witnessed the emergence of automated machine learning (AutoML). Various tools provide this facility, each with its own advantages that make the AutoML process easy and fast. TPOT is one such library, with the notable feature of handing back the complete code of the final model. In this article, we are going to discuss the TPOT package for Python and demonstrate how it can be applied to the optimization of machine learning pipelines. The major points to be discussed in the article are listed below.

  1. What is TPOT?
  2. Implementation
    1. Data loading 
    2. Modelling 
    3. Checking accuracy 
    4. Pipeline creation  

TPOT is an open-source package for optimizing machine learning pipelines. The name stands for Tree-based Pipeline Optimization Tool. We can think of it as an automated machine learning tool that finds a high-performing model for our data; in addition, TPOT optimizes whole machine learning pipelines using genetic programming.

This pipeline optimization capability is what sets TPOT apart from other automated machine learning (AutoML) packages. Using TPOT, we can perform machine learning pipeline optimization in the following steps:

Step 1

This step is similar to other AutoML workflows: a clean and tidy dataset is given to the AutoML model, which searches through a set of candidate models and finds an optimized one. This step can be explained using the following flow chart.


Step 2

This step is the main feature of TPOT: along with the optimized model, it exports the complete code, that is, the pipeline of the final model. This tells you exactly how the final model is built. The flow chart below explains this step.

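To make the two steps concrete, here is a minimal sketch of how they map onto TPOT's API: fit() runs the search of step 1 and export() writes out the pipeline code of step 2. The toy data and the file name my_pipeline.py are illustrative choices, not part of the original walkthrough.

from sklearn.datasets import make_classification
from tpot import TPOTClassifier

# toy data just for this sketch
X, y = make_classification(n_samples=100, n_features=4, n_classes=2, random_state=0)

# step 1: search for a high-performing pipeline on the data
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X, y)

# step 2: export the complete scikit-learn code of the best pipeline found
tpot.export('my_pipeline.py')

We will walk through each of these calls in detail in the rest of the article.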

One more important thing about this package is that it is built on top of popular Python libraries such as scikit-learn, so the final code it generates will look familiar even to new learners of data science. The package can be installed using the following line of code.

!pip install tpot

After the installation, we are ready to get an optimized pipeline using TPOT.

In this article, we model a randomly generated dataset created with scikit-learn's make_classification function. Let's make the random data:

Data loading 

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=150, n_features=4, n_classes=3, n_clusters_per_class=1)

Let’s check how our data is generated:

print("sample in feature one -", X[:,0].size)
print("sample in feature two -", X[:,1].size)
print("sample in feature three -", X[:,2].size)
print("categories in target variable -", np.unique(y))

Output:

Here we can see that we have 150 samples in each feature and 3 categories in the target variable.
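If we also want to see how the 150 samples are spread across the 3 classes, a quick NumPy check (added here for illustration) does the job:

classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes, counts)))  # number of samples in each of the 3 classes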

Let’s plot the data.

import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y, s=25)  # colour the points by their target class
plt.show()

Output:

Here we can see the distribution of the data. Since it is randomly generated, the exact distribution will vary from run to run. Above we coloured the points according to the target categories; we can also check the distribution of the individual features.

# plot each feature against the sample index
plt.plot(X[:, 0], 'bo', ms=5)
plt.plot(X[:, 1], 'ro', ms=5)
plt.plot(X[:, 2], 'go', ms=5)
plt.show()

Output:

Here we get a clearer view of the individual features. Now that the data has been generated, we are ready to model it.

Modelling    

Here we aim to model the randomly generated data using a basic TPOT classifier. Before modelling, we need to split the data, which can be done in the following way:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.astype(np.float64), y.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

One thing to notice above is that we cast the data to float64 NumPy arrays; TPOT expects purely numeric input, so this is a requirement we need to satisfy. Now, after splitting the data, we can fit a TPOT pipeline in the following way:
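As a quick sanity check (added for illustration), we can confirm that the splits really are numeric NumPy arrays before handing them to TPOT:

print(type(X_train), X_train.dtype)  # <class 'numpy.ndarray'> float64
print(type(y_train), y_train.dtype)  # <class 'numpy.ndarray'> float64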

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=10, population_size=30, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

Output:

Here we have used 10 generations and a population size of 30 for the genetic search.
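The search can be tuned further through other TPOTClassifier arguments; the sketch below shows a few commonly used ones (the values are illustrative, not recommendations, and the variable name tpot_tuned is my own):

# illustrative only: 5-fold CV, accuracy scoring, all CPU cores
# and a hard 10-minute budget for the whole optimization
tpot_tuned = TPOTClassifier(generations=10,
                            population_size=30,
                            cv=5,
                            scoring='accuracy',
                            n_jobs=-1,
                            max_time_mins=10,
                            verbosity=2,
                            random_state=42)

Now let's check the accuracy of the model suggested by the pipeline.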

Checking accuracy 

After modelling, we need to evaluate the suggested model on the test data. We can do this using the accuracy score in the following way:

print(100 * tpot.score(X_test, y_test))

Output:

Here we can see that our final model achieves good accuracy on the test data.
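If we want more than a single score, the fitted TPOT object also exposes predict(), so any scikit-learn metric can be computed directly; a short sketch added for illustration:

from sklearn.metrics import accuracy_score, classification_report

y_pred = tpot.predict(X_test)                 # predictions from the best pipeline found
print(accuracy_score(y_test, y_pred))         # should match tpot.score(X_test, y_test)
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1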

Pipeline creation 

So far we have gone through the first step of TPOT, which is very similar to other AutoML tools. In this step, we will look at how TPOT differs from those tools: it can export the code for the whole modelling procedure needed to reproduce the suggested model. We can get this code in the following way.

tpot.export('tpot_pipeline.py')

The above call writes a Python file to our working directory. The code I got is the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9739130434782609
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MLPClassifier(alpha=0.1, learning_rate_init=0.01)),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=1.0,
                           min_samples_leaf=10, min_samples_split=20, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Here we can see that most of the libraries TPOT uses in the generated code are basic ones like scikit-learn and pandas, and the exported pipeline contains the model that gave the best results on our data. We only need to supply the path to the data, and the final code can be dropped into the main project without spending time on cross-validation or manual hyperparameter tuning.
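Since our data lives in memory rather than in a CSV file, a lightly adapted version of the exported pipeline (my own adaptation for illustration, not part of TPOT's output) could be reused like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# reuse the train/test split created earlier instead of reading a CSV file
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MLPClassifier(alpha=0.1, learning_rate_init=0.01)),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=1.0,
                           min_samples_leaf=10, min_samples_split=20, n_estimators=100)
)
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(X_train, y_train)
print(exported_pipeline.score(X_test, y_test))  # accuracy of the re-fitted pipeline on our test split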

Final words

In this article, we have discussed the TPOT library for automated machine learning, which can also hand back the whole code, or pipeline design, of the finalized model. This helps in optimizing machine learning pipelines and makes the process faster and more effective.
