What’s happening in Data? – Packt Hub

Pro tools for Pros: Industry-leading observability capabilities for Dataflow: Dataflow leads the batch and streaming data processing industry with best-in-class observability. It offers native integration with Google Cloud Error Reporting to help you identify and manage errors that impact your job's performance.

Understanding Machine Learning Algorithms

How to implement Cross-Validation in Python – By Stefan Jansen 

We will illustrate various options for splitting data into training and test sets by showing how a mock dataset with 10 observations is assigned to the train and test sets, starting with the following code: 

data = list(range(1, 11)) 

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 

Scikit-learn’s CV functionality, which we’ll demonstrate in this section, can be imported from sklearn.model_selection. 
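As a concrete starting point, the two utilities used in this section can be imported in one line (a minimal sketch; the examples below assume these names are in scope):

```python
# Both the single-split helper and the k-fold iterator live in
# sklearn.model_selection (the scikit-learn package).
from sklearn.model_selection import train_test_split, KFold
```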

For a single split of your data into a training and a test set, use train_test_split, where the shuffle parameter, by default, ensures the randomized selection of observations. You can ensure replicability by seeding the random number generator by setting random_state. There is also a stratify parameter, which ensures for a classification problem that the train and test sets will contain approximately the same proportion of each class. The result looks as follows: 

train_test_split(data, train_size=.8) 

[[8, 7, 4, 10, 1, 3, 5, 2], [6, 9]] 

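The stratify parameter mentioned above can be sketched with hypothetical binary labels for the mock dataset (the labels are an assumption for illustration, not part of the original example):

```python
from sklearn.model_selection import train_test_split

data = list(range(1, 11))
# Hypothetical labels: five observations of class 0, five of class 1.
labels = [0] * 5 + [1] * 5

# stratify=labels makes the 80/20 split preserve the 50/50 class balance.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, train_size=.8, stratify=labels, random_state=42)

print(sorted(y_train))  # [0, 0, 0, 0, 1, 1, 1, 1] - four of each class
print(sorted(y_test))   # [0, 1] - one of each class
```

Without stratify, a random 80/20 split of a small dataset can easily end up with an unbalanced test set.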

KFold iterator 

The KFold iterator produces several disjoint splits and assigns each of these splits once to the validation set, as shown in the following code: 

kf = KFold(n_splits=5) 

for train, validate in kf.split(data): 

    print(train, validate) 

[2 3 4 5 6 7 8 9] [0 1] 

[0 1 4 5 6 7 8 9] [2 3] 

[0 1 2 3 6 7 8 9] [4 5] 

[0 1 2 3 4 5 8 9] [6 7] 

[0 1 2 3 4 5 6 7] [8 9] 

In addition to the number of splits, most CV objects take a shuffle argument that ensures randomization. To render results reproducible, set the random_state as follows: 

kf = KFold(n_splits=5, shuffle=True, random_state=42) 

for train, validate in kf.split(data): 

    print(train, validate) 

[0 2 3 4 5 6 7 9] [1 8] 

[1 2 3 4 6 7 8 9] [0 5] 

[0 1 3 4 5 6 8 9] [2 7] 

[0 1 2 3 5 6 7 8] [4 9] 

[0 1 2 4 5 7 8 9] [3 6] 
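In practice, a configured KFold object is usually passed as the cv argument to a scoring utility rather than iterated by hand. A minimal sketch, assuming a toy noise-free regression problem and a LinearRegression model (both are illustrative choices, not from the original example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Toy data for illustration: an exact linear relationship y = 2x.
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel()

# cross_val_score fits and scores the model once per train/validation
# split produced by the KFold iterator.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores)  # one R^2 score per fold
```

Because the toy data is perfectly linear, each fold scores an R^2 of 1.0; on real data, the spread of the five scores indicates how sensitive the model is to the particular split.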

This explainer on ML algorithms was curated from the book Machine Learning for Algorithmic Trading, Second Edition by Stefan Jansen.
