Implementing Random Forest Regression in Python: An Introduction

In this tutorial we’ll try to understand one of the most important algorithms in machine learning: the random forest algorithm. We’ll look at what makes random forest so special and implement it on a real-world data set using Python. You can find the code along with the data set here.

But first things first, let’s get some background.

What Is Random Forest Regression?

Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression. Random forest is a bagging technique and not a boosting technique. The trees in a random forest run in parallel, meaning there is no interaction between them while they are being built.

Ensemble Learning

An ensemble method is a technique that combines the predictions from multiple machine learning algorithms to make more accurate predictions than any individual model. A model made up of many models is called an ensemble model.

[Figure: Ensemble learning method]

What Are The Different Kinds of Ensemble Learning?

  1. Boosting
  2. Bootstrap Aggregation (Bagging)


Boosting

Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Boosting is all about teamwork. Each model that runs dictates the features on which the next model will focus.

As the name suggests, boosting means one is learning from another, which in turn boosts the learning.
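As a minimal sketch of the idea (the article doesn’t prescribe a library for this section, so scikit-learn’s AdaBoostRegressor and the synthetic data are assumptions on my part):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor

# Synthetic regression data, just to have something to fit.
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Each successive weak learner is fit with more weight on the examples the
# previous learners predicted badly, so each model learns from the last.
booster = AdaBoostRegressor(n_estimators=50, random_state=0)
booster.fit(X, y)
print(booster.score(X, y))  # R^2 on the training data
```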

Bootstrap Aggregation (Bagging)

Bootstrap refers to random sampling with replacement. Bootstrap allows us to better understand the bias and the variance within the data set. Bootstrapping involves a random sampling of a small subset of data from the data set.

Bagging is a general procedure that can be used to reduce the variance for those algorithms that have high variance, typically decision trees. Bagging makes each model run independently and then aggregates the outputs at the end without preference to any model.
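Here’s a comparable sketch for bagging, again assuming scikit-learn (BaggingRegressor) and synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

# Ten base learners (decision trees by default), each trained independently
# on a bootstrap sample; their predictions are averaged at the end with no
# preference for any one model.
bagger = BaggingRegressor(n_estimators=10, random_state=0)
bagger.fit(X, y)
print(bagger.score(X, y))
```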

[Embedded video: Machine Learning Tutorial Python — Random Forest]

Problems with Decision Trees

Decision trees are sensitive to the specific data on which they are trained. If the training data changes, the resulting decision tree can be quite different and, in turn, so can its predictions.

Decision trees are also computationally expensive to train, carry a big risk of overfitting and tend to find local optima because they can’t go back after they have made a split.

To address these weaknesses, we turn to random forest, which illustrates the power of combining many decision trees into one model.


Random Forest

[Figure: Random forest structure]


Random forest operates by constructing a multitude of decision trees at training time and outputting the class that’s the mode of the classes (classification) or mean prediction (regression) of the individual trees.
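You can verify this aggregation directly in scikit-learn (a small sketch on synthetic data): for regression, the forest’s prediction equals the mean of its individual trees’ predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average the 10 individual trees' predictions for the first sample...
per_tree = np.array([tree.predict(X[:1]) for tree in forest.estimators_])
print(per_tree.mean())

# ...and it matches the ensemble's output (up to floating-point error).
print(forest.predict(X[:1]))
```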

Why Do We Use Random Forest Algorithm?

  • It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.
  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

A random forest is a meta-estimator (i.e., it combines the results of multiple predictions) that aggregates many decision trees with some helpful modifications:

  1. The number of features that can be split on at each node is limited to some percentage of the total (a tunable hyperparameter, exposed as max_features in scikit-learn). This limitation ensures the ensemble model does not rely too heavily on any individual feature and makes fair use of all potentially predictive features.
  2. Each tree draws a random sample from the original data set when generating its splits, adding a further element of randomness that prevents overfitting.

The above modifications help prevent the trees from being too highly correlated; both map directly to scikit-learn arguments, as the sketch below shows.
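A quick sketch of how those two modifications surface in scikit-learn (the particular values here are arbitrary):

```python
from sklearn.ensemble import RandomForestRegressor

# max_features limits how many features each split may consider (here, 50%
# of them), and bootstrap=True (the default) gives every tree its own
# random sample drawn with replacement from the original data set.
forest = RandomForestRegressor(
    n_estimators=100,
    max_features=0.5,
    bootstrap=True,
    random_state=0,
)
```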

For example, see the nine decision tree classifiers below:

[Figure: Nine different decision tree classifiers]

We can aggregate these decision tree classifiers into a random forest ensemble that combines their outputs. Think of the horizontal and vertical axes of the above decision tree outputs as features x1 and x2. At certain values of each feature, the decision tree outputs a classification of blue, green, red, etc.

The above results are aggregated, through model votes or averaging, into a single ensemble model that ends up outperforming any individual decision tree’s output.

You can see the aggregated result for the nine decision tree classifiers below:  

[Figure: Random forest ensemble for the above decision tree classifiers]

Disadvantages of Random Forest

  • Random forests have been observed to overfit for some data sets with noisy classification/regression tasks.
  • For data including categorical variables with different numbers of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.


Implementing Random Forest Regression 

1. Importing Python Libraries and Loading our Data Set into a Data Frame

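The original code appears here as a screenshot, so the snippet below is a reconstruction; the file name Position_Salaries.csv and the column layout (one feature column, one target column) are assumptions, so substitute your own data set.

```python
import numpy as np
import pandas as pd

# Hypothetical file name and column layout; adjust for your data set.
df = pd.read_csv('Position_Salaries.csv')
X = df.iloc[:, 1:2].values  # feature matrix, kept 2-D for scikit-learn
y = df.iloc[:, 2].values    # target vector
```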

2. Splitting our Data Set Into Training Set and Test Set

This step is only for illustrative purposes. There’s no need to split this particular data set since we only have 10 values in it.

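A hedged reconstruction of the split step (standard scikit-learn train_test_split; the 80/20 ratio is an assumption):

```python
from sklearn.model_selection import train_test_split

# Illustrative only: with just 10 rows, a hold-out split serves no real purpose.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```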

3. Creating a Random Forest Regression Model and Fitting it to the Training Data

For this model I’ve chosen 10 trees (n_estimators=10).

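A reconstruction of the fitting step with the 10 trees mentioned above (the random_state value is an assumption, added for reproducibility):

```python
from sklearn.ensemble import RandomForestRegressor

# 10 trees, as chosen above; fit on the full data set since we skipped the split.
regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)
```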

4. Visualizing the Random Forest Regression Results

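A sketch of a typical visualization for this kind of one-feature regression (the plot styling and grid resolution are assumptions):

```python
import matplotlib.pyplot as plt

# Predict on a fine grid so the forest's stepwise output is visible.
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)

plt.scatter(X, y, color='red', label='data')
plt.plot(X_grid, regressor.predict(X_grid), color='blue', label='prediction')
plt.title('Random Forest Regression')
plt.xlabel('feature')
plt.ylabel('target')
plt.legend()
plt.show()
```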

There you go. We’ve learned about the various kinds of ensemble learning algorithms and how these algorithms help make random forest work. Now we have an intuitive understanding of the features and advantages of using random forest over other machine learning algorithms.   
