R Random Forest Tutorial with Example

Random forests are one of the most popular and powerful machine learning algorithms for predictive modeling. They work exceptionally well with tabular data and yield high accuracy with little tuning required.

In this comprehensive R random forest tutorial, you will learn:

  • What is a random forest and how does it work
  • Why random forests are effective machine learning models
  • How to train and evaluate random forest models in R
  • Tips for optimizing random forest hyperparameters
  • Visualizing variable importance from a trained random forest
  • Common use cases and examples for random forests

And much more!

So let's dive in and explore the wonderful world of random forests in R.

What is a Random Forest?

A random forest is an ensemble machine learning algorithm that operates by constructing a multitude of decision trees during training.

Ensemble methods use multiple learning algorithms to obtain better predictive performance compared to a single model. While unstable learners like decision trees tend to overfit training data, ensembles can reduce variance without increasing bias.

Here is a high-level overview of how random forests work:

  1. Many decision trees are trained, each on a bootstrap sample of the rows, with a random subset of columns considered at each split.
  2. Each decision tree makes a class prediction for unseen data.
  3. The predictions from all trees are aggregated through voting or averaging to make the overall random forest prediction.

By training many trees on random subsets of rows and features, the correlation between trees decreases and the ensemble becomes more robust to noise in the training data.
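
To make the row-sampling and voting idea concrete, here is a minimal illustrative sketch (not part of the tutorial's own code) that bags plain rpart trees on R's built-in iris data. It omits the per-split column sampling that a real random forest also performs:

library(rpart)

set.seed(42)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)   # bootstrap sample of rows
  rpart(Species ~ ., data = iris[rows, ])      # one decision tree per bootstrap sample
})

# Each tree votes on a class; the majority vote is the ensemble prediction
votes <- sapply(trees, function(tr) as.character(predict(tr, iris, type = "class")))
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ensemble_pred == iris$Species)            # ensemble accuracy on the training rows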

The main hyperparameters that control random forests are:

  • ntree: The number of trees. More trees stabilize the predictions and the out-of-bag error estimate but increase training time; accuracy typically plateaus after a few hundred trees.
  • mtry: The number of variables randomly sampled as split candidates at each node. Lower values decorrelate the trees (reducing variance) but can weaken each individual tree.

In practice, random forest hyperparameters require little tuning compared to neural networks and boosting algorithms. The defaults often yield good results.
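
For reference, randomForest() derives its default mtry from the number of predictors p. A quick sketch of that rule (p = 8 is just an example value):

p <- 8                    # number of predictor variables
floor(sqrt(p))            # default mtry for classification
max(floor(p / 3), 1)      # default mtry for regression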

Now let's see random forests in action with some R code!

Step 1) Import Libraries and Data

We will use two popular R packages for working with random forests:

  • randomForest: Provides the randomForest() training algorithm
  • caret: Utilities for training and evaluating models, such as cross-validation and hyperparameter tuning

First install and load these libraries:

install.packages("randomForest")  
install.packages("caret")

library(randomForest)
library(caret)

For this analysis, we will use the Pima Indians diabetes dataset to predict diabetes onset based on diagnostic measurements. Let's import and explore the data:

diabetes <- read.csv("pima-indians-diabetes.csv")
str(diabetes)
summary(diabetes)

We can see there are 768 observations of female patients with 8 diagnostic attributes such as glucose level and BMI. The target is a binary diabetes onset variable.
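
One preprocessing detail worth checking: randomForest() treats a numeric target as a regression problem, so if the outcome column was read in as 0/1 numbers, convert it to a factor first. This sketch assumes the column is named diabetes, matching the formula used below:

diabetes$diabetes <- as.factor(diabetes$diabetes)   # assumed column name; adjust to your file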

Now the data is ready for modeling with random forests!

Step 2) Train a Random Forest

Using the randomForest package, it is straightforward to train a random forest model in R. We just need to specify a formula defining the variables and the number of trees:

set.seed(123)

rf_model <- randomForest(diabetes ~ ., data = diabetes, ntree = 500) 
  • diabetes ~ . predicts diabetes using all of the other variables as predictors
  • data = diabetes passes the training data frame
  • ntree = 500 constructs 500 trees

That's it! Behind the scenes this:

  1. Constructed 500 decision trees on random subsets of rows and columns
  2. Made predictions by aggregating the trees through voting
  3. Estimated error rates and variable importance
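
Once trained, the forest can score new rows directly. A quick sketch, reusing a few training rows purely for illustration:

predict(rf_model, newdata = head(diabetes))                 # predicted class labels
predict(rf_model, newdata = head(diabetes), type = "prob")  # class vote proportions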

Let's take a quick look at the trained randomForest object:

rf_model

Call:
 randomForest(formula = diabetes ~ ., data = diabetes, ntree = 500) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 23.96%
Confusion matrix:
     0   1 class.error
0 485 104   0.1763285
1 103  76   0.5753425

We can see estimates for the out-of-bag validation error, the confusion matrix, and the number of variables sampled at each split. For classification, the default is roughly the square root of the number of predictor variables.

Now let's properly evaluate accuracy using cross-validation…

Step 3) Evaluate Accuracy with Cross-Validation

While the OOB estimate provides an initial sanity check, we should use cross-validation to reliably evaluate our model.

The train() function from the caret package runs cross-validation for us:

set.seed(123)

control <- trainControl(method="cv", number=10)  

rf <- train(diabetes ~ ., data = diabetes, method = "rf", trControl=control, ntree=500)

print(rf)

The output shows 78.1% accuracy estimated using 10-fold cross-validation. Not bad considering we used all default hyperparameters!

Accuracy       0.781
Kappa          0.529
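
caret also stores the fold-level results on the returned train object, which is handy for checking how stable that estimate is:

rf$resample   # accuracy and kappa for each of the 10 folds
rf$results    # summary across the candidate mtry values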

We could try to boost performance further by tuning hyperparameters like mtry and ntree.

Step 4) Tune and Evaluate Random Forest

Let's run a grid search for a better value of mtry. Note that caret's built-in "rf" method only tunes mtry through tuneGrid; ntree is passed straight through to randomForest(), so comparing different tree counts means calling train() once per ntree value:

tunegrid <- expand.grid(.mtry = c(5, 10, 15))

set.seed(123)
rf_tuned <- train(diabetes ~ ., data = diabetes, method = "rf",
                  tuneGrid = tunegrid, trControl = control, ntree = 100)

print(rf_tuned)

The model with mtry = 15 (trained here with ntree = 100) emerges as the best random forest based on accuracy, improving our performance to 78.9%:

mtry   Accuracy      Kappa
  15  0.7889474  0.5541948

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.

We achieved a small gain in accuracy at the expense of more complex hyperparameter tuning. In many cases, the default random forest works very well right off the bat.
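
If you would rather let caret pick candidate mtry values than supply a grid, tuneLength offers a lighter-weight search. A sketch, reusing the control object from above:

set.seed(123)
rf_quick <- train(diabetes ~ ., data = diabetes, method = "rf",
                  tuneLength = 5, trControl = control)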

But our journey isn't over yet! Next we will visualize the model to provide interpretation.

Step 5) Visualize and Interpret Results

A major advantage of tree-based models like random forests is that they allow for easy interpretation to understand why predictions were made.

Let's look at variable importance and partial dependence plots.

Variable Importance

The varImp() function reports importance measures such as the mean decrease in Gini impurity and the mean decrease in accuracy, showing which variables mattered most to the model's predictions. Higher numbers indicate greater importance.

importance <- varImp(rf_tuned)
plot(importance)

[Figure: variable importance plot]

We can see plasma glucose concentration far outweighed any other variable in predicting diabetes onset, with BMI also playing a key role. This aligns with medical knowledge!

Understanding which variables drive predictions can provide insight into the fundamental relationships in your data.
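
To see the numbers behind the plot, the importance scores can be pulled straight off the fitted objects:

varImp(rf_tuned)$importance        # caret's scaled importance table
importance(rf_tuned$finalModel)    # raw randomForest importance (mean decrease in Gini)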

Partial Dependence Plots

While variable importance tells us which variables are most influential overall, it lacks specifics on how the model uses those variables.

Partial dependence plots show the marginal effect of a variable on the predicted outcome. They illustrate how the prediction changes as that variable changes, averaging over the observed values of the other variables.

Let's plot the two most important variables with partialPlot() from the randomForest package to see their functional relationships (the column names below assume the glucose and BMI columns are called glucose and mass; adjust to your file):

partialPlot(rf_tuned$finalModel, pred.data = diabetes, x.var = "glucose")
partialPlot(rf_tuned$finalModel, pred.data = diabetes, x.var = "mass")

[Figure: partial dependence plots]

We clearly observe a positive relationship between glucose and diabetes likelihood, with risk rising rapidly over 125 mg/dL. BMI exhibits a more gradual positive correlation.

Together, variable importance and partial plots deliver actionable data insights!

Common Use Cases for Random Forests

Now that you understand the basics of training and interpreting random forests in R, where can they be applied?

Some of the most popular use cases include:

Tabular data modeling: Random forests excel at modeling numeric and categorical data in tables. With accurate performance and easy interpretation, they are a "go-to" tabular technique.

Handle many predictor variables: A major advantage over linear models is the ability to use many predictors without overfitting, since the split-selection process effectively performs variable selection.

Rank variable importance: The variable importance scores obtained from trained random forest models provide a robust statistical means to identify key predictive variables and relationships.

Impute missing values: The randomForest package ships helpers such as na.roughfix() (quick median/mode fill) and rfImpute() (proximity-weighted imputation based on a fitted forest), which preserve training samples that might otherwise be discarded by traditional imputation workflows.
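
As a sketch of those imputation helpers (df and outcome are placeholder names for a data frame with missing predictor values and a complete outcome column):

filled  <- na.roughfix(df)                    # quick median/mode fill of missing predictors
imputed <- rfImpute(outcome ~ ., data = df)   # proximity-weighted imputation; the outcome itself must be complete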

Let's go through examples of two common applications – a classification and a regression problem.

Random Forest Classification Example

For classification tasks, a random forest predicts categorical targets by aggregating votes across its decision trees; the vote proportions can also serve as class probability estimates.

Let's walk through a binary classification example using random forests…

The Problem

We will build a model to predict credit risk – specifically if a person will default on their credit based on financial metrics and demographics.

The features provided in the dataset include:

  • Checking account status
  • Credit history
  • Purpose
  • Credit amount
  • Savings account balance
  • Employment years

And more…

The target is a binary variable indicating a good (1) or bad (0) credit risk.

Load Data

We first load the credit dataset and convert categorical variables into dummy indicators:

credit <- read.csv("credit.csv")

library(dummies)
credit <- dummy.data.frame(credit, sep = "_")

Checking the data structure confirms our preprocessing:

str(credit) 
'data.frame':   1000 obs. of  41 variables:
 $ checking_status_no_checking               : num  1 0 0 0 0 1 0 1 0 0 ...
 $ checking_status_less_0                    : num  0 0 1 0 1 0 1 0 1 0 ...
 $ checking_status_greater_0                 : num  0 1 0 1 0 0 0 0 0 1 ...
 $ history_critical                          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ history_good                              : num  1 0 0 0 0 1 0 1 0 1 ... 
 $ purpose_businness                         : num  0 0 0 1 0 0 0 0 1 1 ...
...

The data is now ready for modeling!
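
One caveat before training: for caret to treat this as a classification problem, the target must be a factor. Assuming the outcome column is named default and coded 0/1:

credit$default <- as.factor(credit$default)   # assumed column name and coding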

Train and Evaluate Model

We specify a formula using all features except the ID, set up 10-fold cross-validation, and train the random forest model:

set.seed(1)
f <- as.formula(default ~ . - ID)

control <- trainControl(method="cv", number=10)

rf <- train(f, data=credit, method="rf", trControl=control, ntree=100)

print(rf)

The output shows 87% accuracy – not bad right off the bat!

Accuracy       0.872
Kappa          0.693

We could try tuning mtry and maxnodes for further gains. But overall, random forests provide an accurate "off the shelf" approach here.

And interpreting variable importance is straightforward using the methods shown earlier.

So in summary, random forests delivered excellent predictive performance for credit risk classification with this tabular data.

Random Forest Regression Example

For predicting continuous numeric targets like home prices, random forest regression models can be similarly effective.

Let's demonstrate with an example…

The Problem

Here our goal is to build a model that predicts Boston housing prices based on metrics like crime rate, proximity to highways, pupil-teacher ratios, etc.

This is a popular regression benchmark dataset.

Load Data

We start by loading the Boston housing data from the MASS package (which ships with R):

data("Boston", package = "MASS")

Checking the structure:

str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The target variable is the median home value medv. And we have all the expected predictor variables related to home prices.

Train and Evaluate Model

As before, we set up 10-fold cross-validation with caret and train a random forest model:

set.seed(123)
f <- as.formula(medv ~ .)

control <- trainControl(method="cv", number = 10)  

rf <- train(f, data=Boston, method="rf", trControl = control)   

print(rf)

The output displays an R^2 of 92% and RMSE of 3.3:

RMSE      Rsquared 
  3.28     0.9243

So our random forest can explain over 92% of variance in Boston housing prices – not too bad!
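
Generating predictions from the fitted model works the same way as in classification. A quick sketch, reusing a few training rows for illustration:

head(predict(rf, newdata = Boston))   # predicted median home values (in $1000s)
head(Boston$medv)                     # actual values for comparison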

And as before, variable importance provides interpretation to understand the main drivers. Geographic location and factors like crime rate and air pollution (nox) carry the most influence according to the importance scores.

Summary

In this comprehensive R tutorial, you discovered the power and intuition behind random forest models, one of the most popular algorithms for predictive modeling of tabular data.

The key takeaways include:

  • Random forests are ensemble methods that combine many decision trees to yield accurate predictions while reducing the overfitting inherent in single decision trees
  • They require little data preprocessing and handle both numeric and categorical variables
  • The main hyperparameters (ntree and mtry) need far less tuning than those of many other machine learning algorithms, and the defaults are often competitive
  • Variable importance plots deliver straightforward interpretation to identify relationships in your data
  • Use cases span classification and regression problems for structured data

So in your next modeling endeavour that involves tabular data, be sure to have random forests in your toolbox! Proper application of these techniques can elevate your predictive analytics to new heights.

To learn more about random forest models and practice hands-on in R, check out the following resources:

Random Forest Algorithm in R

Random Forest Project in R

Tuning Random Forest Hyperparameters
