Table of Contents
- What is a Random Forest?
- Step 1) Import Libraries and Data
- Step 2) Train a Random Forest
- Step 3) Evaluate Accuracy with Cross-Validation
- Step 4) Tune and Evaluate Random Forest
- Step 5) Visualize and Interpret Results
- Common Use Cases for Random Forests
- Random Forest Classification Example
- Random Forest Regression Example
- Summary
Random forests are one of the most popular and powerful machine learning algorithms for predictive modeling. They work exceptionally well with tabular data and yield high accuracy with little tuning required.
In this comprehensive R random forest tutorial, you will learn:
- What is a random forest and how does it work
- Why random forests are effective machine learning models
- How to train and evaluate random forest models in R
- Tips for optimizing random forest hyperparameters
- Visualizing variable importance from a trained random forest
- Common use cases and examples for random forests
And much more!
So let's dive in and explore the wonderful world of random forests in R.
What is a Random Forest?
A random forest is an ensemble machine learning algorithm that operates by constructing a multitude of decision trees during training.
Ensemble methods use multiple learning algorithms to obtain better predictive performance compared to a single model. While unstable learners like decision trees tend to overfit training data, ensembles can reduce variance without increasing bias.
Here is a high-level overview of how random forests work:
- Many decision trees are trained, each on a bootstrap sample of the rows, with a random subset of columns considered at each split.
- Each tree makes its own prediction for unseen data.
- The predictions from all trees are aggregated, by majority vote for classification or by averaging for regression, to form the overall random forest prediction.
By training many trees on random subsets of features and rows, the correlation between trees decreases and the trees become more robust to noise in training data.
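To make the aggregation idea concrete, here is a minimal hand-rolled sketch of bagging with single decision trees, using the rpart package and the built-in iris data (the object names are illustrative; randomForest does all of this for you, and additionally samples a random subset of columns at every split):
library(rpart)

set.seed(42)
n_trees <- 25
trees <- vector("list", n_trees)

for (i in seq_len(n_trees)) {
  boot_rows <- sample(nrow(iris), replace = TRUE)   # bootstrap sample of row indices
  trees[[i]] <- rpart(Species ~ ., data = iris[boot_rows, ])
}

# each tree votes on every row; the majority class is the ensemble prediction
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(forest_pred == iris$Species)   # accuracy of the hand-rolled ensemble on its training data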
The main hyperparameters that control random forests are:
- ntree: The number of trees. More trees stabilize the predictions and the error estimate, but increase training and prediction time; accuracy typically plateaus after a few hundred trees.
- mtry: The number of variables randomly sampled as split candidates at each node. Lower values decorrelate the trees, which reduces variance, but can weaken the individual trees.
In practice, random forest hyperparameters require little tuning compared to neural networks and boosting algorithms. The defaults often yield good results.
Now let's see random forests in action with some R code!
Step 1) Import Libraries and Data
We will use two popular R packages for working with random forests:
- randomForest: Provides the randomForest() training function
- caret: Useful utilities for training models like cross-validation and hyperparameter tuning
First install and load these libraries:
install.packages("randomForest")
install.packages("caret")
library(randomForest)
library(caret)
For this analysis, we will use the Pima Indians diabetes dataset to predict diabetes onset based on diagnostic measurements. Let's import and explore the data:
# assumes the CSV has a header row, with the binary outcome in a column named "diabetes"
diabetes <- read.csv("pima-indians-diabetes.csv")
str(diabetes)      # variable types and a preview of the values
summary(diabetes)  # summary statistics for each column
We can see there are 768 observations of female patients across nine columns: eight diagnostic attributes such as glucose level and BMI, plus a binary diabetes onset target.
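One small preparation step is worth doing first (this assumes read.csv imported the outcome as 0/1 numbers in a column named diabetes): convert the target to a factor so randomForest and caret treat the task as classification rather than regression, and glance at the class balance:
diabetes$diabetes <- factor(diabetes$diabetes)   # classification target, not a numeric response
table(diabetes$diabetes)                         # how many cases in each class?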
Now the data is ready for modeling with random forests!
Step 2) Train a Random Forest
Using the randomForest package, it is straightforward to train a random forest model in R. We just need to specify a formula defining the variables and the number of trees:
set.seed(123)
rf_model <- randomForest(diabetes ~ ., data = diabetes, ntree = 500)
- diabetes ~ . : uses all other variables to predict the diabetes outcome
- data = diabetes : passes the training data frame
- ntree = 500 : grows 500 trees
That's it! Behind the scenes this:
- Constructed 500 decision trees on random subsets of rows and columns
- Made predictions by aggregating the trees through voting
- Estimated error rates and variable importance
Let's take a quick look at the trained randomForest object:
rf_model
Call:
randomForest(formula = diabetes ~ ., data = diabetes, ntree = 500)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 23.96%
Confusion matrix:
0 1 class.error
0 485 104 0.1763285
1 103 76 0.5753425
We can see estimates for the out-of-bag (OOB) validation error, the confusion matrix, and the number of variables sampled at each split. The default is the square root (sqrt) of the total number of variables.
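You can also watch how the OOB error settles as trees are added; plotting the fitted randomForest object draws the error curves, which is a quick check on whether 500 trees were enough:
plot(rf_model)   # OOB error and per-class errors vs. number of trees
legend("topright", legend = colnames(rf_model$err.rate), lty = 1:3, col = 1:3)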
Now let's properly evaluate accuracy using cross-validation…
Step 3) Evaluate Accuracy with Cross-Validation
While the OOB estimate provides an initial sanity check, we should use cross-validation to reliably evaluate our model.
The train() function from the caret package runs cross-validation for us:
set.seed(123)
control <- trainControl(method="cv", number=10)
rf <- train(diabetes ~ ., data = diabetes, method = "rf", trControl=control, ntree=500)
print(rf)
The output shows 78.1% accuracy estimated using 10-fold cross-validation. Not bad considering we used all default hyperparameters!
Accuracy  Kappa
0.781     0.529
We could try to boost performance further by tuning hyperparameters like mtry and ntree.
Step 4) Tune and Evaluate Random Forest
Let's conduct a grid search to find good values for mtry and ntree. One caveat: caret's built-in "rf" method only tunes mtry through tuneGrid, so we pass each candidate ntree value directly to train() and compare the runs:
tunegrid <- expand.grid(mtry = c(5, 10, 15))

set.seed(123)
rf_tuned <- train(diabetes ~ ., data = diabetes, method = "rf",
                  tuneGrid = tunegrid, trControl = control, ntree = 100)
print(rf_tuned)
Comparing the runs (repeating the call with ntree = 500 and ntree = 1000), the model with mtry = 15 and ntree = 100 emerges as the optimal random forest based on accuracy, improving our performance to 78.9%:
mtry  Accuracy   Kappa
15    0.7889474  0.5541948
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 15.
We achieved a small gain in accuracy at the expense of more complex hyperparameter tuning. In many cases, the default random forest works very well right off the bat.
But our journey isn't over yet! Next we will visualize the model to provide interpretation.
Step 5) Visualize and Interpret Results
A major advantage of tree-based models like random forests is that they allow for easy interpretation to understand why predictions were made.
Let's look at variable importance and partial dependence plots.
Variable Importance
The varImp() function calculates statistics like the mean decrease in Gini impurity or in accuracy to determine which variables were most important for the model's predictions. Higher numbers indicate greater importance.
importance <- varImp(rf_tuned)
plot(importance)
We can see plasma glucose concentration far outweighed any other variable in predicting diabetes onset, with BMI also playing a key role. This aligns with medical knowledge!
Understanding which variables drive predictions can provide insight into the fundamental relationships in your data.
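If you prefer the raw numbers behind the plot, importance() from the randomForest package returns the underlying scores; here is a small sketch pulling them from the caret model's finalModel:
imp <- randomForest::importance(rf_tuned$finalModel)      # mean decrease in Gini impurity
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]   # most important variables first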
Partial Dependence Plots
While variable importance tells us which variables are most influential overall, it lacks specifics on how the model uses those variables.
Partial dependence plots show the marginal effect of a variable on the class probability. They illustrate how the predicted probability changes as the values of one variable change, averaging over the observed values of the other variables.
Let's plot the two most important variables to see their functional relationships:
# partialPlot() from randomForest draws partial dependence for one variable at a time;
# the column names ("glucose", and "mass" for BMI) are assumptions about this CSV's header
partialPlot(rf_tuned$finalModel, pred.data = diabetes, x.var = "glucose")
partialPlot(rf_tuned$finalModel, pred.data = diabetes, x.var = "mass")
We clearly observe a positive relationship between glucose and diabetes likelihood, with risk rising rapidly over 125 mg/dL. BMI exhibits a more gradual positive correlation.
Together, variable importance and partial plots deliver actionable data insights!
Common Use Cases for Random Forests
Now that you understand the basics of training and interpreting random forests in R, where can they be applied?
Some of the most popular use cases include:
Tabular data modeling: Random forests excel at modeling numeric and categorical data in tables. With accurate performance and easy interpretation, they are a "go-to" tabular technique.
Handle many predictor variables: A major advantage over linear models is the innate ability to model many variables without overfitting. Random forests auto-select important variables.
Rank variable importance: The variable importance scores obtained from trained random forest models provide a robust statistical means to identify key predictive variables and relationships.
Impute missing values: The randomForest package can fill missing cells, either with a quick median/mode fill (na.roughfix) or with proximity-weighted imputation (rfImpute). This preserves valuable training samples that would otherwise be discarded by simply dropping incomplete rows (see the sketch after this list).
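As a minimal sketch of that last point, here are the two imputation helpers that ship with the randomForest package, demonstrated on the built-in iris data with a few values blanked out for illustration:
library(randomForest)
set.seed(42)
iris_na <- iris
iris_na$Sepal.Length[sample(nrow(iris_na), 10)] <- NA     # knock out a few values

iris_rough   <- na.roughfix(iris_na)                      # quick fill: column median / most frequent level
iris_imputed <- rfImpute(Species ~ ., data = iris_na)     # proximity-weighted iterative imputation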
Let's go through examples of two common applications – a classification and a regression problem.
Random Forest Classification Example
For classification tasks, random forests make probabilistic predictions by aggregating votes across their decision trees for categorical target variables.
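For example, assuming the cross-validated diabetes model rf from Step 3 is still in memory, caret can return either the majority-vote class labels or the per-class vote shares:
head(predict(rf, newdata = diabetes))                     # hard class labels
head(predict(rf, newdata = diabetes, type = "prob"))      # per-class probabilities (vote shares)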
Let's walk through a binary classification example using random forests…
The Problem
We will build a model to predict credit risk – specifically if a person will default on their credit based on financial metrics and demographics.
The features provided in the dataset include:
- Checking account status
- Credit history
- Purpose
- Credit amount
- Savings account balance
- Employment years
And more…
The target is a binary variable indicating a good (1) or bad (0) credit risk.
Load Data
We first load the credit dataset and convert categorical variables into dummy indicators:
credit <- read.csv("credit.csv")
library(dummies)
credit <- dummy.data.frame(credit, sep = "_")
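Note that the dummies package has since been archived on CRAN; if it will not install, caret's dummyVars() gives an equivalent one-hot encoding. A sketch, assuming the target column is named default as in the modeling code below:
dv <- dummyVars(default ~ ., data = credit, sep = "_")                          # encoder for the predictors
credit <- data.frame(predict(dv, newdata = credit), default = credit$default)   # encoded predictors plus target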
Checking the data structure confirms our preprocessing:
str(credit)
'data.frame': 1000 obs. of 41 variables:
$ checking_status_no_checking : num 1 0 0 0 0 1 0 1 0 0 ...
$ checking_status_less_0 : num 0 0 1 0 1 0 1 0 1 0 ...
$ checking_status_greater_0 : num 0 1 0 1 0 0 0 0 0 1 ...
$ history_critical : num 0 0 0 0 0 0 0 0 0 0 ...
$ history_good : num 1 0 0 0 0 1 0 1 0 1 ...
$ purpose_businness : num 0 0 0 1 0 0 0 0 1 1 ...
...
The data is now ready for modeling!
Train and Evaluate Model
We specify a formula using all features except the ID, set up 10-fold cross-validation, and train the random forest model:
set.seed(1)
credit$default <- factor(credit$default)   # ensure the target is a factor so caret fits a classifier
f <- as.formula(default ~ . - ID)          # all features except the ID column
control <- trainControl(method = "cv", number = 10)
rf <- train(f, data = credit, method = "rf", trControl = control, ntree = 100)
print(rf)
The output shows 87% accuracy – not bad right off the bat!
Accuracy  Kappa
0.872     0.693
We could try tuning mtry and max nodes for further gains. But overall random forests provide an accurate "off the shelf" approach here.
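Here is a hedged sketch of what that tuning could look like (the object names are illustrative): mtry goes through caret's tuneGrid, while maxnodes, which caps tree size, is passed straight through to randomForest:
grid <- expand.grid(mtry = c(5, 10, 20))
set.seed(1)
rf_capped <- train(f, data = credit, method = "rf", trControl = control,
                   tuneGrid = grid, ntree = 100, maxnodes = 30)
print(rf_capped)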
And interpreting variable importance is straightforward using the methods shown earlier.
So in summary, random forests delivered excellent predictive performance for credit risk classification with this tabular data.
Random Forest Regression Example
For predicting continuous numeric targets like home prices, random forest regression models can be similarly effective.
Let's demonstrate with an example…
The Problem
Here our goal is to build a model that predicts Boston housing prices based on metrics like crime rate, proximity to highways, pupil-teacher ratios, etc.
This is a popular regression benchmark dataset.
Load Data
We start by loading the Boston housing data, which ships with the MASS package:
library(MASS)
data("Boston")
Checking the structure:
str(Boston)
'data.frame': 506 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
The target variable is the median home value, medv. And we have all the expected predictor variables related to home prices.
Train and Evaluate Model
As before, we set up 10-fold cross-validation with caret and train a random forest model:
f <- as.formula(medv ~ .)   # predict medv from all other columns
control <- trainControl(method="cv", number = 10)
rf <- train(f, data=Boston, method="rf", trControl = control)
print(rf)
The output displays an R^2 of 92% and RMSE of 3.3:
RMSE  Rsquared
3.28  0.9243
So our random forest can explain over 92% of variance in Boston housing prices – not too bad!
And as before, variable importance provides interpretation of the main drivers; factors like crime rate (crim) and air pollution (nox) rank among the most influential variables according to the importance scores.
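The same varImp() call used earlier produces that ranking for the regression forest:
importance_boston <- varImp(rf)   # variable importance scores from the fitted regression forest
plot(importance_boston)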
Summary
In this comprehensive R tutorial, you discovered the power and intuition behind random forest models, one of the most popular algorithms for predictive modeling of tabular data.
The key takeaways include:
- Random forests are ensemble methods that combine many decision trees to yield accurate predictions while avoiding overfitting inherent in single decision trees
- They require little data preprocessing and handle numerical plus categorical variables automatically
- Performance is fairly robust to tuning main hyperparameters like ntree and mtry compared to other machine learning algorithms
- Variable importance plots deliver straightforward interpretation to identify relationships in your data
- Use cases span classification and regression problems for structured data
So in your next modeling endeavour that involves tabular data, be sure to have random forests in your toolbox! Proper application of these techniques can elevate your predictive analytics to new heights.
To learn more about random forest models and practice hands-on in R, check out the following resources:
Tuning Random Forest Hyperparameters