Building Decision Trees in R: A Guide for Classification and Regression

Hi there! Decision trees are one of my favorite machine learning methods because they transform data into clear decision rules that anyone can understand.

In this comprehensive guide, we'll walk through examples of building both classification and regression trees in R using the rpart package. I'll explain all the core concepts along the way.

How Do Decision Trees Work?

Decision trees are predictive modelling approaches that map out decisions in a flowchart-like tree structure. By learning simple decision rules, they can predict target outcomes from data features.

The goal when constructing a tree is to maximize homogeneity or "purity" at the leaf nodes. For classification, this means maximizing the proportion of observations of the same class within each leaf. For regression, it means minimizing variance around the mean value in each leaf node.

But how do decision trees split up the feature space to achieve this high purity at leaves? They use metrics like Gini Impurity and Information Gain…

Gini impurity measures how often a randomly chosen observation in a node would be mislabeled if it were assigned a class at random according to the class proportions in that node. It is 0 for a pure node, and a higher value means more heterogeneity in labels.

Information gain quantifies how much a specific split reduces entropy (a measure of label disorder) compared with the parent node. It is used to decide which feature and value to split on at each step.
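To make the two metrics concrete, here is a minimal sketch in base R. The helper names gini() and entropy() are my own, chosen for illustration (they are not rpart functions), and the label vectors are made-up toy data.

```r
# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Entropy in bits: -sum(p * log2(p)), skipping empty classes
entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]
  -sum(p * log2(p))
}

mixed <- c("Yes", "Yes", "No", "No")  # 50/50 node (toy data)
pure  <- c("No", "No", "No", "No")    # single-class node (toy data)

gini(mixed)     # 0.5, the maximum for two classes
gini(pure)      # 0
entropy(mixed)  # 1 bit
entropy(pure)   # 0
```

Both measures agree on the extremes: zero for a pure node and largest when the classes are evenly mixed.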

Let's understand this through a simple example with just 2 features (Age and Fare) for 50 passengers:

Table 1: Sample Training Data

Passenger   Survived   Age   Fare
1           No         18    $100
2           No         26    $50
3           Yes        28    $200
...         ...        ...   ...

The root node has mixed survival, spanning all ages and fares.

To determine the optimal first split, information gain is calculated for splits on Age and Fare. The split with highest gain (most purity increase) is selected.

Let's assume Age <= 25 vs Age > 25 has an information gain of 0.15 while the gain from splitting on Fare is 0.05. So we split on Age first.

We repeat this recursive process on the child nodes, selecting optimal cuts until leaf nodes reach high purity or additional splits do not increase information gain.
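Here is a rough sketch of how the search for one split could look in base R. It is not how rpart implements it internally; info_gain() is a hypothetical helper, and the six ages and labels are made-up toy values rather than real Titanic records.

```r
# Entropy in bits for a vector of class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting on x <= threshold vs x > threshold
info_gain <- function(x, labels, threshold) {
  left   <- labels[x <= threshold]
  right  <- labels[x >  threshold]
  w_left <- length(left) / length(labels)
  entropy(labels) - (w_left * entropy(left) + (1 - w_left) * entropy(right))
}

# Made-up toy data
age      <- c(18, 26, 28, 40, 22, 35)
survived <- c("No", "No", "Yes", "Yes", "No", "Yes")

# Candidate thresholds: midpoints between consecutive sorted unique ages
sorted_age <- sort(unique(age))
candidates <- (head(sorted_age, -1) + tail(sorted_age, -1)) / 2

gains <- sapply(candidates, function(t) info_gain(age, survived, t))
best_threshold <- candidates[which.max(gains)]

info_gain(age, survived, 25)  # about 0.459
best_threshold                # 27, which separates the toy classes perfectly (gain of 1)
```

A full tree builder would run this search over every feature, pick the best feature and threshold, and then repeat on each child node.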

This forms hierarchical decision rules that segment the population based on feature patterns related to the outcome.

Now that you have some intuition, let's start building!

Step 1: Prepare the Titanic Data

For our first tree, we will model a classification problem – predicting whether or not a Titanic passenger survived from other attributes in the dataset.

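First I'll load libraries and data:

library(rpart)       # fits the trees
library(rpart.plot)  # draws them

# stringsAsFactors = TRUE reads text columns such as Sex as factors
titanic <- read.csv('http://bit.ly/kaggletrain', stringsAsFactors = TRUE)
str(titanic)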

This prints out the column names and data types. We have 12 columns (the Survived target plus 11 other fields) and 891 passengers.

Let's remove unnecessary fields and set Survived as a factor (categorical variable):

titanic$Survived <- as.factor(titanic$Survived) 

titanic <- subset(titanic, select=-c(PassengerId, Name, Cabin, Ticket))

I want to split the 891 passengers into a training set (70% of rows) to build the model and a test set (30% of rows) to evaluate model performance. This simulates how our model would be applied in real life.

set.seed(123)

# floor() makes the training size a whole number of rows (623 of 891)
train_index <- sample(seq_len(nrow(titanic)), size = floor(0.7 * nrow(titanic)))

train <- titanic[train_index, ] 
test <- titanic[-train_index, ]
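If you want to double-check the arithmetic of the split without downloading anything, a small sketch like this works. split_train_test() is a hypothetical helper of my own, and dummy is a stand-in data frame that only shares the row count (891) with the Titanic file.

```r
# Hypothetical helper: split a data frame into train/test by row fraction
split_train_test <- function(df, train_frac = 0.7, seed = 123) {
  set.seed(seed)
  n_train <- floor(train_frac * nrow(df))  # whole number of training rows
  idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[idx, , drop = FALSE],
       test  = df[-idx, , drop = FALSE])
}

# Stand-in data frame with the same number of rows as the Titanic file
dummy <- data.frame(id = seq_len(891))
parts <- split_train_test(dummy)

nrow(parts$train)                                 # 623
nrow(parts$test)                                  # 268
length(intersect(parts$train$id, parts$test$id))  # 0, no row lands in both sets
```

Keeping the two sets disjoint is the whole point: the test rows must stay unseen until evaluation.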

Let's inspect the training data using the str() and summary() functions. The str() output looks like this:

'data.frame':   623 obs. of  8 variables:  
 $ Survived: Factor w/ 2 levels "0","1": 1 2 1 1 2 1 2 2 1 2 ...  
 $ Pclass  : int  3 1 3 1 3 2 3 3 1 1 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 2 1 1 1 1 2 1 ... 
 $ Age     : num  2 38 31 ...
 $ SibSp   : int  1 1 0 1 0 0 0 1 0 1 ...
 $ Parch   : int  2 0 0 0 0 0 0 1 0 0 ...
 $ Fare    : num  41.1 120 7.75 55 16.7 ... 
 $ Embarked: Factor w/ 3 levels "C","Q","S": 2 3 2 2 2 3 2 2 2 2 ...

The training set has 623 passenger records while the test set has 268 records. Time to build our first tree!

Step 2: Building a Classification Tree

The rpart package in R provides the rpart() function for creating decision tree models.

Let's build a classification tree to predict the survived label:

tree_model <- rpart(Survived ~ .,    
                   data = train,  
                   method = "class")

The formula Survived ~ . specifies that we want to model the Survived variable based on all other remaining features in the train data.

We set method = "class" so rpart knows this is a classification problem.

By default, it uses Gini impurity to select optimal splitting points.
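If you would rather split on information gain, rpart accepts a parms argument for that. The sketch below uses a stand-in dataset so it runs without a download: it expands R's built-in Titanic contingency table into one row per person, which is an assumption of mine and has different columns (Class, Sex, Age, Survived) from the Kaggle file used in this guide.

```r
library(rpart)

# Stand-in data: expand the built-in Titanic table to one row per person
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

# Default criterion for classification trees: Gini impurity
fit_gini <- rpart(Survived ~ ., data = passengers, method = "class")

# Same model, but choosing splits by information (entropy)
fit_info <- rpart(Survived ~ ., data = passengers, method = "class",
                  parms = list(split = "information"))

print(fit_info)  # text listing of the fitted nodes
```

In practice the two criteria often pick very similar splits, so this is a knob to try rather than a guaranteed improvement.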

Let's visualize the tree structure using the rpart.plot package:

rpart.plot(tree_model)

[Figure: Decision Tree for Titanic Survival Classification]

The first split is on Sex, followed by splits on Age and Fare. This matches our understanding of survival trends – females had much better survival odds than males, and lifeboat access was also related to age, class and ticket fare.
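Besides the plot, you can read the same structure as text and ask the model which variables mattered most. This is a small sketch on stand-in data (R's built-in Titanic table expanded to one row per person, not the Kaggle file), so the variable names differ from the tree above.

```r
library(rpart)

# Stand-in data: expand the built-in Titanic table to one row per person
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

fit <- rpart(Survived ~ ., data = passengers, method = "class")

# Text listing: one line per node with its split rule, size and predicted class
print(fit)

# Named numeric vector; larger values mean the variable contributed more
fit$variable.importance
```

The text listing is handy when you need to paste the rules into a report or compare two trees line by line.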

Step 3: Evaluating Model Performance

To evaluate predictive performance, let's get Survival predictions for the test set passengers using the tree rules:

predictions <- predict(tree_model, test, type="class")

We can compare predictions against the actual outcomes:

table(predictions, test$Survived)

             0   1
         0  75  19
         1  14  69

Great: it correctly predicted death for 75 passengers and survival for 69. Rows are predictions and columns are actual outcomes, so it also wrongly predicted death for 19 passengers who survived and survival for 14 who died.

Confusion matrices like this provide full insight. We can compute metrics like accuracy:

accuracy <- sum(diag(table(predictions, test$Survived))) / nrow(test)
print(accuracy)

[1] 0.8097015

Our tree model achieves 81% overall accuracy. Not bad for our first run!
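Accuracy is only one way to read a confusion matrix. Here is a minimal sketch of precision and recall alongside it, computed on made-up toy labels (0 = died, 1 = survived). These are not results from the model above.

```r
# Made-up toy labels, for illustration only
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Rows are predictions, columns are actual outcomes
cm <- table(predicted = predicted, actual = actual)

tp <- cm["1", "1"]  # predicted survived, did survive
fp <- cm["1", "0"]  # predicted survived, actually died
fn <- cm["0", "1"]  # predicted died, actually survived

accuracy  <- sum(diag(cm)) / sum(cm)  # 5/8 = 0.625
precision <- tp / (tp + fp)           # 2/3
recall    <- tp / (tp + fn)           # 2/4 = 0.5
```

Precision tells you how trustworthy a "survived" prediction is, while recall tells you how many actual survivors the model found.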

But I think with some tuning, we can improve accuracy further…

Step 4: Tuning the Tree Model

The rpart control parameters allow us to tweak tree complexity.

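Pruning is the process of removing branches that fit noise in the training data rather than real structure. This results in a simpler tree that is better suited for generalization.

Let's cap the tree depth at 6 levels (maxdepth, default 30) and lower the complexity parameter (cp) from its default of 0.01 to 0.005. cp is the minimum relative improvement in fit that a split must deliver to be kept, so a smaller value prunes less aggressively and allows a larger tree: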

control <- rpart.control(cp = 0.005, maxdepth = 6)   

tree_model2 <- rpart(Survived ~ ., data = train,  
                    method="class", control=control)

Plotting this deeper tree shows the additional splits:

[Figure: Tuned Decision Tree]

Let's check the updated test accuracy…

Accuracy: 0.8280613

The tuned model achieves 83% accuracy, marginally better! A gain this small on 268 test rows could easily be down to the particular split, though, and the larger tree takes more computation and is less interpretable.

We have to strike a balance between depth/complexity and overfitting during parameter tuning. The optimal configuration can require some trial-and-error.
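One way to take some of the trial-and-error out of it is to let rpart's built-in cross-validation suggest a cp value, then prune back to it. This is a sketch on stand-in data (R's built-in Titanic table expanded to one row per person, not the Kaggle file), and picking the row with the lowest cross-validated error is just one common rule of thumb.

```r
library(rpart)
set.seed(123)  # cross-validation folds are assigned at random

# Stand-in data: expand the built-in Titanic table to one row per person
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

# Grow a deliberately large tree by setting a small cp
big_tree <- rpart(Survived ~ ., data = passengers, method = "class",
                  control = rpart.control(cp = 0.001))

# cptable has one row per candidate tree size; xerror is cross-validated error
cp_table <- big_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the big tree back to the size that cross-validated best
pruned_tree <- prune(big_tree, cp = best_cp)

nrow(big_tree$frame)     # number of nodes before pruning
nrow(pruned_tree$frame)  # number of nodes after pruning (never more than before)
```

printcp(big_tree) shows the same table in a friendlier format if you want to eyeball the trade-off yourself.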

Building a Regression Tree

For our second example, let's switch gears to a regression problem using the iris flower dataset.

Our goal here is predicting a continuous target variable (sepal width) instead of a categorical label.

Let's load the data and remove the species column:

data(iris)
iris <- subset(iris, select=-Species)
str(iris)  

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
        5.1         3.5          1.4         0.2
        4.9         3.0          1.4         0.2

The data has 4 numeric columns (three predictors plus the Sepal.Width target) and 150 rows representing flowers.

To build the regression tree, we just modify the method:

reg_model <- rpart(Sepal.Width ~ .,    
                  data = iris,
                  method = "anova") 

We use the method "anova" instead of "class" here.

Let's visualize this regression tree predicting sepal width values:

[Figure: Regression Tree Example]

The first split is on Petal Length, followed by Petal Width. rpart chooses these because splitting on them gives the largest reduction in Sepal Width variance within the resulting groups.

Leaf nodes show the mean Sepal Width value for observations filtered down each path.
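You can verify the "leaf equals mean" idea directly. The sketch below refits the same kind of model and compares each leaf's stored value against the average Sepal.Width of the training rows that landed in it. iris_num is just my own name for the species-free copy of the data.

```r
library(rpart)

iris_num  <- subset(iris, select = -Species)
reg_model <- rpart(Sepal.Width ~ ., data = iris_num, method = "anova")

# For each training row, the row of reg_model$frame holding its leaf node
leaf_id <- reg_model$where

# Average target value of the rows in each leaf
leaf_means <- tapply(iris_num$Sepal.Width, leaf_id, mean)

# Values the tree stores for those same leaves
tree_values <- reg_model$frame$yval[as.integer(names(leaf_means))]

all.equal(as.numeric(leaf_means), tree_values)  # TRUE
```

So a regression tree's prediction for a new flower is simply the training average of whichever leaf it falls into.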

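We can compute the mean squared error (MSE) to quantify prediction error. Note that this is measured on the same data the tree was trained on, so it understates the error you should expect on new flowers: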

Mean of squared residuals:  0.015
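For a fairer picture, you could hold out part of the data the same way we did for Titanic. Here is a minimal sketch: the mse() helper and the 70/30 split are my own choices for illustration, and I'm not claiming specific numbers since they depend on the random split.

```r
library(rpart)

iris_num <- subset(iris, select = -Species)

set.seed(123)
train_idx  <- sample(seq_len(nrow(iris_num)), size = floor(0.7 * nrow(iris_num)))
train_iris <- iris_num[train_idx, ]
test_iris  <- iris_num[-train_idx, ]

fit <- rpart(Sepal.Width ~ ., data = train_iris, method = "anova")

# Mean squared error between actual and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse_train <- mse(train_iris$Sepal.Width, predict(fit, train_iris))
mse_test  <- mse(test_iris$Sepal.Width,  predict(fit, test_iris))

# Baseline: always predict the training mean
mse_baseline <- mse(test_iris$Sepal.Width, mean(train_iris$Sepal.Width))

c(train = mse_train, test = mse_test, baseline = mse_baseline)
```

Comparing the test MSE against the mean-only baseline tells you whether the tree has learned anything useful beyond the average.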

And that's it! This demonstrates how flexible decision trees are for both classification and regression tasks in R.

Enhancing Performance with Ensembles

A key benefit of decision trees is that they form building blocks for powerful ensemble techniques like Random Forests and Gradient Boosted Trees.

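Random forests train a large number of deep, de-correlated trees, each on a bootstrap sample of the rows with a random subset of features considered at each split. The trees can be trained in parallel, and their predictions are averaged (or majority-voted for classification) to reduce variance and overfitting.

Gradient boosting trains trees sequentially, with each new tree fitted to the residual errors of the trees before it. The final model is the sum of all the trees' contributions.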

Such approaches frequently achieve state-of-the-art accuracy by combining multiple decision trees. The randomForest and xgboost packages provide these in R.

In practice, ensembles tend to outperform single decision trees significantly. But even then, plotting and analyzing the simpler trees gives insight into the patterns learned by the model.
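To get a feel for the ensemble idea using nothing but rpart, here is a sketch of plain bagging (bootstrap aggregation). To be clear, this is not a random forest: it skips the per-split feature sampling, and it is not how the randomForest package works internally. The data is a stand-in (R's built-in Titanic table expanded to one row per person), and n_trees = 25 is an arbitrary choice.

```r
library(rpart)
set.seed(42)

# Stand-in data: expand the built-in Titanic table to one row per person
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

# Hold out 30% of rows for evaluation
train_idx <- sample(seq_len(nrow(passengers)),
                    size = floor(0.7 * nrow(passengers)))
train_set <- passengers[train_idx, ]
test_set  <- passengers[-train_idx, ]

# Fit each tree on a bootstrap resample of the training rows
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- train_set[sample(nrow(train_set), replace = TRUE), ]
  rpart(Survived ~ ., data = boot, method = "class")
})

# One column of class votes per tree, one row per test passenger
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, test_set, type = "class"))
})

# Majority vote across the trees
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == test_set$Survived)  # hold-out accuracy of the ensemble
```

With only three categorical predictors the individual trees will look very alike, so don't expect a big lift here. The benefit of averaging shows up more on wider, noisier datasets.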

Hope you enjoyed this beginner-friendly crash course! Feel free to reach out if you have any other questions.

Happy learning!
