Stepwise Linear Regression in R: An Expert Analysis

Linear regression enables modeling linear relationships between a dependent variable and one or more independent variables. The regression coefficients quantify the impact of predictors on the response.

Stepwise regression offers an automated technique to identify statistically significant variables from a set of candidates to fit linear models. It is an attractive approach for dimensionality reduction in regression problems involving many potential predictors.

This comprehensive guide will empower you to effectively apply stepwise regression in R and objectively evaluate the strengths and limitations of the selected models based on expert analysis.

Setting the Context on Linear Regression

Let's briefly recap key concepts in linear regression analysis that set the foundation for stepwise modeling:

  • Ordinary Least Squares (OLS) is the predominant method used to estimate the regression coefficients ($\beta$). OLS minimizes the sum of squared residuals between actual and predicted response.
  • p-values assess the statistical significance of each predictor. Lower p-values indicate stronger confidence that the predictor impacts the response.
  • R-squared evaluates the model fit by computing the proportion of variation explained by the predictors. Its values range from 0 to 1, with higher being better.
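To make the OLS bullet concrete, the coefficient estimates have the closed form $\hat{\beta} = (X^TX)^{-1}X^Ty$, which we can verify by hand against lm() on the built-in mtcars data (a quick sketch, not part of the workflow below):

```r
# Closed-form OLS: regress mpg on wt by hand
X <- cbind(1, mtcars$wt)   # design matrix: intercept column + wt
y <- mtcars$mpg

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat
# Matches coef(lm(mpg ~ wt, data = mtcars))
```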

For example, let's fit a simple linear model on the mtcars data relating mileage (mpg) to weight (wt):

fit <- lm(mpg ~ wt, data = mtcars) 

summary(fit)
Coefficients: 
            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 37.28512    1.87795  19.858 < 2.2e-16 *** 
wt          -5.34447    0.5591   -9.559 1.29e-10 ***

Residual standard error: 4.902 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Here we observe that:

  • Weight (wt) is a statistically significant predictor of mileage (mpg) based on its low p-value
  • The R-squared of 0.75 implies that wt accounts for 75% of the variation in mpg
  • The negative coefficient suggests an inverse relationship between weight and fuel efficiency
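As a sanity check, for a single-predictor model the R-squared is simply the squared correlation between response and predictor:

```r
fit <- lm(mpg ~ wt, data = mtcars)

summary(fit)$r.squared         # ~0.753
cor(mtcars$mpg, mtcars$wt)^2   # same value
```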

This was just a simple bivariate regression. Now let's move on to stepwise regression methodology.

Introducing Stepwise Regression

Stepwise regression automates the model selection process from a set of candidate predictors:

  • Starts with a null model and tests the explanatory power of each candidate variable
  • Adds the most promising variable to the model in the first step
  • Considers the remaining variables to test whether they enhance model fit
  • Repeats the add/remove steps until no candidate meets the entry or exit criterion

The key highlights are:

  • Automates the search for predictive variables instead of manual checking
  • Classical implementations use p-value thresholds to add (e.g. 0.05) or remove (e.g. 0.10) variables; R's step() uses the AIC criterion instead
  • Can run backward, forward, or bidirectional selection
  • Fast way to explore predictive models involving many variables
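In R, step() scores candidate models by AIC rather than raw p-values. A single forward step can be previewed manually with add1(), which scores each one-variable addition; a minimal sketch on mtcars:

```r
# Start from the intercept-only (null) model
null.model <- lm(mpg ~ 1, data = mtcars)

# Score each one-variable addition; the candidate with the
# lowest AIC is what a forward step would add next
add1(null.model, scope = ~ wt + hp + qsec + am, test = "F")
```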

R provides the step() function for stepwise regression, which we will demonstrate next with examples.

Stepwise Regression in Action

Let's load the mtcars dataset and prep it for modeling:

library(dplyr)
data("mtcars") 

mtcars2 <- mtcars %>%  
  select(-c(carb, gear, cyl)) %>%
  mutate(hp = log(hp), wt = log(wt))  

We removed some variables for simplicity, and log-transformed hp and wt to reduce skew in their distributions.

To demonstrate stepwise regression, let's start with a full model containing all variables:

full.model <- lm(mpg ~ disp + hp + drat + wt + qsec + am + vs,  
                data = mtcars2)

Then invoke stepwise regression on full.model using step():

stepwise.model <- step(full.model)

Printing the summary shows the selected model:

Call:
lm(formula = mpg ~ wt + qsec + am, data = mtcars2)

Coefficients:
                   Estimate Std. Error    t value  Pr(>|t|)
(Intercept)        37.32130    1.87426   19.90815  9.81e-20
wt                 -5.05897    0.78585   -6.43740  1.31e-06
qsec                1.26155    0.37959    3.32250  2.33e-03
am                  2.93581    1.41090    2.08074  4.67e-02

Residual standard error: 2.415 on 28 degrees of freedom
Multiple R-squared:  0.8382, Adjusted R-squared:  0.8194 
F-statistic: 41.515 on 3 and 28 DF,  p-value: 5.594e-11

The key observations are:

  • Only 3 of the 7 candidate variables were retained by the AIC-based search
  • The directions and magnitudes of the coefficients are interpretable
  • Strong model fit, with an R-squared over 0.80

This demonstrates stepwise regression's capability to automatically select predictive variables from many candidates. Let's dig deeper into customizing and evaluating stepwise models.
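To confirm the simplification is justified, we can compare the full and selected models directly. AIC rewards fit while penalizing complexity (lower is better), and an F-test asks whether the dropped terms mattered:

```r
# Lower AIC favors the reduced model
AIC(full.model, stepwise.model)

# F-test of the dropped predictors; a large p-value
# supports the simpler model
anova(stepwise.model, full.model)
```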

Customizing Stepwise Models

We can customize stepwise models by tailoring:

Search Direction:

  • Backward: start with the full model, sequentially remove variables
  • Forward: start with the empty model, sequentially add variables
  • Bidirectional: both add and remove variables at each step

Selection Penalty:

  • The k argument of step() sets the penalty per parameter: k = 2 (the default) corresponds to AIC, while larger values favor smaller models

For example, backward stepwise with a stricter penalty:

step(full.model, direction = "backward", trace = FALSE,
     k = log(nrow(mtcars2)))

Here k is the per-parameter penalty in the selection criterion: the default k = 2 corresponds to AIC, while k = log(n) corresponds to the stricter BIC and typically yields smaller models.

We can also automate the process to efficiently evaluate different criteria:

directions <- c("both", "backward", "forward")
kvals <- c(2, 4)

for (dir in directions) {
  for (kval in kvals){

    model <- step(full.model, direction=dir, trace=FALSE, k=kval) 
    # Inspect the model selected under each direction and penalty
    print(summary(model)) 

  }
}

The ability to flexibly customize model search parameters empowers us to thoroughly explore predictive models. But we need rigorous quantitative diagnostics to judge model quality.

Judging Model Adequacy

While stepwise provides a convenient workflow for automated modeling, the selected models should be critically examined before drawing any inferences.

Key diagnostics checks include:

  • Residual analysis: residuals should be randomly scattered with no pattern
  • Influence analysis: check for outliers and high-leverage points skewing the model
  • Multicollinearity: examine correlations among predictors
  • Model assumptions: homoscedasticity, linearity, normality, etc.
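These checks map onto standard R tooling. A minimal sketch follows; the variance inflation factor step assumes the car package is installed:

```r
fit <- lm(mpg ~ wt + qsec + am, data = mtcars2)

# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Multicollinearity check: VIF values above roughly 5-10 signal concern
# (requires the car package)
library(car)
vif(fit)
```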

For example, a residuals vs fitted plot to assess model adequacy:

fit <- lm(mpg ~ wt + qsec + am, data = mtcars2)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")

Residuals vs Fitted

A random scatter suggests the model assumptions are reasonably well met; any visible structure would call for a remedy, such as a transformation or additional terms.

Similarly, validating the other diagnostics builds confidence in the model's adequacy and guards against misleading inferences.

Comparison with All Subsets Regression

An alternative modeling approach is all subsets regression, which fits linear models on all possible predictor combinations:

For example with variables x1, x2, x3:

  • (i) x1
  • (ii) x2
  • (iii) x3
  • (iv) x1 + x2
  • (v) x1 + x3
  • (vi) x2 + x3
  • (vii) x1 + x2 + x3

That's 7 models from 3 variables, against the single model stepwise returns.

Let's compare with regsubsets() from the leaps package:

library(leaps)

full.form <- mpg ~ disp + hp + drat + wt + qsec + am + vs
reg.sub <- regsubsets(full.form, data = mtcars2)

summary(reg.sub)

The output shows the best model of each size; the best four-variable model, mpg ~ wt + qsec + am + disp, has the highest adjusted R-squared.

So exploring all subsets can potentially find better models than stepwise. But it evaluates a combinatorially large number of models, becoming infeasible for, say, 10+ variables.
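The combinatorial growth is easy to quantify: p candidate predictors yield 2^p - 1 non-empty subsets:

```r
p <- c(3, 7, 10, 20)
data.frame(predictors = p, models = 2^p - 1)
#>   predictors  models
#> 1          3       7
#> 2          7     127
#> 3         10    1023
#> 4         20 1048575
```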

In practice there is likely a sweet spot between these two extremes. Being aware of alternative techniques is key to not using stepwise blindly.

Limitations of Stepwise Models

While stepwise regression provides an efficient automated workflow, a few inherent limitations should be recognized:

  • The order of variable entry can change which model is selected
  • Post-selection p-values are overly optimistic, since the same data chose the model
  • Overfitting to spurious patterns in the dataset
  • No guarantee of out-of-sample predictive performance
  • Possible violations of underlying OLS assumptions
  • Biased coefficients and confidence intervals

For example, changing the candidate set (here, adding a squared term) can yield a different selected model:

m1 <- step(full.model) 

m2 <- step(update(full.model, . ~ . + I(wt^2)))  

So the results have an arbitrary element, and selected models shouldn't be accepted as gospel truth. Supplementing them with domain expertise and other technical analyses is vital.
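One practical guard against the overfitting and out-of-sample concerns above is a simple holdout check: select the model on training rows only, then score it on held-out rows. A sketch, with an illustrative seed and split:

```r
set.seed(42)  # illustrative seed
idx   <- sample(nrow(mtcars2), size = 24)  # roughly 75% for training
train <- mtcars2[idx, ]
test  <- mtcars2[-idx, ]

# Select a model on the training rows only
m <- step(lm(mpg ~ ., data = train), trace = FALSE)

# Evaluate on held-out rows: root-mean-squared prediction error
sqrt(mean((test$mpg - predict(m, newdata = test))^2))
```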

Putting It All Together

We walked through various elements of building stepwise regression models in R:

  • Fundamentals: Recapped key concepts in linear regression analysis
  • Introduction: Discussed rationale for stepwise modeling
  • In Action: Demonstrated hands-on example of using stepwise regression
  • Customizing: Showed how to tailor model search parameters
  • Diagnostics: Emphasized importance of evaluating model adequacy
  • Comparisons: Contrasted with all subsets regression
  • Limitations: Presented common pitfalls objectively

The key takeaway is being an informed user of stepwise regression. Leverage it for automated modeling, but customize sensibly and diagnose rigorously before relying on the models.

Combining its efficiency with complementary techniques that mitigate its limitations will let you deploy it most effectively in practice.
