Chapter 2: When coefficients need restraint

Regularization answers a common Day 1 problem: your model fits the training data very well, but its coefficients become unstable when predictors are correlated or numerous.

The idea is to trade a bit of bias for a big gain in stability. Instead of letting coefficients grow freely, we constrain them so the model is less sensitive to noise.
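The trade can be made concrete with a small simulation. This is a minimal sketch in base R, not part of the course pipeline: the data are simulated, the names fit_once, res, and sds are hypothetical, and lambda = 5 is an arbitrary penalty chosen for illustration. It refits ordinary least squares and the closed-form ridge estimate on many fresh samples of two nearly identical predictors and compares how much the first coefficient moves from sample to sample.

```r
set.seed(1)
n      <- 50    # observations per simulated data set
reps   <- 200   # number of simulated data sets
lambda <- 5     # arbitrary ridge penalty (assumption, for illustration only)

fit_once <- function() {
  z  <- rnorm(n)
  x1 <- z + rnorm(n, sd = 0.05)   # x1 and x2 are almost the same variable
  x2 <- z + rnorm(n, sd = 0.05)
  y  <- x1 + x2 + rnorm(n)
  X  <- scale(cbind(x1, x2))      # standardize predictors
  y  <- y - mean(y)               # center the outcome, so no intercept is needed
  ols   <- solve(crossprod(X), crossprod(X, y))
  ridge <- solve(crossprod(X) + lambda * diag(2), crossprod(X, y))
  c(ols = ols[1], ridge = ridge[1])   # keep the first coefficient of each fit
}

res <- t(replicate(reps, fit_once()))
sds <- apply(res, 2, sd)   # spread of the first coefficient across data sets
sds
```

The ridge column should vary far less than the OLS column. The price is that the ridge estimates are pulled toward zero, which is the bias being traded away.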


Ridge, lasso, and elastic net in words

Ridge (\(L_2\)) shrinks all coefficients toward zero but usually keeps them non-zero. Lasso (\(L_1\)) can shrink some coefficients all the way to zero, which gives a sparse model. Elastic net blends both behaviors.
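The three penalties can be written as one objective. The form below follows the parameterization documented for glmnet; the symbols are notation introduced here, with \(\ell\) the log-likelihood, \(n\) the number of observations, \(\beta_0\) the intercept, \(\lambda \ge 0\) the penalty strength, and \(\alpha \in [0, 1]\) the mixing weight.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\,\ell(\beta_0, \beta)
  \;+\; \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
  \;+\; \alpha\,\lVert \beta \rVert_1 \right]
```

Setting \(\alpha = 0\) leaves only the squared \(L_2\) term (ridge), \(\alpha = 1\) leaves only the \(L_1\) term (lasso), and values in between give elastic net. The intercept \(\beta_0\) is not penalized.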

On gene-style tasks with many correlated features, shrinkage often gives more robust out-of-sample behavior than pure subset search.

Where AIC differs

AIC stepwise is a discrete search over included/excluded variables. Lasso is a continuous shrinkage path controlled by the penalty strength \(\lambda\). Both can be useful, but they answer slightly different optimization questions.

In class, comparing AIC and lasso on the same gene data is valuable because students can see that “best model” depends on the search strategy and penalty structure.
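The contrast between the two searches can be sketched on a small built-in data set. This is an illustration under stated assumptions, not the class exercise: mtcars stands in for the gene data, the outcome is continuous rather than binary, and the names aic_fit, aic_vars, and lasso_path are hypothetical.

```r
library(glmnet)

# Discrete search: backward stepwise selection by AIC from the full linear model
full     <- lm(mpg ~ ., data = mtcars)
aic_fit  <- step(full, direction = "backward", trace = 0)
aic_vars <- setdiff(names(coef(aic_fit)), "(Intercept)")
aic_vars   # one final set of included variables

# Continuous path: lasso fits over a decreasing grid of lambda values
x <- scale(as.matrix(mtcars[, -1]))   # all columns except mpg, standardized
y <- mtcars$mpg
lasso_path <- glmnet(x, y, alpha = 1)

# Number of non-zero slopes at each lambda along the path
path_summary <- data.frame(lambda = lasso_path$lambda, n_nonzero = lasso_path$df)
head(path_summary)
```

Stepwise AIC returns a single variable set, while the lasso returns a whole family of models indexed by lambda, and picking one (for example by cross-validation) is a separate step. That is one reason the two "best" models need not agree.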

Minimal R example

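Same predictors as the pipeline modules: Adelie vs Gentoo, with bill measurements, island, sex, and year. Call scale() on the predictors before glmnet, because the penalties are not invariant to units. (glmnet also standardizes internally by default, standardize = TRUE, so the explicit scale() mainly makes the step visible.)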

library(glmnet)
library(palmerpenguins)
library(dplyr)

# Keep two species and code the outcome as 1 = Gentoo, 0 = Adelie
peng <- palmerpenguins::penguins |>
  filter(species %in% c("Adelie", "Gentoo")) |>
  mutate(y = as.integer(species == "Gentoo")) |>
  select(bill_length_mm, bill_depth_mm, island, sex, year, y) |>
  tidyr::drop_na()

# Dummy-code the factors; [, -1] drops the intercept column (glmnet adds its own)
x <- model.matrix(y ~ bill_length_mm + bill_depth_mm + island + sex + year, peng)[, -1]
y <- peng$y
x_s <- scale(x)

# cv.glmnet assigns folds at random; call set.seed() first for reproducible lambdas
cv_ridge <- cv.glmnet(x_s, y, family = "binomial", alpha = 0, nfolds = 5)
cv_lasso <- cv.glmnet(x_s, y, family = "binomial", alpha = 1, nfolds = 5)

# Compare the 1-SE lambda and the number of non-zero slopes (intercept excluded)
data.frame(
  model = c("Ridge (alpha=0)", "Lasso (alpha=1)"),
  lambda_1se = c(cv_ridge$lambda.1se, cv_lasso$lambda.1se),
  n_nonzero = c(
    sum(coef(cv_ridge, s = "lambda.1se") != 0) - 1L,
    sum(coef(cv_lasso, s = "lambda.1se") != 0) - 1L
  )
)

            model  lambda_1se n_nonzero
1 Ridge (alpha=0) 0.045615242         6
2 Lasso (alpha=1) 0.001183727         4

Try alpha = 0.5 for elastic net on the same x_s, y.
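One way to run that exercise is sketched below. It rebuilds x_s and y with base R so the block runs on its own, and it passes the same fold assignment to every fit so that differences between alpha values are not just fold noise. The seed and the names foldid, alphas, fits, and enet_summary are assumptions made for this sketch, and the numbers it prints will differ from the table above because the folds differ.

```r
library(glmnet)
library(palmerpenguins)

# Rebuild the design matrix from the example above (base R only)
peng <- subset(
  as.data.frame(palmerpenguins::penguins),
  species %in% c("Adelie", "Gentoo"),
  select = c(species, bill_length_mm, bill_depth_mm, island, sex, year)
)
peng <- na.omit(peng)
y    <- as.integer(peng$species == "Gentoo")
x    <- model.matrix(~ bill_length_mm + bill_depth_mm + island + sex + year, peng)[, -1]
x_s  <- scale(x)

# Fix one 5-fold assignment and reuse it for every alpha (arbitrary seed)
set.seed(1)
foldid <- sample(rep(1:5, length.out = nrow(x_s)))

alphas <- c(ridge = 0, enet = 0.5, lasso = 1)
fits <- lapply(alphas, function(a) {
  cv.glmnet(x_s, y, family = "binomial", alpha = a, foldid = foldid)
})

# 1-SE lambda and number of non-zero slopes (intercept excluded) per alpha
enet_summary <- data.frame(
  alpha      = alphas,
  lambda_1se = sapply(fits, function(f) f$lambda.1se),
  n_nonzero  = sapply(fits, function(f) sum(coef(f, s = "lambda.1se") != 0) - 1L)
)
enet_summary
```

With alpha = 0.5 the fit can still drop some coefficients entirely, but correlated predictors tend to be kept or dropped together rather than having one picked arbitrarily.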

Tip

Regularization is not a magic upgrade; it is a disciplined compromise. You intentionally bias coefficients a little so predictions and selected features are less brittle when data changes.

Lecture: Day 1 (Monday) slides
Notebook: linear-regression-lasso.Rmd
Next: Chapter 3 — Rules and trees