Chapter 5: Stronger learners, same discipline

After Tuesday’s single-tree workflow, Thursday introduces stronger learners: random forests, boosting, and small neural networks. The core teaching point is not “new model, new universe.” It is “new model, same discipline.”

You still keep preprocessing, resampling, and scoring rules fixed. Only the learner specification changes.



What changes and what does not

What stays fixed:

  • the scientific question,
  • the recipe logic,
  • the fold structure,
  • the chosen metrics.

What changes:

  • the parsnip specification (rand_forest(), boost_tree(), mlp()),
  • the tuning parameters,
  • the bias-variance behavior.
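
To make "only the learner specification changes" concrete, here is a minimal sketch of the three specifications with a few arguments flagged for tuning. The choice of which arguments to tune, the fixed values (trees = 500, epochs = 200), and the "nnet" engine for the MLP are illustrative assumptions, not settings taken from the lecture.

```r
library(tidymodels)

# Random forest: tune the number of candidate predictors per split and the
# minimum node size; the number of trees is fixed (assumed value).
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# Boosting: tune the number of trees, their depth, and the learning rate.
xgb_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Small MLP: tune the hidden layer size and the weight decay penalty.
mlp_spec <- mlp(hidden_units = tune(), penalty = tune(), epochs = 200) |>
  set_engine("nnet") |>
  set_mode("classification")

# List the arguments flagged for tuning in each specification.
tunable <- lapply(
  list(rf = rf_spec, xgb = xgb_spec, mlp = mlp_spec),
  extract_parameter_set_dials
)
tunable
```

Everything outside these three objects (recipe, folds, metrics) can stay exactly as it was for the single tree.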

Three common families in plain language

Random forest: average many decorrelated trees to reduce variance.
Boosting: add trees sequentially so each new tree focuses on previous errors.
Small MLP: learn layered nonlinear combinations of the predictors; it can work well, but on small tabular data it usually needs more careful tuning than the tree ensembles.
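
Why does "decorrelated" matter for the random forest? If B predictors each have variance 1 and pairwise correlation rho, their average has variance rho + (1 - rho) / B, so adding trees only shrinks the second term. The base-R simulation below is a rough sketch of that arithmetic using Gaussian noise as a stand-in for tree predictions; the function name avg_variance and the values of rho and B are made up for illustration.

```r
set.seed(1)

# Simulate B unit-variance "tree predictions" with pairwise correlation rho,
# then return the variance of their average across n_sim repetitions.
avg_variance <- function(rho, B, n_sim = 20000) {
  shared <- rnorm(n_sim)                          # component common to all trees
  own <- matrix(rnorm(n_sim * B), nrow = n_sim)   # component unique to each tree
  trees <- sqrt(rho) * shared + sqrt(1 - rho) * own
  var(rowMeans(trees))
}

B <- 50
empirical <- c(
  high_corr = avg_variance(0.6, B),
  low_corr  = avg_variance(0.1, B)
)
theory <- c(
  high_corr = 0.6 + (1 - 0.6) / B,
  low_corr  = 0.1 + (1 - 0.1) / B
)

# Compare simulated and theoretical variance of the average.
print(rbind(empirical, theory))
```

With highly correlated trees, the average keeps most of the variance of a single tree no matter how large B is. That is the motivation for sampling predictors at each split.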

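None of these models removes the need for honest evaluation. With small data, a tiny difference between two scores is often within fold-to-fold noise, so report every comparison with an uncertainty estimate, such as a standard error across folds.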
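
One simple way to attach uncertainty to a comparison is to look at the per-fold difference between two models scored on the same folds. The helper below is a sketch in base R. The name paired_fold_diff is hypothetical, the fold accuracies are made-up numbers used only to exercise the function, and the t-interval is a rough guide because cross-validation folds are not fully independent.

```r
# Paired comparison of two models evaluated on the same folds.
# Returns the mean per-fold difference, its standard error, and an
# approximate t-interval at the requested level.
paired_fold_diff <- function(score_a, score_b, level = 0.95) {
  stopifnot(length(score_a) == length(score_b), length(score_a) > 1)
  d <- score_a - score_b
  k <- length(d)
  se <- sd(d) / sqrt(k)
  half_width <- qt(1 - (1 - level) / 2, df = k - 1) * se
  c(
    mean_diff = mean(d),
    std_err   = se,
    lower     = mean(d) - half_width,
    upper     = mean(d) + half_width
  )
}

# Made-up fold accuracies for two hypothetical models (not real results).
acc_a <- c(0.90, 0.92, 0.88)
acc_b <- c(0.89, 0.93, 0.88)

out <- paired_fold_diff(acc_a, acc_b)
print(out)
```

If the interval comfortably includes zero, the two models are not distinguishable on this evidence, whatever the ranking of their mean scores.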


Same workflow, different model spec

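library(tidymodels)

# `y` and `train` are the outcome and training set defined earlier in the notebook.
rec_ext <- recipe(y ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  # PCA is scale-sensitive, so standardize the numeric predictors first.
  step_normalize(all_numeric_predictors()) |>
  # num_comp is a ceiling: fewer components are kept if fewer numeric predictors exist.
  step_pca(all_numeric_predictors(), num_comp = 10) |>
  # xgboost needs numeric inputs; encode factors after PCA so dummies stay out of it.
  step_dummy(all_nominal_predictors())

# Random forest: shared recipe, ranger engine.
wf_rf <- workflow() |>
  add_recipe(rec_ext) |>
  add_model(rand_forest() |> set_engine("ranger") |> set_mode("classification"))

# Boosted trees: same recipe, only the model spec differs.
wf_xgb <- workflow() |>
  add_recipe(rec_ext) |>
  add_model(boost_tree() |> set_engine("xgboost") |> set_mode("classification"))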
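
The same pattern extends to the small MLP. The sketch below is one plausible version, not the notebook's code. It assumes the penguins data from modeldata with species as the outcome, picks the "nnet" engine, and uses illustrative values for hidden_units, penalty, and epochs. The imputation steps are unchanged, while scaling and dummy encoding are added because a neural network needs numeric inputs on a common scale.

```r
library(tidymodels)

# Assumption: modeldata's penguins data stands in for the notebook's
# training set, with species as the outcome.
data(penguins, package = "modeldata")
train <- penguins

rec_mlp <- recipe(species ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  # Put numeric predictors on a common scale before the network sees them.
  step_normalize(all_numeric_predictors()) |>
  # Encode factors as 0/1 indicator columns.
  step_dummy(all_nominal_predictors())

# Illustrative hyperparameter values; in practice these would be tuned.
wf_mlp <- workflow() |>
  add_recipe(rec_mlp) |>
  add_model(
    mlp(hidden_units = 5, penalty = 0.01, epochs = 200) |>
      set_engine("nnet") |>
      set_mode("classification")
  )

wf_mlp
```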

Tip

The most transferable skill is not memorizing model families. It is learning to compare models fairly: same data pipeline, same folds, same score sheet, explicit uncertainty.
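
A fair comparison along these lines might look like the sketch below: one set of folds and one metric set, reused for every candidate. The data (modeldata's penguins with species as the outcome), the seed, the number of folds, the metrics, and the fixed tree counts are all assumptions made for illustration. The notebook may use different choices.

```r
library(tidymodels)

# Assumption: modeldata's penguins data with species as the outcome.
data(penguins, package = "modeldata")
set.seed(2024)
split <- initial_split(penguins, strata = species)
train <- training(split)

# Same folds and same score sheet for every model.
folds <- vfold_cv(train, v = 5, strata = species)
scores <- metric_set(accuracy, roc_auc)

# Same data pipeline for every model.
rec <- recipe(species ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors())

# Helper (hypothetical name): wrap a model spec in the shared recipe.
make_wf <- function(spec) {
  workflow() |> add_recipe(rec) |> add_model(spec)
}

candidates <- list(
  rf = make_wf(
    rand_forest(trees = 500) |> set_engine("ranger") |> set_mode("classification")
  ),
  xgb = make_wf(
    boost_tree(trees = 200) |> set_engine("xgboost") |> set_mode("classification")
  )
)

# Resample every candidate on the identical folds with identical metrics.
results <- lapply(candidates, fit_resamples, resamples = folds, metrics = scores)

# One row per model and metric, with the mean and its standard error.
comparison <- bind_rows(lapply(results, collect_metrics), .id = "model")
comparison
```

Reading the mean and std_err columns side by side is the "explicit uncertainty" part: a gap smaller than the standard errors is not much evidence for either model.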


Lecture: Day 4 (Thursday) Part B
Notebook: penguins-classification.Rmd
Next: Chapter 6 — Scores that match the question