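library(tidymodels)  # attaches recipes, parsnip, and workflows

# Shared preprocessing: impute missing values, then compress numeric predictors with PCA
rec_ext <- recipe(y ~ ., data = train) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 10)

# Random forest workflow: same recipe, ranger engine
wf_rf <- workflow() |>
  add_recipe(rec_ext) |>
  add_model(rand_forest() |> set_engine("ranger") |> set_mode("classification"))

# Boosted trees workflow: same recipe, xgboost engine
wf_xgb <- workflow() |>
  add_recipe(rec_ext) |>
  add_model(boost_tree() |> set_engine("xgboost") |> set_mode("classification"))

Chapter 5: Stronger learners, same discipline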
After Tuesday’s single-tree workflow, Thursday introduces stronger learners: random forests, boosting, and small neural networks. The core teaching point is not “new model, new universe.” It is “new model, same discipline.”
You still keep preprocessing, resampling, and scoring rules fixed. Only the learner specification changes.
What changes and what does not
What stays fixed:
- the scientific question,
- the recipe logic,
- the fold structure,
- the chosen metrics.
What changes:
- the parsnip specification (rand_forest(), boost_tree(), mlp()),
- the tuning parameters,
- the bias-variance behavior.
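The split between fixed and changing parts can be sketched in a few lines. This is a minimal illustration, not the course notebook: it assumes the built-in iris data as a stand-in, an arbitrary seed, 5 folds, and a simple normalizing recipe. The names folds, metrics, rec, and specs are illustrative.

```r
library(tidymodels)

set.seed(2024)  # arbitrary seed so the folds are reproducible

# Fixed parts (assumption: iris stands in for the real training data)
folds   <- vfold_cv(iris, v = 5, strata = Species)   # fold structure
metrics <- metric_set(accuracy, roc_auc)             # chosen metrics
rec     <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors())           # recipe logic

# Changing part: only the parsnip specification differs between models
specs <- list(
  rf  = rand_forest(trees = 200) |> set_engine("ranger") |> set_mode("classification"),
  xgb = boost_tree(trees = 200) |> set_engine("xgboost") |> set_mode("classification")
)

# Every spec is evaluated with the same recipe, folds, and metrics
results <- lapply(specs, function(spec) {
  workflow() |>
    add_recipe(rec) |>
    add_model(spec) |>
    fit_resamples(resamples = folds, metrics = metrics)
})

lapply(results, collect_metrics)
```

Because folds and metrics are created once and reused, any difference in the summaries comes from the learner, not from a different split or score.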
Three common families in plain language
Random forest: average many decorrelated trees to reduce variance.
Boosting: add trees sequentially so each new tree focuses on previous errors.
Small MLP: learn layered nonlinear combinations; can work well, but usually needs more tuning care on small tabular data.
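As a rough sketch, the three families could be written as parsnip specifications like the ones below. The choice of which arguments to mark with tune(), the fixed trees = 500, and the engines ("ranger", "xgboost", "nnet") are assumptions for illustration, not settings taken from the lecture.

```r
library(tidymodels)

# Random forest: tune predictors sampled per split and minimum node size
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

# Boosting: tune the number of trees, their depth, and the learning rate
xgb_spec <- boost_tree(trees = tune(), tree_depth = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("classification")

# Small MLP: tune hidden units, weight decay (penalty), and training epochs
mlp_spec <- mlp(hidden_units = tune(), penalty = tune(), epochs = tune()) |>
  set_engine("nnet") |>
  set_mode("classification")

# List the parameters each specification marks for tuning
extract_parameter_set_dials(rf_spec)
extract_parameter_set_dials(xgb_spec)
extract_parameter_set_dials(mlp_spec)
```

The MLP exposes three interacting knobs even in this small form, which is one way to see why it tends to need more tuning care.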
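None of these models removes the need for honest evaluation. With small data, a tiny score difference between two models is often within fold-to-fold noise, so report it with an uncertainty estimate (for example, the standard error across folds) before calling one model better.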
Same workflow, different model spec
The most transferable skill is not memorizing model families. It is learning to compare models fairly: same data pipeline, same folds, same score sheet, explicit uncertainty.
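One way to make the uncertainty explicit is to compare two models fold by fold. The helper below is a hypothetical base R sketch (paired_fold_diff is not a tidymodels function). It assumes both score vectors come from the same folds in the same order, and it treats folds as independent, which they are not quite, so the interval is a rough guide rather than a formal test.

```r
# Paired comparison of per-fold scores from two models evaluated on the same folds.
# Returns the mean difference, its standard error, and an approximate t interval.
paired_fold_diff <- function(score_a, score_b, level = 0.95) {
  stopifnot(length(score_a) == length(score_b), length(score_a) >= 2)
  d    <- score_a - score_b                      # per-fold differences
  k    <- length(d)
  se   <- sd(d) / sqrt(k)                        # standard error of the mean difference
  half <- qt(1 - (1 - level) / 2, df = k - 1) * se
  c(mean_diff = mean(d), std_err = se,
    lower = mean(d) - half, upper = mean(d) + half)
}

# Made-up per-fold accuracies, used only to show the call
acc_a <- c(0.90, 0.93, 0.87, 0.93, 0.90)
acc_b <- c(0.93, 0.90, 0.90, 0.90, 0.93)
paired_fold_diff(acc_a, acc_b)
```

If the interval comfortably includes zero, the data do not support ranking one model above the other. With tidymodels results, per-fold scores of this kind can be pulled with collect_metrics(summarize = FALSE).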
Lecture: Day 4 (Thursday) Part B
Notebook: penguins-classification.Rmd
Next: Chapter 6 — Scores that match the question