Chapter 4: A reproducible modeling workflow

This chapter turns model building into a repeatable workflow. Instead of fitting, preprocessing, and scoring in disconnected scripts, we keep the full pipeline in one object so comparisons stay fair.

The teaching task is Adelie vs Gentoo classification. We use bill measures, island, sex, and year; we leave out flipper length and body mass so tuning remains informative rather than trivially high-scoring.

How to read this page

You already know lm and glm. Here we extend that mental model:

  • A recipe is preprocessing.
  • A model specification declares the learner.
  • A workflow binds them.
  • tune_grid() explores settings with CV.
  1. First: one train/test split and last_fit() to show the smallest honest flow.
  2. Then: the canonical Day 2 tuned tree include (shared with slides) for the full pattern.

← Reading companion · Back: Chapter 3 · Next: Chapter 5


First: one split, one model, no tuning (last_fit)

Day 2 (Tuesday) slides focus on cross-validation + tuning (faster to teach one rhythm in the room). If you want the smallest end-to-end tidymodels example first — hold out test rows, fit on training only, then score — work through this block.

Predictors (both sections): bill length & depth, island, sex, year — we drop flipper length and body mass so the tuned tree section below is not “everything wins at 99%.” The classification notebook still uses full morphology for classical glm / rpart plots.

Packages: tidymodels, palmerpenguins, dplyr, tidyr.

library(tidymodels)
library(palmerpenguins)
library(dplyr)
library(tidyr)

peng_cls <- palmerpenguins::penguins |>
  dplyr::filter(species %in% c("Adelie", "Gentoo")) |>
  dplyr::mutate(
    y = factor(species, levels = c("Adelie", "Gentoo")),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species, -flipper_length_mm, -body_mass_g) |>
  tidyr::drop_na()
set.seed(11)
p_split <- initial_split(peng_cls, prop = 0.75, strata = y)
train <- training(p_split)
test <- testing(p_split)

count(train, y, name = "n_train")
# A tibble: 2 × 2
  y      n_train
  <fct>    <int>
1 Adelie     109
2 Gentoo      89
count(test, y, name = "n_test")
# A tibble: 2 × 2
  y      n_test
  <fct>   <int>
1 Adelie     37
2 Gentoo     30
rec_tt <- recipe(y ~ ., data = train) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())
mod_glm <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

wf_tt <- workflow() |>
  add_recipe(rec_tt) |>
  add_model(mod_glm)
fit_split <- last_fit(wf_tt, split = p_split, metrics = metric_set(accuracy, roc_auc))
→ A | warning: glm.fit: algorithm did not converge
→ B | warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
There were issues with some computations   A: x1   B: x1
There were issues with some computations   A: x1   B: x1
collect_metrics(fit_split)
# A tibble: 2 × 4
  .metric  .estimator .estimate .config        
  <chr>    <chr>          <dbl> <chr>          
1 accuracy binary             1 pre0_mod0_post0
2 roc_auc  binary             1 pre0_mod0_post0
collect_predictions(fit_split) |>
  conf_mat(truth = y, estimate = .pred_class)
          Truth
Prediction Adelie Gentoo
    Adelie     37      0
    Gentoo      0     30

When you continue to the canonical pipeline below, the include redefines peng_cls in a self-contained way (so slide chunks remain portable). The modeling grammar stays the same; only the resampling and tuning machinery becomes richer.


Canonical pipeline (Day 2 slides + tuning)

The next block is the same canonical walkthrough used in the Day 2 lecture deck. Keeping one shared include means the chapter and slides do not drift apart.

Palmer Penguins: one table, one model family (trees)

  • Outcome y: Adelie vs Gentoo (complete cases).
  • Predictors: bill length & depth, island, sex, year — no flipper or body mass (so tree_depth and min_n show a real tuning tradeoff).
  • Today’s engine: decision_tree() + rpart only — forests and boosting come on Thursday.

The pipeline in principle

No code yet — the shape of every tidymodels project this week:

flowchart LR
  data[Raw table] --> recipe[recipe prep bake]
  recipe --> spec[parsnip spec]
  spec --> wf[workflow]
  folds[vfold_cv folds] --> tune[tune_grid]
  metrics[metric_set] --> tune
  wf --> tune
  tune --> best[select_best finalize]
  best --> fit[fit predict]
  fit --> out[Plots and tables]
  • Recipe steps run inside each CV fold — never scale or encode on the full table before splitting (avoids leakage).
  • recipe / spec / workflow are blueprints; only fit(), fit_resamples(), and tune_grid() touch rows.
  • Today’s score: accuracy only; Thursday adds metrics for imbalanced data.

Five steps — what each one does

  1. Recipe — declare preprocessing steps (drop useless columns, dummy-code categories, scale numerics). prep() learns rules on training rows; bake() applies them.
  2. Spec — choose model type, engine, mode, and which arguments to tune. Still a blueprint — not fitted.
  3. Workflow — glue recipe + spec so one fit() runs preprocessing and training in the right order.
  4. Metrics — define how to judge predictions (accuracy today).
  5. Resample and tune — search hyperparameters on vfold_cv folds, pick the best, then fit for plots and interpretation.

Packages in the pipeline

Package Job
rsample Training vs assessment rows (initial_split, vfold_cv)
recipes Preprocessing checklist (dummy, scale, …)
parsnip Model blueprint (decision_tree, set_engine("rpart"))
workflows Glue recipe + model
tune Search tree_depth and min_n
yardstick Scores — accuracy only today; more metrics Thursday

Step 1 — Preprocessing (recipe)

Turn raw columns into a model-ready table — in order, and only on training rows inside each fold.

Step 1 — Why a recipe?

  • A recipe is a checklist of transforms applied before the model sees the data.
  • Steps run top to bottom; order matters (e.g. dummy-code before scaling).
  • prep() learns parameters on training data; bake() applies them — a workflow() does both automatically inside CV.

Start the recipe

rec <- recipe(y ~ ., data = peng_bal)
rec
  • recipe(y ~ ., data = peng_bal)y is the outcome; . = all other columns are predictors.
  • Blueprint only — nothing transformed yet.

step_zv() — drop zero-variance columns

rec <- rec |>
  step_zv(all_predictors())
rec
  • Removes columns with one unique value (no predictive power; can break algorithms in small folds).

step_dummy() — encode categories

rec <- rec |>
  step_dummy(all_nominal_predictors())
rec
  • island and sex → numeric 0/1 columns for rpart.

step_normalize() — scale numeric predictors

rec <- rec |>
  step_normalize(all_numeric_predictors())
rec
  • Z-score each numeric column using training-fold rows only inside CV.
  • Good habit before you swap in other model types later in the week.
rec_base <- rec
rec_base

prep() and bake() — learn rules, then apply

  • prep(recipe, training = train) estimates each step on training rows only.
  • bake(prep, new_data = ...) applies those rules without re-learning (train, test, or new rows).
  • A workflow() runs prep + bake + fit inside each CV fold automatically.
set.seed(2)
split_demo <- initial_split(peng_bal, prop = 0.75, strata = y)
train_demo <- training(split_demo)
test_demo <- testing(split_demo)

rec_prep <- prep(rec_base, training = train_demo)
bake_train <- bake(rec_prep, new_data = train_demo)
bake_test <- bake(rec_prep, new_data = test_demo)

dplyr::glimpse(bake_train)
Rows: 198
Columns: 7
$ bill_length_mm   <dbl> -0.68460562, -0.60823834, -0.45550379, -1.14280927, -…
$ bill_depth_mm    <dbl> 0.96969431, 0.30972785, 0.61432775, 1.27429421, 1.934…
$ year             <dbl> -1.293173, -1.293173, -1.293173, -1.293173, -1.293173…
$ y                <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adeli…
$ island_Dream     <dbl> -0.5175626, -0.5175626, -0.5175626, -0.5175626, -0.51…
$ island_Torgersen <dbl> 2.1159567, 2.1159567, 2.1159567, 2.1159567, 2.1159567…
$ sex_male         <dbl> 0.9874465, -1.0075984, -1.0075984, -1.0075984, 0.9874…
NoteAfternoon lab (microbiome)

Same pattern on OTU counts — add step_mutate(..., log1p(.x)) on skewed abundances before step_zv.

Step 2 — Model blueprint (spec)

Choose the learner and its settings — still no fitting.

Step 2 — What is a spec?

  • parsnip stores the model family and hyperparameters in a spec object.
  • set_engine("rpart") picks the R implementation; set_mode("classification") matches a factor outcome.
  • tune() marks arguments we will search later — the spec stays a blueprint until fit().

Classification tree with rpart

tree_spec_demo <- decision_tree(tree_depth = 4, min_n = 10) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec_demo
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 4
  min_n = 10

Computational engine: rpart 
  • decision_tree() — fixed parsnip function name.
  • tree_depth, min_n — depth limit and minimum leaf size.
  • set_engine("rpart") — same engine as the gene trees this morning.
  • set_mode("classification") — factor outcome y.

Tuning knobs (tune())

tree_spec <- decision_tree(
  tree_depth = tune(),
  min_n = tune()
) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 
  • tune() marks arguments tune_grid() will search — we pick winners by cross-validated accuracy.

Parsnip: type → engine → mode

Model type Valid engines (examples) Modes
decision_tree() "rpart", "C5.0" classification, regression
rand_forest() "ranger", "randomForest" classification, regression
boost_tree() "xgboost", "lightgbm" classification, regression

Thursday swaps spec — same recipe and workflow() pattern.

decision_tree() |>
  set_engine("lm") |>
  set_mode("classification")
# Error: 'lm' is not a valid engine for decision_tree()

Step 3 — Workflow

Bundle preprocessing and model into one object.

Step 3 — Bundle recipe and model

  • workflow() combines rec_base and tree_spec so you never preprocess the full table before splitting.
  • add_recipe() + add_model() — one object to pass to fit(), tune_grid(), or fit_resamples().
  • Inside each resample, the workflow prep → bake → train on analysis rows only.

Step 3 — workflow() code

wf <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec)
wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_zv()
• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 

Step 3 — One-shot fit (intuition)

  • fit(wf, data = peng_bal) on all rows is a quick way to see a decision boundary — not our final honest evaluation.
  • We use a fixed demo spec (tree_spec_demo) so the plot is stable before tuning.
  • After tuning (Step 5), we refit on all rows for teaching plots; in a project, keep a locked test set.

Step 3 — Fit and boundary (code)

wf_demo <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec_demo)
fit_demo <- fit(wf_demo, data = peng_bal)

Step 4 — Metrics (accuracy today)

Decide how to score predictions before you tune.

Step 4 — Choose a metric before tuning

  • accuracy — fraction of correct class labels; fine when classes are balanced and errors cost the same.
  • Thursday: sensitivity, ROC-AUC, PR-AUC when the positive class is rare — accuracy can mislead.
  • metric_set(accuracy) bundles scorers for tune_grid() and fit_resamples().

Step 4 — metric_set code

metrics_day2
A metric set, consisting of:
- `accuracy()`, a class metric | direction: maximize

Step 5 — Cross-validation and tuning

Search tree_depth and min_n on held-out folds, then finalize.

fit() vs fit_resamples() vs tune_grid()

Function Data splits Parameter combinations
fit() 1 training set 1
fit_resamples() Many folds/resamples 1
tune_grid() Many folds/resamples Many
  • Use fit() when hyperparameters are already chosen.
  • Use fit_resamples() to estimate CV performance for one fixed workflow.
  • Use tune_grid() when you need to search over multiple hyperparameter values.

In this walkthrough: fit_demo (plot), rs_tree (CV, fixed params), tune_res (CV + search).

Step 5 — Search hyperparameters on folds

  • folds <- vfold_cv(..., strata = y) — each fold keeps both species represented.
  • tune_grid(wf, resamples = folds, grid = ..., metrics = metrics_day2) tries many specs; preprocessing is refit inside each fold.
  • Grid spans stumps (depth 1, large min_n) to more flexible trees.

Step 5 — tune_grid code

set.seed(2)
tune_res <- tune_grid(
  wf,
  resamples = folds,
  grid = grid_regular(
    tree_depth(range = c(1, 6)),
    min_n(range = c(2, 50)),
    levels = 4
  ),
  metrics = metrics_day2
)

Step 5 — Read the tuning plot

  • autoplot(tune_res) shows mean accuracy vs tree_depth and min_n.
  • Expect lower accuracy at tree_depth = 1 (stump) and with large min_n (heavy pruning).
  • Pick a region that balances fit and simplicity before select_best().

Step 5 — Tuning plot (code)

tune_res |>
  autoplot() +
  theme_minimal() +
  labs(
    title = "Tuning: accuracy vs tree_depth and min_n",
    subtitle = "Bill + island + sex + year — no flipper/mass"
  )

The honest evaluation workflow

You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.

Stage What happens tidymodels tools (examples)
Hold out test data Test rows stay unused until the very end initial_split(), training(), testing()
Tune on training only CV estimates average performance while searching hyperparameters vfold_cv(), tune_grid(), select_best()
Compare learners Same folds + recipe; pick the best model type after tuning each fit_resamples(), collect_metrics()
Final report Refit on all training rows; score once on test last_fit()
  • Recipe steps (dummy, scale, impute, PCA, upsample, …) must be fit inside each CV fold on training rows only - otherwise information leaks from held-out data.
  • fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.
  • Chapter 4 — train/test split walks through last_fit() with a real train/test split.

Step 5 — Finalize and refit

  • select_best(tune_res, metric = "accuracy") — best hyperparameters from CV.
  • finalize_workflow(wf, best) — plug tuned values into the blueprint.
  • fit(final_wf, data = peng_cls) — one tree on all rows for plots; fit_resamples() reports honest CV performance.
  • In projects, use last_fit() with a true test set instead of fitting on all rows for reporting.

Step 5 — Finalize and CV metrics (code)

best <- select_best(tune_res, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
final_fit <- fit(final_wf, data = peng_cls)
best
# A tibble: 1 × 3
  tree_depth min_n .config         
       <int> <int> <chr>           
1          2     2 pre0_mod05_post0
set.seed(6)
wf_tree <- final_wf
rs_tree <- fit_resamples(wf_tree, folds, metrics = metrics_day2)
metrics_summary_table(rs_tree, "Tuned tree (5-fold CV)")
# A tibble: 1 × 4
  model                  .metric   mean std_err
  <chr>                  <chr>    <dbl>   <dbl>
1 Tuned tree (5-fold CV) accuracy 0.989 0.00465

Fitted tree structure

final_fit |>
  extract_fit_engine() |>
  rpart.plot(
    roundint = FALSE,
    type = 4,
    extra = 104,
    box.palette = "RdBu",
    main = "Final decision tree (rpart)"
  )

Variable importance (this tree)

final_fit |>
  extract_fit_parsnip() |>
  vip(geom = "point", aesthetics = list(color = "midnightblue", size = 3)) +
  theme_minimal() +
  labs(
    title = "Split-based importance (single tree)",
    subtitle = "Correlated bills ≠ causation — full VIP chapter on Thursday"
  )


Lecture: Day 2 (Tuesday) pipeline slides
Notebook: penguins-canonical-pipeline.Rmd
Next: Chapter 5 — Stronger learners, same discipline