Day 2 (Tuesday): Trees, rpart, and the tidymodels Pipeline

Learning objectives

  • Frame Breiman’s two cultures: same goals and problems, different modeling approaches (not “stats vs ML”)
  • Build intuition for decision trees (rpart): recursive splits, partitions, and interpretable rules
  • Walk through gene × disease (classification) and gene × trait (regression) trees step by step
  • Read |> chains for tidy tables and tidymodels preprocessing
  • Build recipe → spec → workflow → tune → fit on Palmer Penguins with decision_tree + rpart and accuracy on cross-validation

Part A - Two cultures and trees

Two cultures (Breiman, 2001)

Leo Breiman’s Statistical Modeling: The Two Cultures argues the useful split is not “statistics vs machine learning.”

  • Scientists often share the same goals (learn patterns, predict, support decisions) and work on the same kinds of data
  • What differs is how you build a model - what you assume up front and what you optimize

Data modeling culture

  • Starts with a stochastic model one can write down (Monday: lm, glm, lasso on known structure)
  • Emphasizes parameters, uncertainty, and whether the model class is adequate
  • Strength: transparent assumptions
  • Risk: the model may be wrong in shape even when it fits “well”

Algorithmic modeling culture

  • Treat the data-generating mechanism as largely unknown
  • Let a flexible algorithm find structure; judge it by predictive performance on data not used to fit
  • Strength: captures interactions and nonlinear boundaries
  • Risk: needs careful evaluation so flexibility does not fool you on the training data (overfitting)

Decision Trees

Linear models force us to look at the world through straight lines and additive math (glm). If two variables interact, we have to manually tell the model to multiply them. Decision trees change that. They naturally capture complex, non-linear relationships by asking a sequence of simple, nested questions.

  • Automatic Interactions: Captures non-linear boundaries and complex combinations of variables without you having to manually write out interaction terms.
  • Human-Readable Decision Logic: Every prediction can be traced back to a clear, if-then story based on actual measurements.

Today we will implement this splitting logic using the tidymodels framework, keeping rpart as our underlying engine.

rpart in one slide

Function call:

  • Classification: rpart(factor_y ~ ., data = ..., method = "class") - splits improve class purity per cell
  • Regression: rpart(trait ~ GeneA + GeneB, data = ...) - numeric y uses method = "anova"; each leaf predicts a mean trait

Tuning parameters (rpart.control())

Set in control = rpart.control(cp = ..., minsplit = ..., maxdepth = ...) - next slide shows the effect of each knob.

rpart.control(): cp, minsplit, maxdepth

rpart.control() What it does Turn it up →
cp Complexity penalty (pruning) Simpler tree (fewer splits)
minsplit Minimum cases in a node before trying another split Fewer split attempts (coarser tree)
maxdepth Maximum depth from root to leaf Shallower tree

tidymodels names (this afternoon): cost_complexitycp, min_nminsplit, tree_depthmaxdepth.

  • Same toy data in every panel - only one control changes per row.
  • cp: pruning / cost-complexity trade-off (also tuned via $cptable later).
  • minsplit: stops splitting when nodes are too small (stability on small n).
  • maxdepth: hard cap on how many questions the tree can ask in a row.

Part B - Gene simulations

Gene × disease (same simulation as Monday)

  • Binary disease from simulated gene expression - instructors know causal genes (Gene1, Gene3, Gene4, interaction)

  • Classification tree on all genes

  • Visualized as a tree and in the Gene1 × Gene3 plane (other genes kept constant at their means)

  • Notebook: logistic-regression-gene-disease.Rmd

Final tree (rpart.plot)

Final decision surface (Gene1 vs Gene3)

Other genes fixed at their sample means - same 2D slice as Monday’s logistic boundary.

Growing the tree (split by split)

Four steps (one new split each): tree (left, gold box + node number) and partition in Gene1 × Gene3 (right, dashed line when the new split uses those genes).

Growing the tree: split 1

New split on Gene3 (node 1) - highlighted in gold on the tree.

Growing the tree: split 2

New split on Gene1 (node 2) - highlighted in gold on the tree.

Growing the tree: split 3

New split on Gene1 (node 3): >= 0.382. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.

Growing the tree: split 4

New split on Gene4 (node 6): >= -0.0971. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.

Tree vs logistic (same data, in-sample accuracy)

In-sample accuracy on all n rows (Monday vs Tuesday model class)
Model Accuracy_in
Logistic (glm) 0.74
Decision tree (rpart) 0.93

Tree vs logistic - boundaries side by side

Same Gene1 × Gene3 slice; other genes at their means.

Gene × trait (Monday simulation, regression tree)

  • Continuous trait from GeneA / GeneB - same ground truth as Monday’s lm / lasso block
  • rpart(trait ~ GeneA + GeneB, cp = ...) - sweep cp from $cptable to show how the predicted surface and tree change
  • Notebook: linear-regression-lasso.Rmd

Gene × trait - regression trees (rpart) on the same simulation as Monday (linear-regression-lasso.Rmd): continuous trait from GeneA and GeneB.

Gene × trait (continuous outcome)

  • Same story as Monday: trait = -0.5 \times GeneA + 0.8 \times GeneB + noise (linear truth)
  • rpart(trait ~ GeneA + GeneB, ...) - default method = "anova" for a numeric y (splits minimize within-node variance)
  • Contrast with disease block: method = "class" for a factor outcome
  • Notebook loop over cp values: linear-regression-lasso.Rmd

Complexity penalty cp (pruning)

  • Fit a large tree first; cp in rpart(..., cp = ...) keeps splits only if they improve the fit enough relative to tree size
  • $cptable lists candidate cp values (we skip the first row, as in the notebook)
  • Larger cpsimpler tree (fewer splits); smaller cp → more regions, smoother mosaic of predicted trait

Trait tree: step 1 of 4

cp = 0.112 - 2 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 2 of 4

cp = 0.0493 - 6 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 3 of 4

cp = 0.0225 - 12 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 4 of 4

cp = 0.01 - 16 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait: tree vs linear (in-sample RMSE)

In-sample RMSE on all n rows - same genes as Monday’s simple linear model
Model RMSE_in
Linear (GeneA + GeneB) 0.290
rpart (cp = 0.01) 0.416

Part C - Palmer Penguins: tidymodels pipeline (trees only)

Meet the Palmer penguins

Real data for the rest of Tuesday and again on Thursday: bill measurements, species labels, and honest evaluation.

Artwork: Allison Horst / Palmer Penguins LTER study (teaching use).

From manual rpart to a reusable pipeline

  • Monday compared models on all rows; today we hold out information via cross-validation while tuning
  • One workflow runs preprocessing + tree on each fold - same discipline as the microbiome lab this afternoon
  • Dataset: Palmer Penguins data card - Adelie vs Gentoo; bill + island + sex + year

Short tidyverse refresher before the tidymodels pipeline (same pipe grammar as recipes and workflows).

Tidy data and the pipe |>

  • Each column is one variable; each row is one penguin (or one sample)
  • x |> f() means: take the result of x and pass it as the first argument to f()
  • Read chains top to bottom - same habit you will use for recipe() |> step_*()

Example: filter and select

pg <- palmerpenguins::penguins |>
  filter(!is.na(sex), !is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  select(species, island, sex, bill_length_mm, bill_depth_mm, body_mass_g)

head(pg, 4)
# A tibble: 4 × 6
  species island    sex    bill_length_mm bill_depth_mm body_mass_g
  <fct>   <fct>     <fct>           <dbl>         <dbl>       <int>
1 Adelie  Torgersen male             39.1          18.7        3750
2 Adelie  Torgersen female           39.5          17.4        3800
3 Adelie  Torgersen female           40.3          18          3250
4 Adelie  Torgersen female           36.7          19.3        3450

Example: mutate (new columns)

pg |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  select(species, bill_length_mm, bill_depth_mm, bill_ratio) |>
  head(4)
# A tibble: 4 × 4
  species bill_length_mm bill_depth_mm bill_ratio
  <fct>            <dbl>         <dbl>      <dbl>
1 Adelie            39.1          18.7       2.09
2 Adelie            39.5          17.4       2.27
3 Adelie            40.3          18         2.24
4 Adelie            36.7          19.3       1.90

Same grammar in tidymodels

  • Table pipelines: penguins |> filter(...) |> mutate(...)
  • Preprocessing: recipe(y ~ ., data = train) |> step_dummy() |> step_normalize()
  • Modeling objects: workflow() |> add_recipe(rec) |> add_model(spec) - here you chain objects, not tables, but the read left-to-right habit is the same

Palmer Penguins: one table, one model family (trees)

  • Outcome y: Adelie vs Gentoo (complete cases).
  • Predictors: bill length & depth, island, sex, year - no flipper or body mass (so tree_depth and min_n show a real tuning tradeoff).
  • Today’s engine: decision_tree() + rpart only - forests and boosting come on Thursday.

The pipeline in principle

No code yet - the shape of every tidymodels project this week:

flowchart LR
  data[Raw table] --> recipe[recipe prep bake]
  recipe --> spec[parsnip spec]
  spec --> wf[workflow]
  folds[vfold_cv folds] --> tune[tune_grid]
  metrics[metric_set] --> tune
  wf --> tune
  tune --> best[select_best finalize]
  best --> fit[fit predict]
  fit --> out[Plots and tables]
  • Recipe steps run inside each CV fold - never scale or encode on the full table before splitting (avoids leakage).
  • recipe / spec / workflow are blueprints; only fit(), fit_resamples(), and tune_grid() touch rows.
  • Today’s score: accuracy only; Thursday adds metrics for imbalanced data.

Five steps - what each one does

  1. Recipe - declare preprocessing steps (drop useless columns, dummy-code categories, scale numerics). prep() learns rules on training rows; bake() applies them.
  2. Spec - choose model type, engine, mode, and which arguments to tune. Still a blueprint - not fitted.
  3. Workflow - glue recipe + spec so one fit() runs preprocessing and training in the right order.
  4. Metrics - define how to judge predictions (accuracy today).
  5. Resample and tune - search hyperparameters on vfold_cv folds, pick the best, then fit for plots and interpretation.

Packages in the pipeline (Tuesday focus)

Package Job today
recipes Preprocessing checklist (dummy, scale, …)
parsnip Model blueprint (decision_tree, set_engine("rpart"))
workflows Glue recipe + model
rsample Folds (vfold_cv) - used in Step 5
tune Search tree_depth and min_n
yardstick accuracy scoring

Step 1 - Preprocessing (recipe)

Turn raw columns into a model-ready table - in order, and only on training rows inside each fold.

Step 1 - Why a recipe?

  • A recipe is a checklist of transforms applied before the model sees the data.
  • Steps run top to bottom; order matters (e.g. dummy-code before scaling).
  • prep() learns parameters on training rows; bake() applies those rules to new rows without re-learning.
  • A workflow() runs prep + bake + fit in the right order inside each CV fold.

Recipe pipeline (visual)

Start the recipe

rec <- recipe(y ~ ., data = peng_bal)
rec
  • recipe(y ~ ., data = peng_bal) - y is the outcome; . = all other columns are predictors.
  • Blueprint only - nothing transformed yet.

step_zv() - drop zero-variance columns

rec <- rec |>
  step_zv(all_predictors())
rec
  • Removes columns with one unique value (no predictive power; can break algorithms in small folds).

step_dummy() - encode categories

rec <- rec |>
  step_dummy(all_nominal_predictors())
rec
  • island and sex → numeric 0/1 columns for rpart.

step_normalize() - scale numeric predictors

rec <- rec |>
  step_normalize(all_numeric_predictors())
rec
  • Z-score each numeric column using training-fold rows only inside CV.
  • Good habit before you swap in other model types later in the week.
rec_base <- rec
rec_base

prep() and bake() — learn rules, then apply

  • prep(recipe, training = train) estimates each step on training rows only (means for scaling, dummy levels, etc.).
  • bake(prep, new_data = ...) applies those rules to any table (train, test, or one new penguin) without re-learning.
  • A workflow() calls prep + bake + fit in the right order inside each CV fold - you rarely call them by hand except for checks and tools like SHAP (Thursday).
set.seed(2)
split_demo <- initial_split(peng_bal, prop = 0.75, strata = y)
train_demo <- training(split_demo)
test_demo <- testing(split_demo)

rec_prep <- prep(rec_base, training = train_demo)
bake_train <- bake(rec_prep, new_data = train_demo)
bake_test <- bake(rec_prep, new_data = test_demo)

dplyr::glimpse(bake_train)
Rows: 198
Columns: 7
$ bill_length_mm   <dbl> -0.68460562, -0.60823834, -0.45550379, -1.14280927, -…
$ bill_depth_mm    <dbl> 0.96969431, 0.30972785, 0.61432775, 1.27429421, 1.934…
$ year             <dbl> -1.293173, -1.293173, -1.293173, -1.293173, -1.293173…
$ y                <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adeli…
$ island_Dream     <dbl> -0.5175626, -0.5175626, -0.5175626, -0.5175626, -0.51…
$ island_Torgersen <dbl> 2.1159567, 2.1159567, 2.1159567, 2.1159567, 2.1159567…
$ sex_male         <dbl> 0.9874465, -1.0075984, -1.0075984, -1.0075984, 0.9874…
  • bake_train: model-ready columns the tree sees during fit().
  • bake_test: same encoding using training statistics - that is what prevents leakage.

Afternoon lab — Task 2.1

Tip

Build the recipe — log1p on OTUs, step_zv, step_normalize on the microbiome table (Label = Early vs Late).

Lab exercises · Solution · Download Rmd

Step 2 - Model blueprint (spec)

Choose the learner and its settings - still no fitting.

Step 2 - What is a spec?

  • parsnip stores the model family and hyperparameters in a spec object.
  • set_engine("rpart") picks the R implementation; set_mode("classification") matches a factor outcome.
  • tune() marks arguments we will search later - the spec stays a blueprint until fit().

Classification tree with rpart

tree_spec_demo <- decision_tree(tree_depth = 4, min_n = 10) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec_demo
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 4
  min_n = 10

Computational engine: rpart 
  • decision_tree() - fixed parsnip function name.
  • tree_depth, min_n - depth limit and minimum leaf size.
  • set_engine("rpart") - same engine as the gene trees this morning.
  • set_mode("classification") - factor outcome y.

Afternoon lab — Task 2.2

Tip

Choose the tree modeldecision_tree() + set_engine("rpart"), classification mode.

Lab exercises · Solution · Download Rmd

Tuning knobs (tune())

tree_spec <- decision_tree(
  tree_depth = tune(),
  min_n = tune()
) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 
  • tune() marks arguments tune_grid() will search - we pick winners by cross-validated accuracy.

Parsnip: type → engine → mode

Model type Valid engines (examples) Modes
decision_tree() "rpart", "C5.0" classification, regression
rand_forest() "ranger", "randomForest" classification, regression
boost_tree() "xgboost", "lightgbm" classification, regression

Thursday swaps spec - same recipe and workflow() pattern.

decision_tree() |>
  set_engine("lm") |>
  set_mode("classification")
# Error: 'lm' is not a valid engine for decision_tree()

Step 3 - Workflow

Bundle preprocessing and model into one object.

Step 3 - Bundle recipe and model

  • workflow() combines rec_base and tree_spec so you never preprocess the full table before splitting.
  • add_recipe() + add_model() - one object to pass to fit(), tune_grid(), or fit_resamples().
  • Inside each resample, the workflow prep → bake → train on analysis rows only.

Step 3 - workflow() code

wf <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec)
wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_zv()
• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart 

Step 3 - One-shot fit (intuition)

For visualization only

fit(wf, data = peng_bal) on all rows helps us see a boundary. For honest reporting, keep a locked test set (or use CV metrics from Step 5).

  • fit(wf, data = peng_bal) on all rows is a quick way to see a decision boundary - not our final honest evaluation.
  • We use a fixed demo spec (tree_spec_demo) so the plot is stable before tuning.
  • After tuning (Step 5), we refit on all rows for teaching plots; in a project, keep a locked test set.

Step 3 - Fit and boundary (code)

wf_demo <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec_demo)
fit_demo <- fit(wf_demo, data = peng_bal)

Afternoon lab — Task 2.3

Tip

Workflow and train/test treeworkflow(), initial_split, fit tree, predict the held-out test set.

Lab exercises · Solution · Download Rmd

Logistic boundary on penguins (same bill plane)

A smooth linear boundary from logistic_reg() - contrast with the rectangular tree regions above.

Afternoon lab — Task 2.4

Tip

Add logistic regression — second workflow with the same recipe and logistic_reg + glm engine; same train/test split.

Lab exercises · Solution · Download Rmd

Step 4 - Metrics

Decide how to score predictions before you tune.

Step 4 - Choose a metric before tuning

  • Today we use accuracy - fraction of correct class labels; fine when classes are balanced and errors cost the same.
  • Thursday: we will add sensitivity, ROC-AUC, PR-AUC when the positive class is rare - accuracy can mislead.
  • metric_set(accuracy) bundles scorers for tune_grid() and fit_resamples().

Step 4 - metric_set code

metrics_day2
A metric set, consisting of:
- `accuracy()`, a class metric | direction: maximize

Afternoon lab — Task 2.5

Tip

Compare on the test set — accuracy (and optionally ROC AUC) for tree vs logistic on test rows only.

Lab exercises · Solution · Download Rmd

Step 5 - Cross-validation and tuning

Search tree_depth and min_n on held-out folds, then finalize.

fit() vs fit_resamples() vs tune_grid()

Function Data splits Parameter combinations
fit() 1 training set 1
fit_resamples() Many folds/resamples 1
tune_grid() Many folds/resamples Many
  • Use fit() when hyperparameters are already chosen.
  • Use fit_resamples() to estimate CV performance for one fixed workflow.
  • Use tune_grid() when you need to search over multiple hyperparameter values.

In this deck: fit_demo (one fit for plotting), rs_tree (CV with fixed tuned params), tune_res (CV + hyperparameter search).

Step 5 - Search hyperparameters on folds

  • folds <- vfold_cv(..., strata = y) - each fold keeps both species represented.
  • tune_grid(wf, resamples = folds, grid = ..., metrics = metrics_day2) tries many specs; preprocessing is refit inside each fold.
  • Grid spans stumps (depth 1, large min_n) to more flexible trees.

Step 5 - tune_grid code

set.seed(2)
tune_res <- tune_grid(
  wf,
  resamples = folds,
  grid = grid_regular(
    tree_depth(range = c(1, 6)),
    min_n(range = c(2, 50)),
    levels = 4
  ),
  metrics = metrics_day2
)

Step 5 - Read the tuning plot

  • autoplot(tune_res) shows mean accuracy vs tree_depth and min_n.
  • Expect lower accuracy at tree_depth = 1 (stump) and with large min_n (heavy pruning).
  • Pick a region that balances fit and simplicity before select_best().

Step 5 - Tuning plot (code)

tune_res |>
  autoplot() +
  theme_minimal() +
  labs(
    title = "Tuning: accuracy vs tree_depth and min_n",
    subtitle = "Bill + island + sex + year  -  no flipper/mass"
  )

The honest evaluation workflow

You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.

Stage What happens tidymodels tools (examples)
Hold out test data Test rows stay unused until the very end initial_split(), training(), testing()
Tune on training only CV estimates average performance while searching hyperparameters vfold_cv(), tune_grid(), select_best()
Compare learners Same folds + recipe; pick the best model type after tuning each fit_resamples(), collect_metrics()
Final report Refit on all training rows; score once on test last_fit()
  • Recipe steps (dummy, scale, impute, PCA, upsample, …) must be fit inside each CV fold on training rows only - otherwise information leaks from held-out data.
  • fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.
  • Chapter 4 — train/test split walks through last_fit() with a real train/test split.

Step 5 - Finalize and refit

  • select_best(tune_res, metric = "accuracy") - best hyperparameters from CV.
  • finalize_workflow(wf, best) - plug tuned values into the blueprint.
  • fit(final_wf, data = peng_cls) - one tree on all rows for plots; fit_resamples() reports honest CV performance.
  • In projects, use last_fit() with a true test set instead of fitting on all rows for reporting.

Step 5 - Finalize and CV metrics (code)

best <- select_best(tune_res, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
final_fit <- fit(final_wf, data = peng_cls)
best
# A tibble: 1 × 3
  tree_depth min_n .config         
       <int> <int> <chr>           
1          2     2 pre0_mod05_post0
set.seed(6)
wf_tree <- final_wf
rs_tree <- fit_resamples(wf_tree, folds, metrics = metrics_day2)
metrics_summary_table(rs_tree, "Tuned tree (5-fold CV)")
# A tibble: 1 × 4
  model                  .metric   mean std_err
  <chr>                  <chr>    <dbl>   <dbl>
1 Tuned tree (5-fold CV) accuracy 0.993 0.00458

Fitted tree structure

final_fit |>
  extract_fit_engine() |>
  rpart.plot(
    roundint = FALSE,
    type = 4,
    extra = 104,
    box.palette = "RdBu",
    main = "Final decision tree (rpart)"
  )

Variable importance (this tree)

final_fit |>
  extract_fit_parsnip() |>
  vip(geom = "point", aesthetics = list(color = "midnightblue", size = 3)) +
  theme_minimal() +
  labs(
    title = "Split-based importance (single tree)",
    subtitle = "Correlated bills ≠ causation  -  full VIP chapter on Thursday"
  )

Part D - Wrap-up

Transition to Day 4 (Thursday)

  • You have rec_base, workflow(), vfold_cv, and tune_grid with a tree
  • Thursday: same recipe - swap in forests, boosting, neural nets; add PCA, imputation, imbalance; richer metrics and VIP

End-of-day checklist

  • Can you name Breiman’s two cultures in one sentence each?
  • For the gene classification tree: what does one split do to the Gene1-Gene3 partition?
  • For the gene trait tree: what does a larger cp do to the number of regions?
  • What is the difference between training, CV, and test rows in the penguin workflow?
  • Which three objects does workflow() bundle?