Day 2 (Tuesday): Trees, rpart, and the tidymodels Pipeline

Learning objectives

Frame Breiman’s two cultures: same goals and problems, different modeling approaches (not “stats vs ML”)
Build intuition for decision trees (rpart): recursive splits, partitions, and interpretable rules
Walk through gene × disease (classification) and gene × trait (regression) trees step by step
Read |> chains for tidy tables and tidymodels preprocessing
Build recipe → spec → workflow → tune → fit on Palmer Penguins with decision_tree + rpart and accuracy on cross-validation

Part A - Two cultures and trees

Two cultures (Breiman, 2001)

Leo Breiman’s Statistical Modeling: The Two Cultures argues the useful split is not “statistics vs machine learning.”

Scientists often share the same goals (learn patterns, predict, support decisions) and work on the same kinds of data
What differs is how you build a model - what you assume up front and what you optimize

Data modeling culture

Starts with a stochastic model one can write down (Monday: lm, glm, lasso on known structure)
Emphasizes parameters, uncertainty, and whether the model class is adequate
Strength: transparent assumptions
Risk: the model may be wrong in shape even when it fits “well”

Algorithmic modeling culture

Treat the data-generating mechanism as largely unknown
Let a flexible algorithm find structure; judge it by predictive performance on data not used to fit
Strength: captures interactions and nonlinear boundaries
Risk: needs careful evaluation so flexibility does not fool you on the training data (overfitting)

Decision Trees

Linear models force us to look at the world through straight lines and additive math (glm). If two variables interact, we have to manually tell the model to multiply them. Decision trees change that. They naturally capture complex, non-linear relationships by asking a sequence of simple, nested questions.

Automatic Interactions: Captures non-linear boundaries and complex combinations of variables without you having to manually write out interaction terms.
Human-Readable Decision Logic: Every prediction can be traced back to a clear, if-then story based on actual measurements.

Today we will implement this splitting logic using the tidymodels framework, keeping rpart as our underlying engine.

`rpart` in one slide

Function call:

Classification: rpart(factor_y ~ ., data = ..., method = "class") - splits improve class purity per cell
Regression: rpart(trait ~ GeneA + GeneB, data = ...) - numeric y uses method = "anova"; each leaf predicts a mean trait

Tuning parameters (`rpart.control()`)

Set in control = rpart.control(cp = ..., minsplit = ..., maxdepth = ...) - next slide shows the effect of each knob.

`rpart.control()`: cp, minsplit, maxdepth

`rpart.control()`	What it does	Turn it up →
`cp`	Complexity penalty (pruning)	Simpler tree (fewer splits)
`minsplit`	Minimum cases in a node before trying another split	Fewer split attempts (coarser tree)
`maxdepth`	Maximum depth from root to leaf	Shallower tree

tidymodels names (this afternoon): cost_complexity ≈ cp, min_n ≈ minsplit, tree_depth ≈ maxdepth.

Same toy data in every panel - only one control changes per row.
cp: pruning / cost-complexity trade-off (also tuned via $cptable later).
minsplit: stops splitting when nodes are too small (stability on small n).
maxdepth: hard cap on how many questions the tree can ask in a row.

Part B - Gene simulations

Gene × disease (same simulation as Monday)

Binary disease from simulated gene expression - instructors know causal genes (Gene1, Gene3, Gene4, interaction)
Classification tree on all genes
Visualized as a tree and in the Gene1 × Gene3 plane (other genes kept constant at their means)
Notebook: logistic-regression-gene-disease.Rmd

Final tree (`rpart.plot`)

Final decision surface (Gene1 vs Gene3)

Other genes fixed at their sample means - same 2D slice as Monday’s logistic boundary.

Growing the tree (split by split)

Four steps (one new split each): tree (left, gold box + node number) and partition in Gene1 × Gene3 (right, dashed line when the new split uses those genes).

Growing the tree: split 1

New split on Gene3 (node 1) - highlighted in gold on the tree.

Growing the tree: split 2

New split on Gene1 (node 2) - highlighted in gold on the tree.

Growing the tree: split 3

New split on Gene1 (node 3): >= 0.382. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.

Growing the tree: split 4

New split on Gene4 (node 6): >= -0.0971. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.

Tree vs logistic (same data, in-sample accuracy)

In-sample accuracy on all n rows (Monday vs Tuesday model class)
Model	Accuracy_in
Logistic (glm)	0.74
Decision tree (rpart)	0.93

Tree vs logistic - boundaries side by side

Same Gene1 × Gene3 slice; other genes at their means.

Gene × trait (Monday simulation, regression tree)

Continuous trait from GeneA / GeneB - same ground truth as Monday’s lm / lasso block
rpart(trait ~ GeneA + GeneB, cp = ...) - sweep cp from $cptable to show how the predicted surface and tree change
Notebook: linear-regression-lasso.Rmd

Gene × trait - regression trees (rpart) on the same simulation as Monday (linear-regression-lasso.Rmd): continuous trait from GeneA and GeneB.

Gene × trait (continuous outcome)

Same story as Monday: trait = -0.5 \times GeneA + 0.8 \times GeneB + noise (linear truth)
rpart(trait ~ GeneA + GeneB, ...) - default method = "anova" for a numeric y (splits minimize within-node variance)
Contrast with disease block: method = "class" for a factor outcome
Notebook loop over cp values: linear-regression-lasso.Rmd

Complexity penalty `cp` (pruning)

Fit a large tree first; cp in rpart(..., cp = ...) keeps splits only if they improve the fit enough relative to tree size
$cptable lists candidate cp values (we skip the first row, as in the notebook)
Larger cp → simpler tree (fewer splits); smaller cp → more regions, smoother mosaic of predicted trait

Trait tree: step 1 of 4

cp = 0.112 - 2 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 2 of 4

cp = 0.0493 - 6 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 3 of 4

cp = 0.0225 - 12 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait tree: step 4 of 4

cp = 0.01 - 16 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.

Trait: tree vs linear (in-sample RMSE)

In-sample RMSE on all n rows - same genes as Monday’s simple linear model
Model	RMSE_in
Linear (GeneA + GeneB)	0.290
rpart (cp = 0.01)	0.416

Part C - Palmer Penguins: tidymodels pipeline (trees only)

Meet the Palmer penguins

Real data for the rest of Tuesday and again on Thursday: bill measurements, species labels, and honest evaluation.

Artwork: Allison Horst / Palmer Penguins LTER study (teaching use).

From manual `rpart` to a reusable pipeline

Monday compared models on all rows; today we hold out information via cross-validation while tuning
One workflow runs preprocessing + tree on each fold - same discipline as the microbiome lab this afternoon
Dataset: Palmer Penguins data card - Adelie vs Gentoo; bill + island + sex + year

Short tidyverse refresher before the tidymodels pipeline (same pipe grammar as recipes and workflows).

Tidy data and the pipe `|>`

Each column is one variable; each row is one penguin (or one sample)
x |> f() means: take the result of x and pass it as the first argument to f()
Read chains top to bottom - same habit you will use for recipe() |> step_*()

Example: `filter` and `select`

pg <- palmerpenguins::penguins |>
  filter(!is.na(sex), !is.na(bill_length_mm), !is.na(bill_depth_mm)) |>
  select(species, island, sex, bill_length_mm, bill_depth_mm, body_mass_g)

head(pg, 4)

# A tibble: 4 × 6
  species island    sex    bill_length_mm bill_depth_mm body_mass_g
  <fct>   <fct>     <fct>           <dbl>         <dbl>       <int>
1 Adelie  Torgersen male             39.1          18.7        3750
2 Adelie  Torgersen female           39.5          17.4        3800
3 Adelie  Torgersen female           40.3          18          3250
4 Adelie  Torgersen female           36.7          19.3        3450

Example: `mutate` (new columns)

pg |>
  mutate(bill_ratio = bill_length_mm / bill_depth_mm) |>
  select(species, bill_length_mm, bill_depth_mm, bill_ratio) |>
  head(4)

# A tibble: 4 × 4
  species bill_length_mm bill_depth_mm bill_ratio
  <fct>            <dbl>         <dbl>      <dbl>
1 Adelie            39.1          18.7       2.09
2 Adelie            39.5          17.4       2.27
3 Adelie            40.3          18         2.24
4 Adelie            36.7          19.3       1.90

Same grammar in `tidymodels`

Table pipelines: penguins |> filter(...) |> mutate(...)
Preprocessing: recipe(y ~ ., data = train) |> step_dummy() |> step_normalize()
Modeling objects: workflow() |> add_recipe(rec) |> add_model(spec) - here you chain objects, not tables, but the read left-to-right habit is the same

Palmer Penguins: one table, one model family (trees)

Outcome y: Adelie vs Gentoo (complete cases).
Predictors: bill length & depth, island, sex, year - no flipper or body mass (so tree_depth and min_n show a real tuning tradeoff).
Today’s engine: decision_tree() + rpart only - forests and boosting come on Thursday.

The pipeline in principle

No code yet - the shape of every tidymodels project this week:

flowchart LR
  data[Raw table] --> recipe[recipe prep bake]
  recipe --> spec[parsnip spec]
  spec --> wf[workflow]
  folds[vfold_cv folds] --> tune[tune_grid]
  metrics[metric_set] --> tune
  wf --> tune
  tune --> best[select_best finalize]
  best --> fit[fit predict]
  fit --> out[Plots and tables]

Recipe steps run inside each CV fold - never scale or encode on the full table before splitting (avoids leakage).
recipe / spec / workflow are blueprints; only fit(), fit_resamples(), and tune_grid() touch rows.
Today’s score: accuracy only; Thursday adds metrics for imbalanced data.

Five steps - what each one does

Recipe - declare preprocessing steps (drop useless columns, dummy-code categories, scale numerics). prep() learns rules on training rows; bake() applies them.
Spec - choose model type, engine, mode, and which arguments to tune. Still a blueprint - not fitted.
Workflow - glue recipe + spec so one fit() runs preprocessing and training in the right order.
Metrics - define how to judge predictions (accuracy today).
Resample and tune - search hyperparameters on vfold_cv folds, pick the best, then fit for plots and interpretation.

Packages in the pipeline (Tuesday focus)

Package	Job today
`recipes`	Preprocessing checklist (dummy, scale, …)
`parsnip`	Model blueprint (`decision_tree`, `set_engine("rpart")`)
`workflows`	Glue recipe + model
`rsample`	Folds (`vfold_cv`) - used in Step 5
`tune`	Search `tree_depth` and `min_n`
`yardstick`	`accuracy` scoring

Step 1 - Preprocessing (`recipe`)

Turn raw columns into a model-ready table - in order, and only on training rows inside each fold.

Step 1 - Why a recipe?

A recipe is a checklist of transforms applied before the model sees the data.
Steps run top to bottom; order matters (e.g. dummy-code before scaling).
prep() learns parameters on training rows; bake() applies those rules to new rows without re-learning.
A workflow() runs prep + bake + fit in the right order inside each CV fold.

Recipe pipeline (visual)

Start the recipe

rec <- recipe(y ~ ., data = peng_bal)
rec

recipe(y ~ ., data = peng_bal) - y is the outcome; . = all other columns are predictors.
Blueprint only - nothing transformed yet.

`step_zv()` - drop zero-variance columns

rec <- rec |>
  step_zv(all_predictors())
rec

Removes columns with one unique value (no predictive power; can break algorithms in small folds).

`step_dummy()` - encode categories

rec <- rec |>
  step_dummy(all_nominal_predictors())
rec

island and sex → numeric 0/1 columns for rpart.

`step_normalize()` - scale numeric predictors

rec <- rec |>
  step_normalize(all_numeric_predictors())
rec

Z-score each numeric column using training-fold rows only inside CV.
Good habit before you swap in other model types later in the week.

rec_base <- rec
rec_base

`prep()` and `bake()` — learn rules, then apply

prep(recipe, training = train) estimates each step on training rows only (means for scaling, dummy levels, etc.).
bake(prep, new_data = ...) applies those rules to any table (train, test, or one new penguin) without re-learning.
A workflow() calls prep + bake + fit in the right order inside each CV fold - you rarely call them by hand except for checks and tools like SHAP (Thursday).

set.seed(2)
split_demo <- initial_split(peng_bal, prop = 0.75, strata = y)
train_demo <- training(split_demo)
test_demo <- testing(split_demo)

rec_prep <- prep(rec_base, training = train_demo)
bake_train <- bake(rec_prep, new_data = train_demo)
bake_test <- bake(rec_prep, new_data = test_demo)

dplyr::glimpse(bake_train)

Rows: 198
Columns: 7
$ bill_length_mm   <dbl> -0.68460562, -0.60823834, -0.45550379, -1.14280927, -…
$ bill_depth_mm    <dbl> 0.96969431, 0.30972785, 0.61432775, 1.27429421, 1.934…
$ year             <dbl> -1.293173, -1.293173, -1.293173, -1.293173, -1.293173…
$ y                <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adeli…
$ island_Dream     <dbl> -0.5175626, -0.5175626, -0.5175626, -0.5175626, -0.51…
$ island_Torgersen <dbl> 2.1159567, 2.1159567, 2.1159567, 2.1159567, 2.1159567…
$ sex_male         <dbl> 0.9874465, -1.0075984, -1.0075984, -1.0075984, 0.9874…

bake_train: model-ready columns the tree sees during fit().
bake_test: same encoding using training statistics - that is what prevents leakage.

Afternoon lab — Task 2.1

Tip

Build the recipe — log1p on OTUs, step_zv, step_normalize on the microbiome table (Label = Early vs Late).

Lab exercises · Solution · Download Rmd

Step 2 - Model blueprint (`spec`)

Choose the learner and its settings - still no fitting.

Step 2 - What is a spec?

parsnip stores the model family and hyperparameters in a spec object.
set_engine("rpart") picks the R implementation; set_mode("classification") matches a factor outcome.
tune() marks arguments we will search later - the spec stays a blueprint until fit().

Classification tree with `rpart`

tree_spec_demo <- decision_tree(tree_depth = 4, min_n = 10) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec_demo

Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 4
  min_n = 10

Computational engine: rpart

decision_tree() - fixed parsnip function name.
tree_depth, min_n - depth limit and minimum leaf size.
set_engine("rpart") - same engine as the gene trees this morning.
set_mode("classification") - factor outcome y.

Afternoon lab — Task 2.2

Tip

Choose the tree model — decision_tree() + set_engine("rpart"), classification mode.

Lab exercises · Solution · Download Rmd

Tuning knobs (`tune()`)

tree_spec <- decision_tree(
  tree_depth = tune(),
  min_n = tune()
) |>
  set_engine("rpart") |>
  set_mode("classification")
tree_spec

Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart

tune() marks arguments tune_grid() will search - we pick winners by cross-validated accuracy.

Parsnip: type → engine → mode

Model type	Valid engines (examples)	Modes
`decision_tree()`	`"rpart"`, `"C5.0"`	classification, regression
`rand_forest()`	`"ranger"`, `"randomForest"`	classification, regression
`boost_tree()`	`"xgboost"`, `"lightgbm"`	classification, regression

Thursday swaps spec - same recipe and workflow() pattern.

decision_tree() |>
  set_engine("lm") |>
  set_mode("classification")
# Error: 'lm' is not a valid engine for decision_tree()

Step 3 - Workflow

Bundle preprocessing and model into one object.

Step 3 - Bundle recipe and model

workflow() combines rec_base and tree_spec so you never preprocess the full table before splitting.
add_recipe() + add_model() - one object to pass to fit(), tune_grid(), or fit_resamples().
Inside each resample, the workflow prep → bake → train on analysis rows only.

Step 3 - `workflow()` code

wf <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec)
wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_zv()
• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = tune()
  min_n = tune()

Computational engine: rpart

Step 3 - One-shot fit (intuition)

For visualization only

fit(wf, data = peng_bal) on all rows helps us see a boundary. For honest reporting, keep a locked test set (or use CV metrics from Step 5).

fit(wf, data = peng_bal) on all rows is a quick way to see a decision boundary - not our final honest evaluation.
We use a fixed demo spec (tree_spec_demo) so the plot is stable before tuning.
After tuning (Step 5), we refit on all rows for teaching plots; in a project, keep a locked test set.

Step 3 - Fit and boundary (code)

wf_demo <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec_demo)
fit_demo <- fit(wf_demo, data = peng_bal)

Afternoon lab — Task 2.3

Tip

Workflow and train/test tree — workflow(), initial_split, fit tree, predict the held-out test set.

Lab exercises · Solution · Download Rmd

Logistic boundary on penguins (same bill plane)

A smooth linear boundary from logistic_reg() - contrast with the rectangular tree regions above.

Afternoon lab — Task 2.4

Tip

Add logistic regression — second workflow with the same recipe and logistic_reg + glm engine; same train/test split.

Lab exercises · Solution · Download Rmd

Step 4 - Metrics

Decide how to score predictions before you tune.

Step 4 - Choose a metric before tuning

Today we use accuracy - fraction of correct class labels; fine when classes are balanced and errors cost the same.
Thursday: we will add sensitivity, ROC-AUC, PR-AUC when the positive class is rare - accuracy can mislead.
metric_set(accuracy) bundles scorers for tune_grid() and fit_resamples().

Step 4 - `metric_set` code

metrics_day2

A metric set, consisting of:
- `accuracy()`, a class metric | direction: maximize

Afternoon lab — Task 2.5

Tip

Compare on the test set — accuracy (and optionally ROC AUC) for tree vs logistic on test rows only.

Lab exercises · Solution · Download Rmd

Step 5 - Cross-validation and tuning

Search tree_depth and min_n on held-out folds, then finalize.

`fit()` vs `fit_resamples()` vs `tune_grid()`

Function	Data splits	Parameter combinations
`fit()`	1 training set	1
`fit_resamples()`	Many folds/resamples	1
`tune_grid()`	Many folds/resamples	Many

Use fit() when hyperparameters are already chosen.
Use fit_resamples() to estimate CV performance for one fixed workflow.
Use tune_grid() when you need to search over multiple hyperparameter values.

In this deck: fit_demo (one fit for plotting), rs_tree (CV with fixed tuned params), tune_res (CV + hyperparameter search).

Step 5 - Search hyperparameters on folds

folds <- vfold_cv(..., strata = y) - each fold keeps both species represented.
tune_grid(wf, resamples = folds, grid = ..., metrics = metrics_day2) tries many specs; preprocessing is refit inside each fold.
Grid spans stumps (depth 1, large min_n) to more flexible trees.

Step 5 - `tune_grid` code

set.seed(2)
tune_res <- tune_grid(
  wf,
  resamples = folds,
  grid = grid_regular(
    tree_depth(range = c(1, 6)),
    min_n(range = c(2, 50)),
    levels = 4
  ),
  metrics = metrics_day2
)

Step 5 - Read the tuning plot

autoplot(tune_res) shows mean accuracy vs tree_depth and min_n.
Expect lower accuracy at tree_depth = 1 (stump) and with large min_n (heavy pruning).
Pick a region that balances fit and simplicity before select_best().

Step 5 - Tuning plot (code)

tune_res |>
  autoplot() +
  theme_minimal() +
  labs(
    title = "Tuning: accuracy vs tree_depth and min_n",
    subtitle = "Bill + island + sex + year  -  no flipper/mass"
  )

The honest evaluation workflow

You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.

Stage	What happens	tidymodels tools (examples)
Hold out test data	Test rows stay unused until the very end	`initial_split()`, `training()`, `testing()`
Tune on training only	CV estimates average performance while searching hyperparameters	`vfold_cv()`, `tune_grid()`, `select_best()`
Compare learners	Same folds + recipe; pick the best model type after tuning each	`fit_resamples()`, `collect_metrics()`
Final report	Refit on all training rows; score once on test	`last_fit()`

Recipe steps (dummy, scale, impute, PCA, upsample, …) must be fit inside each CV fold on training rows only - otherwise information leaks from held-out data.
fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.
Chapter 4 — train/test split walks through last_fit() with a real train/test split.

Step 5 - Finalize and refit

select_best(tune_res, metric = "accuracy") - best hyperparameters from CV.
finalize_workflow(wf, best) - plug tuned values into the blueprint.
fit(final_wf, data = peng_cls) - one tree on all rows for plots; fit_resamples() reports honest CV performance.
In projects, use last_fit() with a true test set instead of fitting on all rows for reporting.

Step 5 - Finalize and CV metrics (code)

best <- select_best(tune_res, metric = "accuracy")
final_wf <- finalize_workflow(wf, best)
final_fit <- fit(final_wf, data = peng_cls)
best

# A tibble: 1 × 3
  tree_depth min_n .config         
       <int> <int> <chr>           
1          2     2 pre0_mod05_post0

set.seed(6)
wf_tree <- final_wf
rs_tree <- fit_resamples(wf_tree, folds, metrics = metrics_day2)
metrics_summary_table(rs_tree, "Tuned tree (5-fold CV)")

# A tibble: 1 × 4
  model                  .metric   mean std_err
  <chr>                  <chr>    <dbl>   <dbl>
1 Tuned tree (5-fold CV) accuracy 0.993 0.00458

Fitted tree structure

final_fit |>
  extract_fit_engine() |>
  rpart.plot(
    roundint = FALSE,
    type = 4,
    extra = 104,
    box.palette = "RdBu",
    main = "Final decision tree (rpart)"
  )

Variable importance (this tree)

final_fit |>
  extract_fit_parsnip() |>
  vip(geom = "point", aesthetics = list(color = "midnightblue", size = 3)) +
  theme_minimal() +
  labs(
    title = "Split-based importance (single tree)",
    subtitle = "Correlated bills ≠ causation  -  full VIP chapter on Thursday"
  )

Part D - Wrap-up

Transition to Day 4 (Thursday)

You have rec_base, workflow(), vfold_cv, and tune_grid with a tree
Thursday: same recipe - swap in forests, boosting, neural nets; add PCA, imputation, imbalance; richer metrics and VIP

End-of-day checklist

Can you name Breiman’s two cultures in one sentence each?
For the gene classification tree: what does one split do to the Gene1-Gene3 partition?
For the gene trait tree: what does a larger cp do to the number of regions?
What is the difference between training, CV, and test rows in the penguin workflow?
Which three objects does workflow() bundle?

Day 2 (Tuesday): Trees, rpart, and the tidymodels Pipeline

Learning objectives

Navigation

Part A - Two cultures and trees

Two cultures (Breiman, 2001)

Data modeling culture

Algorithmic modeling culture

Decision Trees

rpart in one slide

Function call:

Tuning parameters (rpart.control())

rpart.control(): cp, minsplit, maxdepth

Part B - Gene simulations

Gene × disease (same simulation as Monday)

Final tree (rpart.plot)

Final decision surface (Gene1 vs Gene3)

Growing the tree (split by split)

Growing the tree: split 1

Growing the tree: split 2

Growing the tree: split 3

Growing the tree: split 4

Tree vs logistic (same data, in-sample accuracy)

Tree vs logistic - boundaries side by side

Gene × trait (Monday simulation, regression tree)

Gene × trait (continuous outcome)

Complexity penalty cp (pruning)

Trait tree: step 1 of 4

Trait tree: step 2 of 4

Trait tree: step 3 of 4

Trait tree: step 4 of 4

Trait: tree vs linear (in-sample RMSE)

Part C - Palmer Penguins: tidymodels pipeline (trees only)

Meet the Palmer penguins

From manual rpart to a reusable pipeline

Tidy data and the pipe |>

Example: filter and select

Example: mutate (new columns)

Same grammar in tidymodels

Palmer Penguins: one table, one model family (trees)

The pipeline in principle

Five steps - what each one does

Packages in the pipeline (Tuesday focus)

Step 1 - Preprocessing (recipe)

Step 1 - Why a recipe?

Recipe pipeline (visual)

Start the recipe

step_zv() - drop zero-variance columns

step_dummy() - encode categories

step_normalize() - scale numeric predictors

prep() and bake() — learn rules, then apply

Afternoon lab — Task 2.1

Step 2 - Model blueprint (spec)

Step 2 - What is a spec?

Classification tree with rpart

Afternoon lab — Task 2.2

Tuning knobs (tune())

Parsnip: type → engine → mode

Step 3 - Workflow

Step 3 - Bundle recipe and model

Step 3 - workflow() code

Step 3 - One-shot fit (intuition)

Step 3 - Fit and boundary (code)

Afternoon lab — Task 2.3

Logistic boundary on penguins (same bill plane)

Afternoon lab — Task 2.4

Step 4 - Metrics

Step 4 - Choose a metric before tuning

Step 4 - metric_set code

Afternoon lab — Task 2.5

Step 5 - Cross-validation and tuning

fit() vs fit_resamples() vs tune_grid()

Step 5 - Search hyperparameters on folds

Step 5 - tune_grid code

Step 5 - Read the tuning plot

Step 5 - Tuning plot (code)

The honest evaluation workflow

Step 5 - Finalize and refit

Step 5 - Finalize and CV metrics (code)

Fitted tree structure

Variable importance (this tree)

`rpart` in one slide

Tuning parameters (`rpart.control()`)

`rpart.control()`: cp, minsplit, maxdepth

Final tree (`rpart.plot`)

Complexity penalty `cp` (pruning)

From manual `rpart` to a reusable pipeline

Tidy data and the pipe `|>`

Example: `filter` and `select`

Example: `mutate` (new columns)

Same grammar in `tidymodels`

Step 1 - Preprocessing (`recipe`)

`step_zv()` - drop zero-variance columns

`step_dummy()` - encode categories

`step_normalize()` - scale numeric predictors

`prep()` and `bake()` — learn rules, then apply

Step 2 - Model blueprint (`spec`)

Classification tree with `rpart`

Tuning knobs (`tune()`)

Step 3 - `workflow()` code

Step 4 - `metric_set` code

`fit()` vs `fit_resamples()` vs `tune_grid()`

Step 5 - `tune_grid` code