rpart): recursive splits, partitions, and interpretable rules|> chains for tidy tables and tidymodels preprocessingrecipe → spec → workflow → tune → fit on Palmer Penguins with decision_tree + rpart and accuracy on cross-validationLeo Breiman’s Statistical Modeling: The Two Cultures argues the useful split is not “statistics vs machine learning.”
lm, glm, lasso on known structure)Linear models force us to look at the world through straight lines and additive math (glm). If two variables interact, we have to manually tell the model to multiply them. Decision trees change that. They naturally capture complex, non-linear relationships by asking a sequence of simple, nested questions.
Today we will implement this splitting logic using the tidymodels framework, keeping rpart as our underlying engine.
rpart in one sliderpart(factor_y ~ ., data = ..., method = "class") - splits improve class purity per cellrpart(trait ~ GeneA + GeneB, data = ...) - numeric y uses method = "anova"; each leaf predicts a mean traitrpart.control())Set in control = rpart.control(cp = ..., minsplit = ..., maxdepth = ...) - next slide shows the effect of each knob.
rpart.control(): cp, minsplit, maxdepthrpart.control() |
What it does | Turn it up → |
|---|---|---|
cp |
Complexity penalty (pruning) | Simpler tree (fewer splits) |
minsplit |
Minimum cases in a node before trying another split | Fewer split attempts (coarser tree) |
maxdepth |
Maximum depth from root to leaf | Shallower tree |
tidymodels names (this afternoon): cost_complexity ≈ cp, min_n ≈ minsplit, tree_depth ≈ maxdepth.
cp: pruning / cost-complexity trade-off (also tuned via $cptable later).minsplit: stops splitting when nodes are too small (stability on small n).maxdepth: hard cap on how many questions the tree can ask in a row.Binary disease from simulated gene expression - instructors know causal genes (Gene1, Gene3, Gene4, interaction)
Classification tree on all genes
Visualized as a tree and in the Gene1 × Gene3 plane (other genes kept constant at their means)
Notebook: logistic-regression-gene-disease.Rmd
rpart.plot)Other genes fixed at their sample means - same 2D slice as Monday’s logistic boundary.
Four steps (one new split each): tree (left, gold box + node number) and partition in Gene1 × Gene3 (right, dashed line when the new split uses those genes).
New split on Gene3 (node 1) - highlighted in gold on the tree.
New split on Gene1 (node 2) - highlighted in gold on the tree.
New split on Gene1 (node 3): >= 0.382. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.
New split on Gene4 (node 6): >= -0.0971. Gold box in the tree; dashed line in the partition when the split uses Gene1 or Gene3.
| Model | Accuracy_in |
|---|---|
| Logistic (glm) | 0.74 |
| Decision tree (rpart) | 0.93 |
Same Gene1 × Gene3 slice; other genes at their means.
lm / lasso blockrpart(trait ~ GeneA + GeneB, cp = ...) - sweep cp from $cptable to show how the predicted surface and tree changeGene × trait - regression trees (rpart) on the same simulation as Monday (linear-regression-lasso.Rmd): continuous trait from GeneA and GeneB.
rpart(trait ~ GeneA + GeneB, ...) - default method = "anova" for a numeric y (splits minimize within-node variance)method = "class" for a factor outcomecp values: linear-regression-lasso.Rmdcp (pruning)cp in rpart(..., cp = ...) keeps splits only if they improve the fit enough relative to tree size$cptable lists candidate cp values (we skip the first row, as in the notebook)cp → simpler tree (fewer splits); smaller cp → more regions, smoother mosaic of predicted traitcp = 0.112 - 2 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.
cp = 0.0493 - 6 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.
cp = 0.0225 - 12 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.
cp = 0.01 - 16 splits. Each region predicts a constant trait (piecewise average); background colour is the model’s prediction on a grid.
| Model | RMSE_in |
|---|---|
| Linear (GeneA + GeneB) | 0.290 |
| rpart (cp = 0.01) | 0.416 |
Real data for the rest of Tuesday and again on Thursday: bill measurements, species labels, and honest evaluation.
Artwork: Allison Horst / Palmer Penguins LTER study (teaching use).
rpart to a reusable pipelineShort tidyverse refresher before the tidymodels pipeline (same pipe grammar as recipes and workflows).
|>x |> f() means: take the result of x and pass it as the first argument to f()recipe() |> step_*()filter and select# A tibble: 4 × 6
species island sex bill_length_mm bill_depth_mm body_mass_g
<fct> <fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen male 39.1 18.7 3750
2 Adelie Torgersen female 39.5 17.4 3800
3 Adelie Torgersen female 40.3 18 3250
4 Adelie Torgersen female 36.7 19.3 3450
mutate (new columns)# A tibble: 4 × 4
species bill_length_mm bill_depth_mm bill_ratio
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 2.09
2 Adelie 39.5 17.4 2.27
3 Adelie 40.3 18 2.24
4 Adelie 36.7 19.3 1.90
tidymodelspenguins |> filter(...) |> mutate(...)recipe(y ~ ., data = train) |> step_dummy() |> step_normalize()workflow() |> add_recipe(rec) |> add_model(spec) - here you chain objects, not tables, but the read left-to-right habit is the samey: Adelie vs Gentoo (complete cases).tree_depth and min_n show a real tuning tradeoff).decision_tree() + rpart only - forests and boosting come on Thursday.No code yet - the shape of every tidymodels project this week:
recipe / spec / workflow are blueprints; only fit(), fit_resamples(), and tune_grid() touch rows.prep() learns rules on training rows; bake() applies them.recipe + spec so one fit() runs preprocessing and training in the right order.accuracy today).vfold_cv folds, pick the best, then fit for plots and interpretation.| Package | Job today |
|---|---|
recipes |
Preprocessing checklist (dummy, scale, …) |
parsnip |
Model blueprint (decision_tree, set_engine("rpart")) |
workflows |
Glue recipe + model |
rsample |
Folds (vfold_cv) - used in Step 5 |
tune |
Search tree_depth and min_n |
yardstick |
accuracy scoring |
recipe)Turn raw columns into a model-ready table - in order, and only on training rows inside each fold.
prep() learns parameters on training rows; bake() applies those rules to new rows without re-learning.workflow() runs prep + bake + fit in the right order inside each CV fold.recipe(y ~ ., data = peng_bal) - y is the outcome; . = all other columns are predictors.step_zv() - drop zero-variance columnsstep_dummy() - encode categoriesisland and sex → numeric 0/1 columns for rpart.step_normalize() - scale numeric predictorsprep() and bake() — learn rules, then applyprep(recipe, training = train) estimates each step on training rows only (means for scaling, dummy levels, etc.).bake(prep, new_data = ...) applies those rules to any table (train, test, or one new penguin) without re-learning.workflow() calls prep + bake + fit in the right order inside each CV fold - you rarely call them by hand except for checks and tools like SHAP (Thursday).set.seed(2)
split_demo <- initial_split(peng_bal, prop = 0.75, strata = y)
train_demo <- training(split_demo)
test_demo <- testing(split_demo)
rec_prep <- prep(rec_base, training = train_demo)
bake_train <- bake(rec_prep, new_data = train_demo)
bake_test <- bake(rec_prep, new_data = test_demo)
dplyr::glimpse(bake_train)Rows: 198
Columns: 7
$ bill_length_mm <dbl> -0.68460562, -0.60823834, -0.45550379, -1.14280927, -…
$ bill_depth_mm <dbl> 0.96969431, 0.30972785, 0.61432775, 1.27429421, 1.934…
$ year <dbl> -1.293173, -1.293173, -1.293173, -1.293173, -1.293173…
$ y <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adeli…
$ island_Dream <dbl> -0.5175626, -0.5175626, -0.5175626, -0.5175626, -0.51…
$ island_Torgersen <dbl> 2.1159567, 2.1159567, 2.1159567, 2.1159567, 2.1159567…
$ sex_male <dbl> 0.9874465, -1.0075984, -1.0075984, -1.0075984, 0.9874…
bake_train: model-ready columns the tree sees during fit().bake_test: same encoding using training statistics - that is what prevents leakage.Tip
Build the recipe — log1p on OTUs, step_zv, step_normalize on the microbiome table (Label = Early vs Late).
spec)Choose the learner and its settings - still no fitting.
parsnip stores the model family and hyperparameters in a spec object.set_engine("rpart") picks the R implementation; set_mode("classification") matches a factor outcome.tune() marks arguments we will search later - the spec stays a blueprint until fit().rpartDecision Tree Model Specification (classification)
Main Arguments:
tree_depth = 4
min_n = 10
Computational engine: rpart
decision_tree() - fixed parsnip function name.tree_depth, min_n - depth limit and minimum leaf size.set_engine("rpart") - same engine as the gene trees this morning.set_mode("classification") - factor outcome y.Tip
Choose the tree model — decision_tree() + set_engine("rpart"), classification mode.
tune())Decision Tree Model Specification (classification)
Main Arguments:
tree_depth = tune()
min_n = tune()
Computational engine: rpart
tune() marks arguments tune_grid() will search - we pick winners by cross-validated accuracy.| Model type | Valid engines (examples) | Modes |
|---|---|---|
decision_tree() |
"rpart", "C5.0" |
classification, regression |
rand_forest() |
"ranger", "randomForest" |
classification, regression |
boost_tree() |
"xgboost", "lightgbm" |
classification, regression |
Thursday swaps spec - same recipe and workflow() pattern.
Bundle preprocessing and model into one object.
workflow() combines rec_base and tree_spec so you never preprocess the full table before splitting.add_recipe() + add_model() - one object to pass to fit(), tune_grid(), or fit_resamples().workflow() code══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_zv()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Main Arguments:
tree_depth = tune()
min_n = tune()
Computational engine: rpart
For visualization only
fit(wf, data = peng_bal) on all rows helps us see a boundary. For honest reporting, keep a locked test set (or use CV metrics from Step 5).
fit(wf, data = peng_bal) on all rows is a quick way to see a decision boundary - not our final honest evaluation.tree_spec_demo) so the plot is stable before tuning.Tip
Workflow and train/test tree — workflow(), initial_split, fit tree, predict the held-out test set.
A smooth linear boundary from logistic_reg() - contrast with the rectangular tree regions above.
Tip
Add logistic regression — second workflow with the same recipe and logistic_reg + glm engine; same train/test split.
Decide how to score predictions before you tune.
accuracy - fraction of correct class labels; fine when classes are balanced and errors cost the same.metric_set(accuracy) bundles scorers for tune_grid() and fit_resamples().metric_set codeTip
Compare on the test set — accuracy (and optionally ROC AUC) for tree vs logistic on test rows only.
Search tree_depth and min_n on held-out folds, then finalize.
fit() vs fit_resamples() vs tune_grid()| Function | Data splits | Parameter combinations |
|---|---|---|
fit() |
1 training set | 1 |
fit_resamples() |
Many folds/resamples | 1 |
tune_grid() |
Many folds/resamples | Many |
fit() when hyperparameters are already chosen.fit_resamples() to estimate CV performance for one fixed workflow.tune_grid() when you need to search over multiple hyperparameter values.In this deck: fit_demo (one fit for plotting), rs_tree (CV with fixed tuned params), tune_res (CV + hyperparameter search).
folds <- vfold_cv(..., strata = y) - each fold keeps both species represented.tune_grid(wf, resamples = folds, grid = ..., metrics = metrics_day2) tries many specs; preprocessing is refit inside each fold.min_n) to more flexible trees.tune_grid codeautoplot(tune_res) shows mean accuracy vs tree_depth and min_n.tree_depth = 1 (stump) and with large min_n (heavy pruning).select_best().You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.
| Stage | What happens | tidymodels tools (examples) |
|---|---|---|
| Hold out test data | Test rows stay unused until the very end | initial_split(), training(), testing() |
| Tune on training only | CV estimates average performance while searching hyperparameters | vfold_cv(), tune_grid(), select_best() |
| Compare learners | Same folds + recipe; pick the best model type after tuning each | fit_resamples(), collect_metrics() |
| Final report | Refit on all training rows; score once on test | last_fit() |
fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.last_fit() with a real train/test split.select_best(tune_res, metric = "accuracy") - best hyperparameters from CV.finalize_workflow(wf, best) - plug tuned values into the blueprint.fit(final_wf, data = peng_cls) - one tree on all rows for plots; fit_resamples() reports honest CV performance.last_fit() with a true test set instead of fitting on all rows for reporting.# A tibble: 1 × 3
tree_depth min_n .config
<int> <int> <chr>
1 2 2 pre0_mod05_post0
rec_base, workflow(), vfold_cv, and tune_grid with a treecp do to the number of regions?workflow() bundle?