Chapter 3: Rules and trees

Trees are often the first model students find intuitive, because they mirror everyday reasoning: “if this, then that.” The model itself is a sequence of split decisions that route each row to a terminal prediction.

This chapter keeps that intuition but adds discipline. A tree is still a statistical model: it can overfit, it needs honest validation, and its interpretability does not remove uncertainty.


Two cultures, one practical goal

Breiman’s “two cultures” framing helps explain why this week uses both classic models and flexible algorithms.

The assumptions-first tradition emphasizes interpretable parameters and explicit model forms.
The prediction-first tradition emphasizes held-out performance and flexible function classes.

In practice, both traditions need the same scientific discipline: clear outcomes, leakage control, and honest evaluation.

How a decision tree actually behaves

A classification tree repeatedly asks split questions such as “bill length <= threshold?” and sends rows left or right. Each split tries to create purer child nodes than the parent node.
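To make "purer" concrete, here is a minimal sketch that scores candidate splits with Gini impurity, which is the default splitting criterion for classification in `rpart`. The toy bill lengths, the species labels "A" and "B", and the helper names `gini()` and `split_impurity()` are assumptions made up for illustration; they are not part of any package.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity of the two children produced by the question `x <= threshold`
split_impurity <- function(x, labels, threshold) {
  left <- labels[x <= threshold]
  right <- labels[x > threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Hypothetical toy data: bill lengths (mm) and species labels
bill_length <- c(36, 38, 39, 41, 45, 47, 49, 50)
species <- c("A", "A", "A", "A", "B", "B", "B", "B")

parent <- gini(species)                                # 0.5: evenly mixed node
clean <- split_impurity(bill_length, species, 43)      # 0: both children are pure
partial <- split_impurity(bill_length, species, 38.5)  # 1/3: right child is still mixed

print(c(parent = parent, clean = clean, partial = partial))
```

A tree-growing algorithm of this kind searches over variables and thresholds and keeps the split with the lowest weighted child impurity, then repeats the search inside each child.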

This gives you nonlinear boundaries and interaction effects without manually writing interaction terms. It also means that a tree allowed to grow without limits will keep splitting until it fits noise in the training data.
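The effect of unrestricted depth can be seen in a small simulation. This is a sketch under stated assumptions: the data are simulated (the true rule depends only on `x1`, and 20% of labels are flipped), and the seed, sample sizes, and helper names are arbitrary choices for illustration.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the simulated data are reproducible

# Simulated data (an assumption for illustration): the true rule depends
# only on x1, and 20% of labels are flipped to act as noise.
simulate <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y <- ifelse(x1 > 0.5, "yes", "no")
  flip <- runif(n) < 0.2
  y[flip] <- ifelse(y[flip] == "yes", "no", "yes")
  data.frame(x1 = x1, x2 = x2, y = factor(y, levels = c("no", "yes")))
}
train <- simulate(300)
test <- simulate(300)

accuracy <- function(fit, data) {
  mean(predict(fit, data, type = "class") == data$y)
}

# Shallow tree: at most two levels of splits
shallow <- rpart(y ~ ., data = train, method = "class",
                 control = rpart.control(maxdepth = 2))

# Deep tree: relax the usual stopping rules so it can keep splitting
deep <- rpart(y ~ ., data = train, method = "class",
              control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

results <- data.frame(
  model = c("shallow", "deep"),
  train_accuracy = c(accuracy(shallow, train), accuracy(deep, train)),
  test_accuracy = c(accuracy(shallow, test), accuracy(deep, test))
)
print(results)
```

The gap between training and test accuracy for the deep tree is the thing to look at: the extra splits memorize flipped labels that carry no information about new rows.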

Practical tuning controls

Idea	`rpart` language	`tidymodels` language
Limit depth	`maxdepth`	`tree_depth`
Require enough rows per split/leaf	`minsplit`, `minbucket`	`min_n` (passed to `minsplit` by the `rpart` engine)
Penalize complexity	`cp`	`cost_complexity` (with `rpart` engine)
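As a sketch of how the `rpart` controls in the table are set in practice, the example below passes them through `rpart.control()` and then checks that the fitted tree respects them. The built-in `iris` data is a stand-in chosen so the code runs anywhere, and the specific values (3, 20, 7, 0.01) are illustrative assumptions, not recommendations.

```r
library(rpart)

# `iris` ships with base R and stands in for the course data here.
controls <- rpart.control(
  maxdepth = 3,    # limit depth
  minsplit = 20,   # a node needs at least 20 rows before a split is tried
  minbucket = 7,   # every leaf must keep at least 7 rows
  cp = 0.01        # a split must improve relative fit by at least this much
)

fit <- rpart(Species ~ ., data = iris, method = "class", control = controls)

# Depth of the deepest node, using rpart's node numbering:
# node k sits at depth floor(log2(k)), with the root as node 1.
node_ids <- as.integer(rownames(fit$frame))
max_depth <- max(floor(log2(node_ids)))

# Number of rows in each leaf
leaf_sizes <- fit$frame$n[fit$frame$var == "<leaf>"]
print(c(max_depth = max_depth, smallest_leaf = min(leaf_sizes)))
```

Checking the fitted tree against the requested limits is a cheap way to confirm that the controls were actually applied.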

Minimal pruning example

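library(rpart)

# Assumes `train` is a data frame whose outcome column is `y`
set.seed(123)  # xerror comes from random cross-validation folds
fit <- rpart(y ~ ., data = train, method = "class")
printcp(fit)  # complexity table: CP, nsplit, rel error, xerror, xstd
plotcp(fit)   # cross-validated error plotted against cp

# Prune at the cp value with the lowest cross-validated error
cp_star <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit, cp = cp_star)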

printcp() and plotcp() are practical safeguards: they report the cross-validated error (xerror) at each candidate cp value, so you can prune away small splits that do not hold up on new data.
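A common alternative to picking the minimum xerror is the one-standard-error rule, which corresponds to the horizontal line that plotcp() draws. The sketch below is a variation on the pruning example, not the method used above: it picks the simplest tree whose xerror is within one xstd of the minimum. The built-in `iris` data, the seed, and the helper `n_leaves()` are assumptions for illustration.

```r
library(rpart)

set.seed(2024)  # xerror depends on random cross-validation folds

# Grow a deliberately large tree on the built-in iris data (a stand-in dataset)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0, minsplit = 5))

tab <- fit$cptable
best <- which.min(tab[, "xerror"])

# One-standard-error rule: the simplest tree whose xerror is within
# one xstd of the minimum. Rows of cptable run from simplest to most complex.
threshold <- tab[best, "xerror"] + tab[best, "xstd"]
simplest <- min(which(tab[, "xerror"] <= threshold))

fit_min <- prune(fit, cp = tab[best, "CP"])
fit_1se <- prune(fit, cp = tab[simplest, "CP"])

# Hypothetical helper: count terminal nodes in a fitted tree
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
print(c(min_xerror = n_leaves(fit_min), one_se = n_leaves(fit_1se)))
```

The one-standard-error tree is never larger than the minimum-xerror tree, and it is often preferred when two trees are statistically hard to tell apart.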

Tip

Trees are readable because they are built from simple rules, but readability does not guarantee reliability. Treat tree diagrams as model summaries, then validate them with proper resampling before drawing strong conclusions.
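One way to act on this advice is k-fold cross-validation. The following is a minimal base R sketch, assuming the built-in `iris` data as a stand-in, an arbitrary choice of 5 folds, and simple unstratified fold assignment; in a tidymodels workflow, `rsample::vfold_cv()` plays the same role.

```r
library(rpart)

set.seed(7)  # arbitrary seed so fold assignment is reproducible
k <- 5

# Randomly assign each row of the stand-in iris data to one of k folds
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Fit on k - 1 folds, score on the held-out fold, and repeat for every fold
fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  held_out <- iris[folds == i, ]
  fit <- rpart(Species ~ ., data = train, method = "class")
  mean(predict(fit, held_out, type = "class") == held_out$Species)
})

print(round(fold_accuracy, 3))
print(c(mean = mean(fold_accuracy), sd = sd(fold_accuracy)))
```

The spread across folds matters as much as the mean: if accuracy swings widely from fold to fold, the splits shown in a single tree diagram are unlikely to be stable either.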

Lecture: Day 2 (Tuesday) slides
Notebook: penguins-classification.Rmd
Next: Chapter 4 — A reproducible modeling workflow