Day 4 (Thursday): Models, Preprocessing, Metrics, and Importance

Learning objectives

Swap parsnip specs (bagging, boosting, neural nets) on the same recipe + workflow from Tuesday
Compare many models fairly with fit_resamples() using accuracy until imbalance
Add dimension reduction, imputation, and resampling as recipe steps inside cross-validation (CV)
Choose metrics beyond accuracy (sensitivity, precision-recall, confusion matrices) for imbalanced data
Read variable importance without causal over-interpretation

Terms used today (abbreviations)

Term	Meaning
CV	Cross-validation - resample data into folds; train on some rows, evaluate on held-out rows
PCA	Principal component analysis - linear combinations of correlated predictors
NA	Missing value (not recorded)
ROC	Receiver operating characteristic curve - tradeoff of true vs false positives across thresholds
PR	Precision-recall curve - precision vs recall across thresholds
AUC	Area under the curve - one-number summary of a threshold curve (e.g. ROC-AUC, PR-AUC)
VIP	Variable importance - which predictors the model used most in splits
MLP	Multilayer perceptron - small feed-forward neural network
SHAP	SHapley additive explanations - local “why this prediction?” scores
RF	Random forest - bagged ensemble of decision trees
GBM	Gradient boosting machine - sequential trees that correct prior errors
XGBoost	Extreme gradient boosting - popular GBM engine in `parsnip`
SVM	Support vector machine - margin-based classifier (`svm_rbf()` uses a radial kernel)
TP / FP / FN / TN	True/false positive/negative - counts from a confusion matrix (Part E)

Part A - Recap (Tuesday’s pipeline)

Palmer penguins - back again

Same peng_bal table and rec_base recipe as Tuesday - today we swap models and add recipe steps.

What you already have

rec_base - step_zv (drop zero-variance columns) → step_dummy (one-hot categoricals) → step_normalize (scale numerics) on peng_bal
workflow() + decision_tree / rpart tuned on Tuesday
vfold_cv(..., strata = y) - 5-fold cross-validation (CV) with both species in each fold
metric_set(accuracy) for Parts B-D today (more metrics in Part E)

Teaching scenarios (Part C extensions)

We deliberately stress the penguins in three ways:

# A tibble: 3 × 2
  y3            n
  <fct>     <int>
1 Adelie      146
2 Gentoo      119
3 Chinstrap    15

[1] 22

[1] 8

Object	Stress	Recipe extension (Part C)
`peng_imb3`	Three species; downsampled Chinstrap (~15 Chinstrap vs abundant Adelie & Gentoo)	`step_upsample(y3)` - Part E metrics
`peng_na`	Missing bill length and some `sex`	`step_impute_*()`
`peng_pca`	Extra size measures (correlated)	`step_pca()`

Part B - Many models (accuracy only)

Rule: keep add_recipe(rec_base) and folds - change add_model(spec) only.

Model catalog and engines

Same rec_base from Tuesday - change spec (and set_engine()); keep workflow() + folds.

Model spec	Typical tuned knobs	Common engines
`decision_tree()`	`tree_depth`, `min_n`, `cost_complexity`	`rpart`, `C5.0`
`rand_forest()` (RF)	`mtry`, `min_n`, `trees`	`ranger`, `randomForest`
`boost_tree()` (GBM)	`trees`, `tree_depth`, `learn_rate`	`xgboost`, `lightgbm`, `C5.0`
`logistic_reg()`	`penalty`	`glm`, `glmnet`
`linear_reg()`	`penalty`	`lm`, `glmnet`
`mlp()` (MLP)	`hidden_units`, `penalty`, `epochs`	`nnet`, `keras`
`svm_rbf()` (SVM)	`cost`, `rbf_sigma`	`kernlab`
`nearest_neighbor()`	`neighbors`	`kknn`

Model catalog - tuning and references

mtry: number of predictors considered at each split (RF only).
tune() marks hyperparameters for tune_grid().
Lookup: tidymodels - find models · Engines at a glance.
More depth: parsnip model reference · compare learners in Part B on this deck.

Parsnip: type → engine → mode

Every spec needs: model type → set_engine() → set_mode("classification") (or regression).

rf_tune_spec <- rand_forest(trees = tune(), mtry = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")
extract_parameter_set_dials(rf_tune_spec)

Same recipe, swap the `spec`

Part A gave wf_tree (rec_base + tree_spec). Below: random forest (bagging) and XGBoost (boosting) on the same folds.

Afternoon lab — Task 4.1

Tip

Shared recipe — Day 2-style recipe on microbiome; metadata in id role; ready to swap the spec.

Lab exercises · Solution · Download Rmd

Bagging: random forests

Many trees on bootstrap samples of rows; each split sees only a random subset of predictors → lower variance than one deep tree.
rand_forest(mtry = ..., trees = ..., min_n = ...) |> set_engine("ranger")

Boosting: gradient boosted trees

Trees are added sequentially; each new tree targets errors left by the ensemble so far.
boost_tree(trees = ..., tree_depth = ..., learn_rate = ...) |> set_engine("xgboost")
learn_rate: shrink each tree’s contribution (smaller → more trees needed, often smoother).

Forest and boosting specs (code)

rf_spec <- rand_forest(mtry = 3, trees = 400, min_n = 2) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")
xgb_spec <- boost_tree(trees = 150, tree_depth = 3, learn_rate = 0.05) |>
  set_engine("xgboost") |>
  set_mode("classification")

list(tree = tree_spec, rf = rf_spec, xgb = xgb_spec)

$tree
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 4
  min_n = 10

Computational engine: rpart 


$rf
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 400
  min_n = 2

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 


$xgb
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 150
  tree_depth = 3
  learn_rate = 0.05

Computational engine: xgboost

wf_rf <- workflow() |> add_recipe(rec_base) |> add_model(rf_spec)
wf_xgb <- workflow() |> add_recipe(rec_base) |> add_model(xgb_spec)

The honest evaluation workflow

You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.

Stage	What happens	tidymodels tools (examples)
Hold out test data	Test rows stay unused until the very end	`initial_split()`, `training()`, `testing()`
Tune on training only	CV estimates average performance while searching hyperparameters	`vfold_cv()`, `tune_grid()`, `select_best()`
Compare learners	Same folds + recipe; pick the best model type after tuning each	`fit_resamples()`, `collect_metrics()`
Final report	Refit on all training rows; score once on test	`last_fit()`

Recipe steps (dummy, scale, impute, PCA, upsample, …) must be fit inside each CV fold on training rows only - otherwise information leaks from held-out data.
fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.
Chapter 4 — train/test split walks through last_fit() with a real train/test split.

Cross-validated accuracy (fit three specs)

set.seed(7)
rs_rf <- fit_resamples(wf_rf, folds, metrics = metrics_acc)
rs_xgb <- fit_resamples(wf_xgb, folds, metrics = metrics_acc)

cmp_ens <- bind_rows(
  collect_metrics(rs_tree) |> mutate(model = "Single tree"),
  collect_metrics(rs_rf) |> mutate(model = "Random forest"),
  collect_metrics(rs_xgb) |> mutate(model = "XGBoost")
) |>
  filter(.metric == "accuracy")

Cross-validated accuracy (results table)

# A tibble: 3 × 3
  model          mean std_err
  <chr>         <dbl>   <dbl>
1 Random forest 0.996 0.00377
2 XGBoost       0.993 0.00458
3 Single tree   0.992 0.00462

Cross-validated accuracy (bar chart)

Afternoon lab — Task 4.2

Tip

Model shoot-out — fit RF, XGBoost, MLP with group_vfold_cv(Individual); compare ROC AUC (or accuracy).

Lab exercises · Solution · Download Rmd

Fixed settings here for speed - in projects, wrap each spec in tune_grid() like Day 2 (Tuesday) pipeline.

Boosting: watch the boundary sharpen

We fix tree_depth = 2 and learn_rate = 0.3, then increase trees. Each round adds small trees that correct leftover mistakes - the Gentoo probability surface becomes more nonlinear. We show rounds 1, 3, and 6 (the pattern is similar in between).

boost_tree(trees = 6, tree_depth = 2, learn_rate = 0.3) |>
  set_engine("xgboost") |>
  set_mode("classification")

Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 6
  tree_depth = 2
  learn_rate = 0.3

Computational engine: xgboost

Boosting round 1

One shallow tree: rough split - mostly underfits the Gentoo cloud.

Boosting round 3

Boosting round 6

Bagging vs boosting (same bill plane)

Bagging (RF): many trees on bootstrap samples; predictions average → often smoother boundaries.
Boosting (XGBoost / GBM): trees added sequentially to fix errors → can look sharper after several rounds (previous slides).

Random forest boundary (bagging)

XGBoost boundary after 6 trees (boosting)

Neural network (small MLP) - schematic

Neural network (small MLP)

MLP = multilayer perceptron: one hidden layer of neurons; hidden_units sets its width.
epochs: how many full passes through the training data.
mlp(hidden_units = ..., penalty = ..., epochs = ...) |> set_engine("nnet") - weight decay via penalty.
Requires scaling (already in rec_base via step_normalize).
On small tabular data, trees often match or beat MLPs - still worth knowing the same workflow interface.

MLP spec and workflow (code)

mlp_spec <- mlp(hidden_units = 8, penalty = 0.1, epochs = 150) |>
  set_engine("nnet", trace = FALSE) |>
  set_mode("classification")

wf_mlp <- workflow() |>
  add_recipe(rec_base) |>
  add_model(mlp_spec)

MLP vs forests (cross-validated accuracy)

set.seed(9)
rs_mlp <- fit_resamples(wf_mlp, folds, metrics = metrics_acc)
mlp_cmp <- bind_rows(
  collect_metrics(rs_rf) |> mutate(model = "Random forest"),
  collect_metrics(rs_xgb) |> mutate(model = "XGBoost"),
  collect_metrics(rs_mlp) |> mutate(model = "MLP (nnet)")
) |>
  filter(.metric == "accuracy")

# A tibble: 3 × 4
  model         .metric   mean std_err
  <chr>         <chr>    <dbl>   <dbl>
1 Random forest accuracy 0.996 0.00377
2 XGBoost       accuracy 0.993 0.00458
3 MLP (nnet)    accuracy 1     0

MLP decision boundary (bill plane)

Logistic regression boundary (bill plane)

Three-way comparison: see the cross-validated accuracy bar chart above.

Bridge: from model choice to feature engineering

Same workflow, new questions:

Part B: which model family fits best?
Part C: can PCA inside the recipe help?
Part D: how do we handle missing values in the recipe?
Parts E–F: which metrics and interpretation tools matter when classes are imbalanced?

Part C - Dimension reduction (accuracy only)

Extension: principal component analysis (PCA) in the recipe

PCA combines correlated numeric predictors into orthogonal principal components - fit step_pca() on training folds only, same as step_normalize().

PCA motivation: correlated measurements

For PCA we add back flipper length and body mass - bills and size track together on penguins.

Recipe with `step_pca()` (code)

rec_pca <- recipe(y ~ ., data = peng_pca) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 2)

Steps: zero-variance removal → dummies → scale numerics → retain 2 principal components.

Variance explained by each component

PCA biplot - scores vs loadings

Points = scores: where each penguin sits in PC1-PC2 space (compressed view of all numerics).
Arrows = loadings (preview): which original measurements push in which direction - detail on the next slides.

What are PCA loadings?

After step_normalize(), each component is a weighted sum of the scaled columns:

\text{PC1} = w_{11}\,\text{bill\_length} + w_{12}\,\text{bill\_depth} + \cdots

Loadings = those weights w_{jk} (stored in the rotation matrix from prcomp() / step_pca()).
Sign: positive loading → birds above average on that measurement tend to have higher scores on that PC.
Magnitude: larger |\text{loading}| → that measurement matters more for that component (after scaling).

Scores tell you where a penguin is; loadings tell you which measurements built that axis.

Loadings on PC1 and PC2 (bar chart)

Each bar = one original column’s weight on that component (predictors were scaled before PCA).

How to read these loadings (penguins)

Loadings (rounded) - same numerics as the plot above
	measurement	PC1	PC2
bill_length_mm	bill_length_mm	0.52	-0.03
bill_depth_mm	bill_depth_mm	-0.40	0.03
flipper_length_mm	flipper_length_mm	0.54	0.04
body_mass_g	body_mass_g	0.52	-0.09
year	year	0.05	0.99

PC1 loads heavily on bill length, flipper length, body mass (and bill depth negatively) → a “overall size” axis: big birds vs small birds along PC1.
PC2 is dominated by year → mostly study year, not morphology (weak for separating species here).
Takeaway: PCA merged correlated size measures into PC1; the model later sees PC1/PC2, not “mm of bill” directly - harder to explain biologically than raw columns.

XGBoost with PCA vs without (same folds)

set.seed(8)
rs_pca <- fit_resamples(wf_pca, folds_pca, metrics = metrics_acc)
set.seed(8)
rs_no_pca <- fit_resamples(wf_no_pca, folds_pca, metrics = metrics_acc)

XGBoost with PCA vs without (accuracy table)

# A tibble: 2 × 3
  model                    mean std_err
  <chr>                   <dbl>   <dbl>
1 XGBoost + PCA (2 comp.) 0.996 0.00377
2 XGBoost, no PCA         0.989 0.00459

XGBoost with PCA vs without (bar chart)

PCA tradeoffs and leakage

Tradeoff: components are not “bill mm” anymore - VIP on PCA columns is harder to explain biologically.
Leakage check: always fit PCA inside workflow() + resampling, never on the full table before splitting.

Bridge: from PCA to missing data

Same workflow, new recipe stress: we now handle missing values with imputation steps inside cross-validation.

Part D - Missing data (accuracy only)

Extension: imputation inside the recipe (`peng_na`)

Add step_impute_*() before step_zv() / dummies / normalize - never drop_na() on the full table before splitting.

Artificial missing data: bill length

We set ~15% of bill_length_mm (and a few sex) to NA (missing values - “not available”) to practice imputation - not a real sensor failure pattern.

Wrong fix: drop rows with `drop_na()`

tibble(
  rows_before = nrow(peng_na),
  rows_after_drop_na = nrow(tidyr::drop_na(peng_na))
)

# A tibble: 1 × 2
  rows_before rows_after_drop_na
        <int>              <int>
1         265                238

Dropping loses penguins and can bias who remains - never drop on the full table before splitting.

Right fix: impute inside the recipe

step_impute_median() for numeric bill_length_mm; step_impute_mode() for categorical sex.
Then the usual step_zv → step_dummy → step_normalize chain from Part A.

rec_impute <- recipe(y ~ ., data = peng_na) |>
  step_impute_median(bill_length_mm) |>
  step_impute_mode(sex) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

Before imputation

Rows with missing bill length sit in the left margin (pink band) - we do not know their x-position yet.

After imputation

Open circles = true bill length (simulation only). Gold = median imputed value for the same penguin. Orange arrows connect true → imputed. Gray points are other birds (observed bill length).

sum(is.na(bake(prep_impute, peng_na)$bill_length_mm))

[1] 0

Imputation inside cross-validation

xgb_na_spec <- boost_tree(trees = 80, tree_depth = 3, learn_rate = 0.08) |>
  set_engine("xgboost") |>
  set_mode("classification")
wf_na <- workflow() |>
  add_recipe(rec_impute) |>
  add_model(xgb_na_spec)
folds_na <- vfold_cv(peng_na, v = 5, strata = y)
set.seed(14)
rs_na <- fit_resamples(wf_na, folds_na, metrics = metrics_acc)
collect_metrics(rs_na) |>
  filter(.metric == "accuracy") |>
  select(mean, std_err)

# A tibble: 1 × 2
   mean std_err
  <dbl>   <dbl>
1 0.996 0.00370

The workflow accepts NA in raw rows; each fold learns imputation on analysis data only.

Metrics and deployment (reminders)

Pick metrics before peeking at holdout data; report subgroup tables when ethically appropriate.
Data card: document NA handling and class prevalence before modeling.
Model card: intended use, limits, monitoring - finish the draft you started Monday.

Continue: Day 2 (Tuesday) pipeline.

Part E - Imbalance, metrics, and confusion matrices

Bridge: from binary metrics to multiclass imbalance

We move from Adelie vs Gentoo to three species, with a deliberately rare Chinstrap class. Accuracy alone is no longer enough — we add macro recall/F1 and per-class recall.

Multiclass imbalance (`peng_imb3`)

We now use three-species classification (Adelie, Gentoo, Chinstrap). For a clean imbalance lesson, we keep only 15 Chinstrap rows (deterministic selection), while Adelie and Gentoo stay abundant.

Why accuracy alone can still mislead

majority_label <- peng_imb3 |>
  dplyr::count(y3, sort = TRUE) |>
  dplyr::slice(1) |>
  dplyr::pull(y3) |>
  as.character()

tibble(
  strategy = paste("Always predict", majority_label),
  accuracy_on_peng_imb3 = mean(as.character(peng_imb3$y3) == majority_label)
)

# A tibble: 1 × 2
  strategy              accuracy_on_peng_imb3
  <chr>                                 <dbl>
1 Always predict Adelie                 0.521

Overall accuracy can look fine while one class (here Chinstrap) is mostly missed.
So we track macro recall, macro F1, and per-class recall.

Recipes and model (`peng_imb3`)

mc_tree_spec <- decision_tree(tree_depth = 4, min_n = 20) |>
  set_engine("rpart") |>
  set_mode("classification")

rec_imb3 <- recipe(y3 ~ ., data = peng_imb3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

rec_imb3_up <- recipe(y3 ~ ., data = peng_imb3) |>
  step_upsample(y3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

wf_imb3 <- workflow() |> add_recipe(rec_imb3) |> add_model(mc_tree_spec)
wf_imb3_up <- workflow() |> add_recipe(rec_imb3_up) |> add_model(mc_tree_spec)

Cross-validated metrics (multiclass)

set.seed(11)
rs_imb3 <- fit_resamples(wf_imb3, folds_imb3, metrics = metrics_cls_imb3)
set.seed(12)
rs_imb3_up <- fit_resamples(wf_imb3_up, folds_imb3, metrics = metrics_cls_imb3)

compare_metrics_tbl_multiclass(
  "No upsample" = rs_imb3,
  "Upsample" = rs_imb3_up
)

Multiclass CV metrics: mean (std err)
.metric	No upsample	Upsample
accuracy	0.968 (0.028)	0.971 (0.012)
f1_macro	0.945 (0.052)	0.940 (0.018)
recall_macro	0.945 (0.052)	0.970 (0.019)

Holdout confusion matrix: no upsample

set.seed(13)
split_imb3 <- initial_split(peng_imb3, prop = 0.8, strata = y3)
fit_imb3 <- fit(wf_imb3, training(split_imb3))
pred_imb3 <- augment(fit_imb3, testing(split_imb3))

pred_imb3 |>
  conf_mat(truth = y3, estimate = .pred_class)

           Truth
Prediction  Adelie Gentoo Chinstrap
  Adelie        30      0         3
  Gentoo         0     23         0
  Chinstrap      0      1         0

Holdout confusion matrix: with upsample

fit_imb3_up <- fit(wf_imb3_up, training(split_imb3))
pred_imb3_up <- augment(fit_imb3_up, testing(split_imb3))

pred_imb3_up |>
  conf_mat(truth = y3, estimate = .pred_class)

           Truth
Prediction  Adelie Gentoo Chinstrap
  Adelie        30      0         2
  Gentoo         0     23         0
  Chinstrap      0      1         1

Per-class recall (holdout)

In the confusion matrix: rows = true species, columns = predicted species.
Upsampling is not a free lunch, but it often improves minority-class recall on this engineered imbalance.
Use macro/per-class metrics when one class is underrepresented.

Part F - Variable importance

Bridge: from metrics to interpretation

After choosing metrics for imbalance, we ask a different question: which predictors drove the fitted model? (VIP and SHAP in the next section.)

Data the forest sees (bill plane)

Variable importance (VIP)

VIP = variable importance plot - how often splits use each baked predictor (impurity in ranger).
Correlated bill measures can share importance - not proof that changing one bill mm causes species shift.

Variable importance plot

Afternoon lab — Task 4.3

Tip

Variable importance (VIP) — VIP on a fitted tree-based model (e.g. random forest); top OTUs, not causal claims.

Lab exercises · Solution · Download Rmd

Permutation importance (pointer)

Permutation importance: shuffle one column, watch metric drop - model-agnostic; see the VIP plot above and the SHAP notebook.

SHAP - what it means (game theory intuition)

One penguin, one prediction: the model outputs P(male) (or log-odds). SHAP asks: how much did each measurement move this bird away from a baseline?
Players and coalitions: treat each feature as a player. A coalition is any subset of features with the others averaged over the data. The model’s output on that coalition is the coalition’s payout.
Shapley value: for each feature, average its marginal contribution over all orders in which features can join the coalition. That fair share is the SHAP value (_j) for feature (j).
Additivity: baseline (_0) plus all feature contributions sums to the model output on the link scale (for this forest, P(male)):

\phi_0 + \sum_j \phi_{i,j} \approx \hat{p}_i(\text{male})

VIP vs SHAP: VIP ranks predictors globally (how often splits use them). SHAP explains one row (why this penguin scored more male or female).

SHAP task - sex without `species`

Outcome: sex (female / male). species is omitted on purpose so the task is not trivial (see sex notebook).
Model: same recipe + random forest (ranger) pattern as Tuesday; we fit on all complete rows here so SHAP curves are stable in class (in production, fit SHAP on training data only).
Packages: shapviz for plots; kernelshap to compute values on the fitted ranger forest.

Compute SHAP in R (workflow + `bake`)

pg_sex <- prep_penguins_sex()

rec_sex <- recipe(sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
                    body_mass_g + island + year, data = pg_sex) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_sex_spec <- rand_forest(trees = 300, mtry = 3, min_n = 2) |>
  set_engine("ranger", probability = TRUE) |>
  set_mode("classification")

wf_sex <- workflow() |>
  add_recipe(rec_sex) |>
  add_model(rf_sex_spec)

fit_sex <- fit(wf_sex, pg_sex)

`bake()` — why not raw penguins?

The forest was not trained on the original table. Inside fit(), the recipe ran first: drop empty columns, dummy-code island, z-score bill measurements, and so on. SHAP must perturb those columns — the ones ranger actually split on.

`bake()` — three steps (words)

Extract recipe — extract_recipe(fit_sex, estimated = TRUE) pulls out the recipe with all learned numbers (means, SDs, dummy levels) stored at fit time.
prep() — finalises the preprocessing plan on pg_sex; does not refit the forest (same idea as Tuesday).
bake() — applies that plan: raw rows in → X_model out (numeric predictors only, same encoding as training).

`bake()` — code

rec_prepped <- prep(extract_recipe(fit_sex, estimated = TRUE), training = pg_sex)
X_model <- bake(rec_prepped, new_data = pg_sex, all_predictors())

head(X_model, 3)

`bake()` — why SHAP cares

kernelshap() calls the model many times with partial rows of numbers. If you pass raw bill_length_mm while the forest expects scaled bills plus island_Torgersen, predictions — and SHAP — are wrong. X_model is the spreadsheet the trees were trained on.

KernelSHAP — the problem

SHAP needs: “given these feature values, what is P(male)?” The forest sits inside a workflow; we extract the bare ranger object and define prediction ourselves.

KernelSHAP — code

rf_engine <- extract_fit_parsnip(fit_sex)$fit

pred_fun <- function(object, X_new) {
  as.numeric(predict(object, data = X_new)$predictions[, "male"])
}

set.seed(11)
X_shap <- dplyr::slice_sample(X_model, n = 80)
bg_X <- dplyr::slice_sample(X_model, n = 35)

ks <- kernelshap(rf_engine, X = X_shap, pred_fun = pred_fun, bg_X = bg_X)
shp_sex <- shapviz(ks, X_pred = X_shap)

KernelSHAP — `rf_engine` and `pred_fun`

rf_engine — unwrap the ranger fit from tidymodels; SHAP talks to this object.
pred_fun — bridge from baked matrix → P(male); SHAP does not know about recipes or factor levels.

KernelSHAP — `X_shap` and `bg_X`

X_shap — ~80 random rows of X_model to explain (SHAP is expensive).
bg_X — ~35 background rows: when a feature is left out of a coalition, values are drawn from here (“typical” birds, not zeros).

KernelSHAP — `kernelshap()` and `shapviz()`

kernelshap(...) — approximate Shapley values for each row in X_shap, vs baseline bg_X.
shapviz(...) — format for beeswarm / waterfall plots; X_pred supplies feature values for colour.

Memory hook: bake() → spreadsheet the forest saw; X_shap → birds we explain; bg_X → reference flock; pred_fun → score that spreadsheet.

Afternoon lab — Task 4.4

Tip

SHAP values — kernel SHAP on a top-10 VIP forest only; beeswarm for a dozen microbiome samples.

Lab exercises · Solution · Download Rmd

SHAP beeswarm - global pattern

y-axis: features; x-axis: SHAP contribution toward P(male).
Color: feature value (high vs low). Wide spread = this predictor matters for many birds in this model.

SHAP waterfall - two local explanations

Male example (one bird labeled male in the SHAP sample):

Female example (contrasting profile):

Reading SHAP without over-claiming

Reasonable claims

For this penguin, higher body mass (in the model, holding the coalition story fixed) pushed P(male) up or down.
Globally, bill depth and mass show the widest SHAP spread in this fitted RF.
SHAP describes the trained model, not a randomized experiment in the wild.

Avoid

“Bill length causes males” - observational data; morphometrics are correlated (data card).
“We should intervene on flipper length” - policy from association.
“SHAP proves fairness by island” - needs explicit subgroup metrics, not one plot.

Humility prompt (assessment): What would you not claim from this SHAP beeswarm?

Full write-up: SHAP notebook

Part G - Wrap-up

Afternoon lab — Task 4.5

Tip

Wrap up honestly — metrics comparison figure + one sentence you would not claim from VIP/SHAP on this dataset.

Lab exercises · Solution · Download Rmd

Nested resampling (when tuning gets serious)

Nested CV: an outer cross-validation loop estimates performance; an inner loop tunes hyperparameters inside each outer training fold.
Avoid tuning and scoring on the same held-out rows (optimistic bias).

End-of-day checklist

Name three parsnip model types and a valid engine for each
Why must step_pca() and step_impute_*() live inside the recipe?
When might step_upsample() not improve CV scores?
Which metric would you report for a rare-event screening study?
One reason VIP on correlated bills is not a causal claim
One sentence you would not say about a SHAP plot on penguin sex

Day 4 (Thursday): Models, Preprocessing, Metrics, and Importance

Learning objectives

Navigation

Terms used today (abbreviations)

Part A - Recap (Tuesday’s pipeline)

Palmer penguins - back again

What you already have

Teaching scenarios (Part C extensions)

Part B - Many models (accuracy only)

Model catalog and engines

Model catalog - tuning and references

Parsnip: type → engine → mode

Same recipe, swap the spec

Afternoon lab — Task 4.1

Bagging: random forests

Boosting: gradient boosted trees

Forest and boosting specs (code)

The honest evaluation workflow

Cross-validated accuracy (fit three specs)

Cross-validated accuracy (results table)

Cross-validated accuracy (bar chart)

Afternoon lab — Task 4.2

Boosting: watch the boundary sharpen

Boosting round 1

Boosting round 3

Boosting round 6

Bagging vs boosting (same bill plane)

Random forest boundary (bagging)

XGBoost boundary after 6 trees (boosting)

Neural network (small MLP) - schematic

Neural network (small MLP)

MLP spec and workflow (code)

MLP vs forests (cross-validated accuracy)

MLP decision boundary (bill plane)

Logistic regression boundary (bill plane)

Bridge: from model choice to feature engineering

Part C - Dimension reduction (accuracy only)

Extension: principal component analysis (PCA) in the recipe

PCA motivation: correlated measurements

Recipe with step_pca() (code)

Variance explained by each component

PCA biplot - scores vs loadings

What are PCA loadings?

Loadings on PC1 and PC2 (bar chart)

How to read these loadings (penguins)

XGBoost with PCA vs without (same folds)

XGBoost with PCA vs without (accuracy table)

XGBoost with PCA vs without (bar chart)

PCA tradeoffs and leakage

Bridge: from PCA to missing data

Part D - Missing data (accuracy only)

Extension: imputation inside the recipe (peng_na)

Artificial missing data: bill length

Wrong fix: drop rows with drop_na()

Right fix: impute inside the recipe

Before imputation

After imputation

Imputation inside cross-validation

Metrics and deployment (reminders)

Part E - Imbalance, metrics, and confusion matrices

Bridge: from binary metrics to multiclass imbalance

Multiclass imbalance (peng_imb3)

Why accuracy alone can still mislead

Recipes and model (peng_imb3)

Cross-validated metrics (multiclass)

Holdout confusion matrix: no upsample

Holdout confusion matrix: with upsample

Per-class recall (holdout)

Part F - Variable importance

Bridge: from metrics to interpretation

Data the forest sees (bill plane)

Variable importance (VIP)

Variable importance plot

Afternoon lab — Task 4.3

Permutation importance (pointer)

SHAP - what it means (game theory intuition)

SHAP task - sex without species

Compute SHAP in R (workflow + bake)

bake() — why not raw penguins?

bake() — three steps (words)

Same recipe, swap the `spec`

Recipe with `step_pca()` (code)

Extension: imputation inside the recipe (`peng_na`)

Wrong fix: drop rows with `drop_na()`

Multiclass imbalance (`peng_imb3`)

Recipes and model (`peng_imb3`)

SHAP task - sex without `species`

Compute SHAP in R (workflow + `bake`)

`bake()` — why not raw penguins?

`bake()` — three steps (words)

`bake()` — code

`bake()` — why SHAP cares

KernelSHAP — `rf_engine` and `pred_fun`

KernelSHAP — `X_shap` and `bg_X`

KernelSHAP — `kernelshap()` and `shapviz()`