Day 4 (Thursday): Models, Preprocessing, Metrics, and Importance

Learning objectives

  • Swap parsnip specs (bagging, boosting, neural nets) on the same recipe + workflow from Tuesday
  • Compare many models fairly with fit_resamples() using accuracy until imbalance
  • Add dimension reduction, imputation, and resampling as recipe steps inside cross-validation (CV)
  • Choose metrics beyond accuracy (sensitivity, precision-recall, confusion matrices) for imbalanced data
  • Read variable importance without causal over-interpretation

Terms used today (abbreviations)

Term Meaning
CV Cross-validation - resample data into folds; train on some rows, evaluate on held-out rows
PCA Principal component analysis - linear combinations of correlated predictors
NA Missing value (not recorded)
ROC Receiver operating characteristic curve - tradeoff of true vs false positives across thresholds
PR Precision-recall curve - precision vs recall across thresholds
AUC Area under the curve - one-number summary of a threshold curve (e.g. ROC-AUC, PR-AUC)
VIP Variable importance - which predictors the model used most in splits
MLP Multilayer perceptron - small feed-forward neural network
SHAP SHapley additive explanations - local “why this prediction?” scores
RF Random forest - bagged ensemble of decision trees
GBM Gradient boosting machine - sequential trees that correct prior errors
XGBoost Extreme gradient boosting - popular GBM engine in parsnip
SVM Support vector machine - margin-based classifier (svm_rbf() uses a radial kernel)
TP / FP / FN / TN True/false positive/negative - counts from a confusion matrix (Part E)

Part A - Recap (Tuesday’s pipeline)

Palmer penguins - back again

Same peng_bal table and rec_base recipe as Tuesday - today we swap models and add recipe steps.

What you already have

  • rec_base - step_zv (drop zero-variance columns) → step_dummy (one-hot categoricals) → step_normalize (scale numerics) on peng_bal
  • workflow() + decision_tree / rpart tuned on Tuesday
  • vfold_cv(..., strata = y) - 5-fold cross-validation (CV) with both species in each fold
  • metric_set(accuracy) for Parts B-D today (more metrics in Part E)

Teaching scenarios (Part C extensions)

We deliberately stress the penguins in three ways:

# A tibble: 3 × 2
  y3            n
  <fct>     <int>
1 Adelie      146
2 Gentoo      119
3 Chinstrap    15
[1] 22
[1] 8
Object Stress Recipe extension (Part C)
peng_imb3 Three species; downsampled Chinstrap (~15 Chinstrap vs abundant Adelie & Gentoo) step_upsample(y3) - Part E metrics
peng_na Missing bill length and some sex step_impute_*()
peng_pca Extra size measures (correlated) step_pca()

Part B - Many models (accuracy only)

Rule: keep add_recipe(rec_base) and folds - change add_model(spec) only.

Model catalog and engines

Same rec_base from Tuesday - change spec (and set_engine()); keep workflow() + folds.

Model spec Typical tuned knobs Common engines
decision_tree() tree_depth, min_n, cost_complexity rpart, C5.0
rand_forest() (RF) mtry, min_n, trees ranger, randomForest
boost_tree() (GBM) trees, tree_depth, learn_rate xgboost, lightgbm, C5.0
logistic_reg() penalty glm, glmnet
linear_reg() penalty lm, glmnet
mlp() (MLP) hidden_units, penalty, epochs nnet, keras
svm_rbf() (SVM) cost, rbf_sigma kernlab
nearest_neighbor() neighbors kknn

Model catalog - tuning and references

Parsnip: type → engine → mode

Every spec needs: model typeset_engine()set_mode("classification") (or regression).

rf_tune_spec <- rand_forest(trees = tune(), mtry = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")
extract_parameter_set_dials(rf_tune_spec)

Same recipe, swap the spec

Part A gave wf_tree (rec_base + tree_spec). Below: random forest (bagging) and XGBoost (boosting) on the same folds.

Afternoon lab — Task 4.1

Tip

Shared recipe — Day 2-style recipe on microbiome; metadata in id role; ready to swap the spec.

Lab exercises · Solution · Download Rmd

Bagging: random forests

  • Many trees on bootstrap samples of rows; each split sees only a random subset of predictors → lower variance than one deep tree.
  • rand_forest(mtry = ..., trees = ..., min_n = ...) |> set_engine("ranger")

Boosting: gradient boosted trees

  • Trees are added sequentially; each new tree targets errors left by the ensemble so far.
  • boost_tree(trees = ..., tree_depth = ..., learn_rate = ...) |> set_engine("xgboost")
  • learn_rate: shrink each tree’s contribution (smaller → more trees needed, often smoother).

Forest and boosting specs (code)

rf_spec <- rand_forest(mtry = 3, trees = 400, min_n = 2) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")
xgb_spec <- boost_tree(trees = 150, tree_depth = 3, learn_rate = 0.05) |>
  set_engine("xgboost") |>
  set_mode("classification")

list(tree = tree_spec, rf = rf_spec, xgb = xgb_spec)
$tree
Decision Tree Model Specification (classification)

Main Arguments:
  tree_depth = 4
  min_n = 10

Computational engine: rpart 


$rf
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 3
  trees = 400
  min_n = 2

Engine-Specific Arguments:
  importance = impurity

Computational engine: ranger 


$xgb
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 150
  tree_depth = 3
  learn_rate = 0.05

Computational engine: xgboost 
wf_rf <- workflow() |> add_recipe(rec_base) |> add_model(rf_spec)
wf_xgb <- workflow() |> add_recipe(rec_base) |> add_model(xgb_spec)

The honest evaluation workflow

You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.

Stage What happens tidymodels tools (examples)
Hold out test data Test rows stay unused until the very end initial_split(), training(), testing()
Tune on training only CV estimates average performance while searching hyperparameters vfold_cv(), tune_grid(), select_best()
Compare learners Same folds + recipe; pick the best model type after tuning each fit_resamples(), collect_metrics()
Final report Refit on all training rows; score once on test last_fit()
  • Recipe steps (dummy, scale, impute, PCA, upsample, …) must be fit inside each CV fold on training rows only - otherwise information leaks from held-out data.
  • fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.
  • Chapter 4 — train/test split walks through last_fit() with a real train/test split.

Cross-validated accuracy (fit three specs)

set.seed(7)
rs_rf <- fit_resamples(wf_rf, folds, metrics = metrics_acc)
rs_xgb <- fit_resamples(wf_xgb, folds, metrics = metrics_acc)

cmp_ens <- bind_rows(
  collect_metrics(rs_tree) |> mutate(model = "Single tree"),
  collect_metrics(rs_rf) |> mutate(model = "Random forest"),
  collect_metrics(rs_xgb) |> mutate(model = "XGBoost")
) |>
  filter(.metric == "accuracy")

Cross-validated accuracy (results table)

# A tibble: 3 × 3
  model          mean std_err
  <chr>         <dbl>   <dbl>
1 Random forest 0.996 0.00377
2 XGBoost       0.993 0.00458
3 Single tree   0.992 0.00462

Cross-validated accuracy (bar chart)

Afternoon lab — Task 4.2

Tip

Model shoot-out — fit RF, XGBoost, MLP with group_vfold_cv(Individual); compare ROC AUC (or accuracy).

Lab exercises · Solution · Download Rmd

Fixed settings here for speed - in projects, wrap each spec in tune_grid() like Day 2 (Tuesday) pipeline.

Boosting: watch the boundary sharpen

We fix tree_depth = 2 and learn_rate = 0.3, then increase trees. Each round adds small trees that correct leftover mistakes - the Gentoo probability surface becomes more nonlinear. We show rounds 1, 3, and 6 (the pattern is similar in between).

boost_tree(trees = 6, tree_depth = 2, learn_rate = 0.3) |>
  set_engine("xgboost") |>
  set_mode("classification")
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 6
  tree_depth = 2
  learn_rate = 0.3

Computational engine: xgboost 

Boosting round 1

One shallow tree: rough split - mostly underfits the Gentoo cloud.

Boosting round 3

Boosting round 6

Bagging vs boosting (same bill plane)

  • Bagging (RF): many trees on bootstrap samples; predictions average → often smoother boundaries.
  • Boosting (XGBoost / GBM): trees added sequentially to fix errors → can look sharper after several rounds (previous slides).

Random forest boundary (bagging)

XGBoost boundary after 6 trees (boosting)

Neural network (small MLP) - schematic

Neural network (small MLP)

  • MLP = multilayer perceptron: one hidden layer of neurons; hidden_units sets its width.
  • epochs: how many full passes through the training data.
  • mlp(hidden_units = ..., penalty = ..., epochs = ...) |> set_engine("nnet") - weight decay via penalty.
  • Requires scaling (already in rec_base via step_normalize).
  • On small tabular data, trees often match or beat MLPs - still worth knowing the same workflow interface.

MLP spec and workflow (code)

mlp_spec <- mlp(hidden_units = 8, penalty = 0.1, epochs = 150) |>
  set_engine("nnet", trace = FALSE) |>
  set_mode("classification")

wf_mlp <- workflow() |>
  add_recipe(rec_base) |>
  add_model(mlp_spec)

MLP vs forests (cross-validated accuracy)

set.seed(9)
rs_mlp <- fit_resamples(wf_mlp, folds, metrics = metrics_acc)
mlp_cmp <- bind_rows(
  collect_metrics(rs_rf) |> mutate(model = "Random forest"),
  collect_metrics(rs_xgb) |> mutate(model = "XGBoost"),
  collect_metrics(rs_mlp) |> mutate(model = "MLP (nnet)")
) |>
  filter(.metric == "accuracy")
# A tibble: 3 × 4
  model         .metric   mean std_err
  <chr>         <chr>    <dbl>   <dbl>
1 Random forest accuracy 0.996 0.00377
2 XGBoost       accuracy 0.993 0.00458
3 MLP (nnet)    accuracy 1     0      

MLP decision boundary (bill plane)

Logistic regression boundary (bill plane)

Three-way comparison: see the cross-validated accuracy bar chart above.

Bridge: from model choice to feature engineering

Same workflow, new questions:

  • Part B: which model family fits best?
  • Part C: can PCA inside the recipe help?
  • Part D: how do we handle missing values in the recipe?
  • Parts E–F: which metrics and interpretation tools matter when classes are imbalanced?

Part C - Dimension reduction (accuracy only)

Extension: principal component analysis (PCA) in the recipe

PCA combines correlated numeric predictors into orthogonal principal components - fit step_pca() on training folds only, same as step_normalize().

PCA motivation: correlated measurements

For PCA we add back flipper length and body mass - bills and size track together on penguins.

Recipe with step_pca() (code)

rec_pca <- recipe(y ~ ., data = peng_pca) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 2)

Steps: zero-variance removal → dummies → scale numerics → retain 2 principal components.

Variance explained by each component

PCA biplot - scores vs loadings

  • Points = scores: where each penguin sits in PC1-PC2 space (compressed view of all numerics).
  • Arrows = loadings (preview): which original measurements push in which direction - detail on the next slides.

What are PCA loadings?

After step_normalize(), each component is a weighted sum of the scaled columns:

\text{PC1} = w_{11}\,\text{bill\_length} + w_{12}\,\text{bill\_depth} + \cdots

  • Loadings = those weights w_{jk} (stored in the rotation matrix from prcomp() / step_pca()).
  • Sign: positive loading → birds above average on that measurement tend to have higher scores on that PC.
  • Magnitude: larger |\text{loading}| → that measurement matters more for that component (after scaling).

Scores tell you where a penguin is; loadings tell you which measurements built that axis.

Loadings on PC1 and PC2 (bar chart)

Each bar = one original column’s weight on that component (predictors were scaled before PCA).

How to read these loadings (penguins)

Loadings (rounded) - same numerics as the plot above
measurement PC1 PC2
bill_length_mm bill_length_mm 0.52 -0.03
bill_depth_mm bill_depth_mm -0.40 0.03
flipper_length_mm flipper_length_mm 0.54 0.04
body_mass_g body_mass_g 0.52 -0.09
year year 0.05 0.99
  • PC1 loads heavily on bill length, flipper length, body mass (and bill depth negatively) → a “overall size” axis: big birds vs small birds along PC1.
  • PC2 is dominated by year → mostly study year, not morphology (weak for separating species here).
  • Takeaway: PCA merged correlated size measures into PC1; the model later sees PC1/PC2, not “mm of bill” directly - harder to explain biologically than raw columns.

XGBoost with PCA vs without (same folds)

set.seed(8)
rs_pca <- fit_resamples(wf_pca, folds_pca, metrics = metrics_acc)
set.seed(8)
rs_no_pca <- fit_resamples(wf_no_pca, folds_pca, metrics = metrics_acc)

XGBoost with PCA vs without (accuracy table)

# A tibble: 2 × 3
  model                    mean std_err
  <chr>                   <dbl>   <dbl>
1 XGBoost + PCA (2 comp.) 0.996 0.00377
2 XGBoost, no PCA         0.989 0.00459

XGBoost with PCA vs without (bar chart)

PCA tradeoffs and leakage

  • Tradeoff: components are not “bill mm” anymore - VIP on PCA columns is harder to explain biologically.
  • Leakage check: always fit PCA inside workflow() + resampling, never on the full table before splitting.

Bridge: from PCA to missing data

Same workflow, new recipe stress: we now handle missing values with imputation steps inside cross-validation.

Part D - Missing data (accuracy only)

Extension: imputation inside the recipe (peng_na)

Add step_impute_*() before step_zv() / dummies / normalize - never drop_na() on the full table before splitting.

Artificial missing data: bill length

We set ~15% of bill_length_mm (and a few sex) to NA (missing values - “not available”) to practice imputation - not a real sensor failure pattern.

Wrong fix: drop rows with drop_na()

tibble(
  rows_before = nrow(peng_na),
  rows_after_drop_na = nrow(tidyr::drop_na(peng_na))
)
# A tibble: 1 × 2
  rows_before rows_after_drop_na
        <int>              <int>
1         265                238

Dropping loses penguins and can bias who remains - never drop on the full table before splitting.

Right fix: impute inside the recipe

  • step_impute_median() for numeric bill_length_mm; step_impute_mode() for categorical sex.
  • Then the usual step_zvstep_dummystep_normalize chain from Part A.
rec_impute <- recipe(y ~ ., data = peng_na) |>
  step_impute_median(bill_length_mm) |>
  step_impute_mode(sex) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

Before imputation

Rows with missing bill length sit in the left margin (pink band) - we do not know their x-position yet.

After imputation

Open circles = true bill length (simulation only). Gold = median imputed value for the same penguin. Orange arrows connect true → imputed. Gray points are other birds (observed bill length).

sum(is.na(bake(prep_impute, peng_na)$bill_length_mm))
[1] 0

Imputation inside cross-validation

xgb_na_spec <- boost_tree(trees = 80, tree_depth = 3, learn_rate = 0.08) |>
  set_engine("xgboost") |>
  set_mode("classification")
wf_na <- workflow() |>
  add_recipe(rec_impute) |>
  add_model(xgb_na_spec)
folds_na <- vfold_cv(peng_na, v = 5, strata = y)
set.seed(14)
rs_na <- fit_resamples(wf_na, folds_na, metrics = metrics_acc)
collect_metrics(rs_na) |>
  filter(.metric == "accuracy") |>
  select(mean, std_err)
# A tibble: 1 × 2
   mean std_err
  <dbl>   <dbl>
1 0.996 0.00370

The workflow accepts NA in raw rows; each fold learns imputation on analysis data only.

Metrics and deployment (reminders)

  • Pick metrics before peeking at holdout data; report subgroup tables when ethically appropriate.
  • Data card: document NA handling and class prevalence before modeling.
  • Model card: intended use, limits, monitoring - finish the draft you started Monday.

Continue: Day 2 (Tuesday) pipeline.

Part E - Imbalance, metrics, and confusion matrices

Bridge: from binary metrics to multiclass imbalance

We move from Adelie vs Gentoo to three species, with a deliberately rare Chinstrap class. Accuracy alone is no longer enough — we add macro recall/F1 and per-class recall.

Multiclass imbalance (peng_imb3)

We now use three-species classification (Adelie, Gentoo, Chinstrap). For a clean imbalance lesson, we keep only 15 Chinstrap rows (deterministic selection), while Adelie and Gentoo stay abundant.

Why accuracy alone can still mislead

majority_label <- peng_imb3 |>
  dplyr::count(y3, sort = TRUE) |>
  dplyr::slice(1) |>
  dplyr::pull(y3) |>
  as.character()

tibble(
  strategy = paste("Always predict", majority_label),
  accuracy_on_peng_imb3 = mean(as.character(peng_imb3$y3) == majority_label)
)
# A tibble: 1 × 2
  strategy              accuracy_on_peng_imb3
  <chr>                                 <dbl>
1 Always predict Adelie                 0.521
  • Overall accuracy can look fine while one class (here Chinstrap) is mostly missed.
  • So we track macro recall, macro F1, and per-class recall.

Recipes and model (peng_imb3)

mc_tree_spec <- decision_tree(tree_depth = 4, min_n = 20) |>
  set_engine("rpart") |>
  set_mode("classification")

rec_imb3 <- recipe(y3 ~ ., data = peng_imb3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

rec_imb3_up <- recipe(y3 ~ ., data = peng_imb3) |>
  step_upsample(y3) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

wf_imb3 <- workflow() |> add_recipe(rec_imb3) |> add_model(mc_tree_spec)
wf_imb3_up <- workflow() |> add_recipe(rec_imb3_up) |> add_model(mc_tree_spec)

Cross-validated metrics (multiclass)

set.seed(11)
rs_imb3 <- fit_resamples(wf_imb3, folds_imb3, metrics = metrics_cls_imb3)
set.seed(12)
rs_imb3_up <- fit_resamples(wf_imb3_up, folds_imb3, metrics = metrics_cls_imb3)

compare_metrics_tbl_multiclass(
  "No upsample" = rs_imb3,
  "Upsample" = rs_imb3_up
)
Multiclass CV metrics: mean (std err)
.metric No upsample Upsample
accuracy 0.968 (0.028) 0.971 (0.012)
f1_macro 0.945 (0.052) 0.940 (0.018)
recall_macro 0.945 (0.052) 0.970 (0.019)

Holdout confusion matrix: no upsample

set.seed(13)
split_imb3 <- initial_split(peng_imb3, prop = 0.8, strata = y3)
fit_imb3 <- fit(wf_imb3, training(split_imb3))
pred_imb3 <- augment(fit_imb3, testing(split_imb3))

pred_imb3 |>
  conf_mat(truth = y3, estimate = .pred_class)
           Truth
Prediction  Adelie Gentoo Chinstrap
  Adelie        30      0         3
  Gentoo         0     23         0
  Chinstrap      0      1         0

Holdout confusion matrix: with upsample

fit_imb3_up <- fit(wf_imb3_up, training(split_imb3))
pred_imb3_up <- augment(fit_imb3_up, testing(split_imb3))

pred_imb3_up |>
  conf_mat(truth = y3, estimate = .pred_class)
           Truth
Prediction  Adelie Gentoo Chinstrap
  Adelie        30      0         2
  Gentoo         0     23         0
  Chinstrap      0      1         1

Per-class recall (holdout)

  • In the confusion matrix: rows = true species, columns = predicted species.
  • Upsampling is not a free lunch, but it often improves minority-class recall on this engineered imbalance.
  • Use macro/per-class metrics when one class is underrepresented.

Part F - Variable importance

Bridge: from metrics to interpretation

After choosing metrics for imbalance, we ask a different question: which predictors drove the fitted model? (VIP and SHAP in the next section.)

Data the forest sees (bill plane)

Variable importance (VIP)

  • VIP = variable importance plot - how often splits use each baked predictor (impurity in ranger).
  • Correlated bill measures can share importance - not proof that changing one bill mm causes species shift.

Variable importance plot

Afternoon lab — Task 4.3

Tip

Variable importance (VIP) — VIP on a fitted tree-based model (e.g. random forest); top OTUs, not causal claims.

Lab exercises · Solution · Download Rmd

Permutation importance (pointer)

  • Permutation importance: shuffle one column, watch metric drop - model-agnostic; see the VIP plot above and the SHAP notebook.

SHAP - what it means (game theory intuition)

  • One penguin, one prediction: the model outputs P(male) (or log-odds). SHAP asks: how much did each measurement move this bird away from a baseline?
  • Players and coalitions: treat each feature as a player. A coalition is any subset of features with the others averaged over the data. The model’s output on that coalition is the coalition’s payout.
  • Shapley value: for each feature, average its marginal contribution over all orders in which features can join the coalition. That fair share is the SHAP value (_j) for feature (j).
  • Additivity: baseline (_0) plus all feature contributions sums to the model output on the link scale (for this forest, P(male)):

\phi_0 + \sum_j \phi_{i,j} \approx \hat{p}_i(\text{male})

  • VIP vs SHAP: VIP ranks predictors globally (how often splits use them). SHAP explains one row (why this penguin scored more male or female).

SHAP task - sex without species

  • Outcome: sex (female / male). species is omitted on purpose so the task is not trivial (see sex notebook).
  • Model: same recipe + random forest (ranger) pattern as Tuesday; we fit on all complete rows here so SHAP curves are stable in class (in production, fit SHAP on training data only).
  • Packages: shapviz for plots; kernelshap to compute values on the fitted ranger forest.

Compute SHAP in R (workflow + bake)

pg_sex <- prep_penguins_sex()

rec_sex <- recipe(sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
                    body_mass_g + island + year, data = pg_sex) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

rf_sex_spec <- rand_forest(trees = 300, mtry = 3, min_n = 2) |>
  set_engine("ranger", probability = TRUE) |>
  set_mode("classification")

wf_sex <- workflow() |>
  add_recipe(rec_sex) |>
  add_model(rf_sex_spec)

fit_sex <- fit(wf_sex, pg_sex)

bake() — why not raw penguins?

The forest was not trained on the original table. Inside fit(), the recipe ran first: drop empty columns, dummy-code island, z-score bill measurements, and so on. SHAP must perturb those columns — the ones ranger actually split on.

bake() — three steps (words)

  • Extract recipeextract_recipe(fit_sex, estimated = TRUE) pulls out the recipe with all learned numbers (means, SDs, dummy levels) stored at fit time.
  • prep() — finalises the preprocessing plan on pg_sex; does not refit the forest (same idea as Tuesday).
  • bake() — applies that plan: raw rows in → X_model out (numeric predictors only, same encoding as training).

bake() — code

rec_prepped <- prep(extract_recipe(fit_sex, estimated = TRUE), training = pg_sex)
X_model <- bake(rec_prepped, new_data = pg_sex, all_predictors())

head(X_model, 3)

bake() — why SHAP cares

kernelshap() calls the model many times with partial rows of numbers. If you pass raw bill_length_mm while the forest expects scaled bills plus island_Torgersen, predictions — and SHAP — are wrong. X_model is the spreadsheet the trees were trained on.

KernelSHAP — the problem

SHAP needs: “given these feature values, what is P(male)?” The forest sits inside a workflow; we extract the bare ranger object and define prediction ourselves.

KernelSHAP — code

rf_engine <- extract_fit_parsnip(fit_sex)$fit

pred_fun <- function(object, X_new) {
  as.numeric(predict(object, data = X_new)$predictions[, "male"])
}

set.seed(11)
X_shap <- dplyr::slice_sample(X_model, n = 80)
bg_X <- dplyr::slice_sample(X_model, n = 35)

ks <- kernelshap(rf_engine, X = X_shap, pred_fun = pred_fun, bg_X = bg_X)
shp_sex <- shapviz(ks, X_pred = X_shap)

KernelSHAP — rf_engine and pred_fun

  • rf_engine — unwrap the ranger fit from tidymodels; SHAP talks to this object.
  • pred_fun — bridge from baked matrix → P(male); SHAP does not know about recipes or factor levels.

KernelSHAP — X_shap and bg_X

  • X_shap — ~80 random rows of X_model to explain (SHAP is expensive).
  • bg_X — ~35 background rows: when a feature is left out of a coalition, values are drawn from here (“typical” birds, not zeros).

KernelSHAP — kernelshap() and shapviz()

  • kernelshap(...) — approximate Shapley values for each row in X_shap, vs baseline bg_X.
  • shapviz(...) — format for beeswarm / waterfall plots; X_pred supplies feature values for colour.

Memory hook: bake() → spreadsheet the forest saw; X_shap → birds we explain; bg_X → reference flock; pred_fun → score that spreadsheet.

Afternoon lab — Task 4.4

Tip

SHAP values — kernel SHAP on a top-10 VIP forest only; beeswarm for a dozen microbiome samples.

Lab exercises · Solution · Download Rmd

SHAP beeswarm - global pattern

  • y-axis: features; x-axis: SHAP contribution toward P(male).
  • Color: feature value (high vs low). Wide spread = this predictor matters for many birds in this model.

SHAP waterfall - two local explanations

Male example (one bird labeled male in the SHAP sample):

Female example (contrasting profile):

Reading SHAP without over-claiming

Reasonable claims

  • For this penguin, higher body mass (in the model, holding the coalition story fixed) pushed P(male) up or down.
  • Globally, bill depth and mass show the widest SHAP spread in this fitted RF.
  • SHAP describes the trained model, not a randomized experiment in the wild.

Avoid

  • “Bill length causes males” - observational data; morphometrics are correlated (data card).
  • “We should intervene on flipper length” - policy from association.
  • “SHAP proves fairness by island” - needs explicit subgroup metrics, not one plot.

Humility prompt (assessment): What would you not claim from this SHAP beeswarm?

Full write-up: SHAP notebook

Part G - Wrap-up

Afternoon lab — Task 4.5

Tip

Wrap up honestly — metrics comparison figure + one sentence you would not claim from VIP/SHAP on this dataset.

Lab exercises · Solution · Download Rmd

Nested resampling (when tuning gets serious)

  • Nested CV: an outer cross-validation loop estimates performance; an inner loop tunes hyperparameters inside each outer training fold.
  • Avoid tuning and scoring on the same held-out rows (optimistic bias).

End-of-day checklist

  • Name three parsnip model types and a valid engine for each
  • Why must step_pca() and step_impute_*() live inside the recipe?
  • When might step_upsample() not improve CV scores?
  • Which metric would you report for a rare-event screening study?
  • One reason VIP on correlated bills is not a causal claim
  • One sentence you would not say about a SHAP plot on penguin sex