# A tibble: 3 × 2
y3 n
<fct> <int>
1 Adelie 146
2 Gentoo 119
3 Chinstrap 15
[1] 22
[1] 8
parsnip specs (bagging, boosting, neural nets) on the same recipe + workflow from Tuesdayfit_resamples() using accuracy until imbalance| Term | Meaning |
|---|---|
| CV | Cross-validation - resample data into folds; train on some rows, evaluate on held-out rows |
| PCA | Principal component analysis - linear combinations of correlated predictors |
| NA | Missing value (not recorded) |
| ROC | Receiver operating characteristic curve - tradeoff of true vs false positives across thresholds |
| PR | Precision-recall curve - precision vs recall across thresholds |
| AUC | Area under the curve - one-number summary of a threshold curve (e.g. ROC-AUC, PR-AUC) |
| VIP | Variable importance - which predictors the model used most in splits |
| MLP | Multilayer perceptron - small feed-forward neural network |
| SHAP | SHapley additive explanations - local “why this prediction?” scores |
| RF | Random forest - bagged ensemble of decision trees |
| GBM | Gradient boosting machine - sequential trees that correct prior errors |
| XGBoost | Extreme gradient boosting - popular GBM engine in parsnip |
| SVM | Support vector machine - margin-based classifier (svm_rbf() uses a radial kernel) |
| TP / FP / FN / TN | True/false positive/negative - counts from a confusion matrix (Part E) |
Same peng_bal table and rec_base recipe as Tuesday - today we swap models and add recipe steps.
rec_base - step_zv (drop zero-variance columns) → step_dummy (one-hot categoricals) → step_normalize (scale numerics) on peng_balworkflow() + decision_tree / rpart tuned on Tuesdayvfold_cv(..., strata = y) - 5-fold cross-validation (CV) with both species in each foldmetric_set(accuracy) for Parts B-D today (more metrics in Part E)We deliberately stress the penguins in three ways:
# A tibble: 3 × 2
y3 n
<fct> <int>
1 Adelie 146
2 Gentoo 119
3 Chinstrap 15
[1] 22
[1] 8
| Object | Stress | Recipe extension (Part C) |
|---|---|---|
peng_imb3 |
Three species; downsampled Chinstrap (~15 Chinstrap vs abundant Adelie & Gentoo) | step_upsample(y3) - Part E metrics |
peng_na |
Missing bill length and some sex |
step_impute_*() |
peng_pca |
Extra size measures (correlated) | step_pca() |
Rule: keep add_recipe(rec_base) and folds - change add_model(spec) only.
Same rec_base from Tuesday - change spec (and set_engine()); keep workflow() + folds.
| Model spec | Typical tuned knobs | Common engines |
|---|---|---|
decision_tree() |
tree_depth, min_n, cost_complexity |
rpart, C5.0 |
rand_forest() (RF) |
mtry, min_n, trees |
ranger, randomForest |
boost_tree() (GBM) |
trees, tree_depth, learn_rate |
xgboost, lightgbm, C5.0 |
logistic_reg() |
penalty |
glm, glmnet |
linear_reg() |
penalty |
lm, glmnet |
mlp() (MLP) |
hidden_units, penalty, epochs |
nnet, keras |
svm_rbf() (SVM) |
cost, rbf_sigma |
kernlab |
nearest_neighbor() |
neighbors |
kknn |
mtry: number of predictors considered at each split (RF only).tune() marks hyperparameters for tune_grid().Every spec needs: model type → set_engine() → set_mode("classification") (or regression).
specPart A gave wf_tree (rec_base + tree_spec). Below: random forest (bagging) and XGBoost (boosting) on the same folds.
Tip
Shared recipe — Day 2-style recipe on microbiome; metadata in id role; ready to swap the spec.
rand_forest(mtry = ..., trees = ..., min_n = ...) |> set_engine("ranger")boost_tree(trees = ..., tree_depth = ..., learn_rate = ...) |> set_engine("xgboost")learn_rate: shrink each tree’s contribution (smaller → more trees needed, often smoother).rf_spec <- rand_forest(mtry = 3, trees = 400, min_n = 2) |>
set_engine("ranger", importance = "impurity") |>
set_mode("classification")
xgb_spec <- boost_tree(trees = 150, tree_depth = 3, learn_rate = 0.05) |>
set_engine("xgboost") |>
set_mode("classification")
list(tree = tree_spec, rf = rf_spec, xgb = xgb_spec)$tree
Decision Tree Model Specification (classification)
Main Arguments:
tree_depth = 4
min_n = 10
Computational engine: rpart
$rf
Random Forest Model Specification (classification)
Main Arguments:
mtry = 3
trees = 400
min_n = 2
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
$xgb
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 150
tree_depth = 3
learn_rate = 0.05
Computational engine: xgboost
You split your data into training and test sets, then use cross-validation on the training set to tune hyperparameters for each model and compute their average performance. For each model type, you select the best-performing hyperparameter combination based on CV results, and then compare these tuned models to choose the best overall model. Finally, you refit that selected model on the full training data and evaluate it once on the untouched test set to get an unbiased estimate of real-world performance.
| Stage | What happens | tidymodels tools (examples) |
|---|---|---|
| Hold out test data | Test rows stay unused until the very end | initial_split(), training(), testing() |
| Tune on training only | CV estimates average performance while searching hyperparameters | vfold_cv(), tune_grid(), select_best() |
| Compare learners | Same folds + recipe; pick the best model type after tuning each | fit_resamples(), collect_metrics() |
| Final report | Refit on all training rows; score once on test | last_fit() |
fit_resamples() (next) averages CV metrics for a workflow - quick comparisons when specs are fixed or already tuned.last_fit() with a real train/test split.set.seed(7)
rs_rf <- fit_resamples(wf_rf, folds, metrics = metrics_acc)
rs_xgb <- fit_resamples(wf_xgb, folds, metrics = metrics_acc)
cmp_ens <- bind_rows(
collect_metrics(rs_tree) |> mutate(model = "Single tree"),
collect_metrics(rs_rf) |> mutate(model = "Random forest"),
collect_metrics(rs_xgb) |> mutate(model = "XGBoost")
) |>
filter(.metric == "accuracy")# A tibble: 3 × 3
model mean std_err
<chr> <dbl> <dbl>
1 Random forest 0.996 0.00377
2 XGBoost 0.993 0.00458
3 Single tree 0.992 0.00462
Tip
Model shoot-out — fit RF, XGBoost, MLP with group_vfold_cv(Individual); compare ROC AUC (or accuracy).
Lab exercises · Solution · Download Rmd
Fixed settings here for speed - in projects, wrap each spec in tune_grid() like Day 2 (Tuesday) pipeline.
We fix tree_depth = 2 and learn_rate = 0.3, then increase trees. Each round adds small trees that correct leftover mistakes - the Gentoo probability surface becomes more nonlinear. We show rounds 1, 3, and 6 (the pattern is similar in between).
One shallow tree: rough split - mostly underfits the Gentoo cloud.
hidden_units sets its width.epochs: how many full passes through the training data.mlp(hidden_units = ..., penalty = ..., epochs = ...) |> set_engine("nnet") - weight decay via penalty.rec_base via step_normalize).# A tibble: 3 × 4
model .metric mean std_err
<chr> <chr> <dbl> <dbl>
1 Random forest accuracy 0.996 0.00377
2 XGBoost accuracy 0.993 0.00458
3 MLP (nnet) accuracy 1 0
Three-way comparison: see the cross-validated accuracy bar chart above.
Same workflow, new questions:
PCA combines correlated numeric predictors into orthogonal principal components - fit step_pca() on training folds only, same as step_normalize().
For PCA we add back flipper length and body mass - bills and size track together on penguins.
step_pca() (code)Steps: zero-variance removal → dummies → scale numerics → retain 2 principal components.
After step_normalize(), each component is a weighted sum of the scaled columns:
\text{PC1} = w_{11}\,\text{bill\_length} + w_{12}\,\text{bill\_depth} + \cdots
rotation matrix from prcomp() / step_pca()).Scores tell you where a penguin is; loadings tell you which measurements built that axis.
Each bar = one original column’s weight on that component (predictors were scaled before PCA).
| measurement | PC1 | PC2 | |
|---|---|---|---|
| bill_length_mm | bill_length_mm | 0.52 | -0.03 |
| bill_depth_mm | bill_depth_mm | -0.40 | 0.03 |
| flipper_length_mm | flipper_length_mm | 0.54 | 0.04 |
| body_mass_g | body_mass_g | 0.52 | -0.09 |
| year | year | 0.05 | 0.99 |
# A tibble: 2 × 3
model mean std_err
<chr> <dbl> <dbl>
1 XGBoost + PCA (2 comp.) 0.996 0.00377
2 XGBoost, no PCA 0.989 0.00459
workflow() + resampling, never on the full table before splitting.Same workflow, new recipe stress: we now handle missing values with imputation steps inside cross-validation.
peng_na)Add step_impute_*() before step_zv() / dummies / normalize - never drop_na() on the full table before splitting.
We set ~15% of bill_length_mm (and a few sex) to NA (missing values - “not available”) to practice imputation - not a real sensor failure pattern.
drop_na()# A tibble: 1 × 2
rows_before rows_after_drop_na
<int> <int>
1 265 238
Dropping loses penguins and can bias who remains - never drop on the full table before splitting.
step_impute_median() for numeric bill_length_mm; step_impute_mode() for categorical sex.step_zv → step_dummy → step_normalize chain from Part A.Rows with missing bill length sit in the left margin (pink band) - we do not know their x-position yet.
Open circles = true bill length (simulation only). Gold = median imputed value for the same penguin. Orange arrows connect true → imputed. Gray points are other birds (observed bill length).
xgb_na_spec <- boost_tree(trees = 80, tree_depth = 3, learn_rate = 0.08) |>
set_engine("xgboost") |>
set_mode("classification")
wf_na <- workflow() |>
add_recipe(rec_impute) |>
add_model(xgb_na_spec)
folds_na <- vfold_cv(peng_na, v = 5, strata = y)
set.seed(14)
rs_na <- fit_resamples(wf_na, folds_na, metrics = metrics_acc)
collect_metrics(rs_na) |>
filter(.metric == "accuracy") |>
select(mean, std_err)# A tibble: 1 × 2
mean std_err
<dbl> <dbl>
1 0.996 0.00370
The workflow accepts NA in raw rows; each fold learns imputation on analysis data only.
Continue: Day 2 (Tuesday) pipeline.
We move from Adelie vs Gentoo to three species, with a deliberately rare Chinstrap class. Accuracy alone is no longer enough — we add macro recall/F1 and per-class recall.
peng_imb3)We now use three-species classification (Adelie, Gentoo, Chinstrap). For a clean imbalance lesson, we keep only 15 Chinstrap rows (deterministic selection), while Adelie and Gentoo stay abundant.
# A tibble: 1 × 2
strategy accuracy_on_peng_imb3
<chr> <dbl>
1 Always predict Adelie 0.521
peng_imb3)mc_tree_spec <- decision_tree(tree_depth = 4, min_n = 20) |>
set_engine("rpart") |>
set_mode("classification")
rec_imb3 <- recipe(y3 ~ ., data = peng_imb3) |>
step_zv(all_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_numeric_predictors())
rec_imb3_up <- recipe(y3 ~ ., data = peng_imb3) |>
step_upsample(y3) |>
step_zv(all_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_numeric_predictors())
wf_imb3 <- workflow() |> add_recipe(rec_imb3) |> add_model(mc_tree_spec)
wf_imb3_up <- workflow() |> add_recipe(rec_imb3_up) |> add_model(mc_tree_spec)| .metric | No upsample | Upsample |
|---|---|---|
| accuracy | 0.968 (0.028) | 0.971 (0.012) |
| f1_macro | 0.945 (0.052) | 0.940 (0.018) |
| recall_macro | 0.945 (0.052) | 0.970 (0.019) |
Truth
Prediction Adelie Gentoo Chinstrap
Adelie 30 0 3
Gentoo 0 23 0
Chinstrap 0 1 0
After choosing metrics for imbalance, we ask a different question: which predictors drove the fitted model? (VIP and SHAP in the next section.)
ranger).Tip
Variable importance (VIP) — VIP on a fitted tree-based model (e.g. random forest); top OTUs, not causal claims.
\phi_0 + \sum_j \phi_{i,j} \approx \hat{p}_i(\text{male})
speciessex (female / male). species is omitted on purpose so the task is not trivial (see sex notebook).recipe + random forest (ranger) pattern as Tuesday; we fit on all complete rows here so SHAP curves are stable in class (in production, fit SHAP on training data only).shapviz for plots; kernelshap to compute values on the fitted ranger forest.bake)pg_sex <- prep_penguins_sex()
rec_sex <- recipe(sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm +
body_mass_g + island + year, data = pg_sex) |>
step_zv(all_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_numeric_predictors())
rf_sex_spec <- rand_forest(trees = 300, mtry = 3, min_n = 2) |>
set_engine("ranger", probability = TRUE) |>
set_mode("classification")
wf_sex <- workflow() |>
add_recipe(rec_sex) |>
add_model(rf_sex_spec)
fit_sex <- fit(wf_sex, pg_sex)bake() — why not raw penguins?The forest was not trained on the original table. Inside fit(), the recipe ran first: drop empty columns, dummy-code island, z-score bill measurements, and so on. SHAP must perturb those columns — the ones ranger actually split on.
bake() — three steps (words)extract_recipe(fit_sex, estimated = TRUE) pulls out the recipe with all learned numbers (means, SDs, dummy levels) stored at fit time.prep() — finalises the preprocessing plan on pg_sex; does not refit the forest (same idea as Tuesday).bake() — applies that plan: raw rows in → X_model out (numeric predictors only, same encoding as training).bake() — codebake() — why SHAP careskernelshap() calls the model many times with partial rows of numbers. If you pass raw bill_length_mm while the forest expects scaled bills plus island_Torgersen, predictions — and SHAP — are wrong. X_model is the spreadsheet the trees were trained on.
SHAP needs: “given these feature values, what is P(male)?” The forest sits inside a workflow; we extract the bare ranger object and define prediction ourselves.
rf_engine <- extract_fit_parsnip(fit_sex)$fit
pred_fun <- function(object, X_new) {
as.numeric(predict(object, data = X_new)$predictions[, "male"])
}
set.seed(11)
X_shap <- dplyr::slice_sample(X_model, n = 80)
bg_X <- dplyr::slice_sample(X_model, n = 35)
ks <- kernelshap(rf_engine, X = X_shap, pred_fun = pred_fun, bg_X = bg_X)
shp_sex <- shapviz(ks, X_pred = X_shap)rf_engine and pred_funrf_engine — unwrap the ranger fit from tidymodels; SHAP talks to this object.pred_fun — bridge from baked matrix → P(male); SHAP does not know about recipes or factor levels.X_shap and bg_XX_shap — ~80 random rows of X_model to explain (SHAP is expensive).bg_X — ~35 background rows: when a feature is left out of a coalition, values are drawn from here (“typical” birds, not zeros).kernelshap() and shapviz()kernelshap(...) — approximate Shapley values for each row in X_shap, vs baseline bg_X.shapviz(...) — format for beeswarm / waterfall plots; X_pred supplies feature values for colour.Memory hook: bake() → spreadsheet the forest saw; X_shap → birds we explain; bg_X → reference flock; pred_fun → score that spreadsheet.
Tip
SHAP values — kernel SHAP on a top-10 VIP forest only; beeswarm for a dozen microbiome samples.
Male example (one bird labeled male in the SHAP sample):
Female example (contrasting profile):
Reasonable claims
Avoid
Humility prompt (assessment): What would you not claim from this SHAP beeswarm?
Full write-up: SHAP notebook
Tip
Wrap up honestly — metrics comparison figure + one sentence you would not claim from VIP/SHAP on this dataset.
parsnip model types and a valid engine for eachstep_pca() and step_impute_*() live inside the recipe?step_upsample() not improve CV scores?