Chapter 6: Scores that match the question

Accuracy is easy to explain, so it often becomes the default score. The problem is that defaults can hide bad behavior, especially when classes are imbalanced or the decision costs are asymmetric.

This chapter reframes evaluation around the question you are answering: who is harmed by false negatives, who is harmed by false positives, and what tradeoff is acceptable in context.


Metrics are decision tools

Use metrics as answers to concrete questions:

  • Accuracy: “How often am I right overall?”
  • Sensitivity/recall: “How often do I catch true positives?”
  • Specificity: “How often do I correctly dismiss negatives?”
  • Precision: “When I predict positive, how often am I right?”
  • ROC-AUC: “How well do I rank positives above negatives?”
  • PR-AUC: “How much precision do I keep as I raise recall, when positives are rare?”

No metric is “best” by itself. The best metric is the one aligned with the decision cost.
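
To make the questions above concrete, here is a minimal base-R sketch that computes the four count-based scores from a confusion matrix and ROC-AUC from a handful of scores. All counts and scores are made up for illustration. In practice you would get these numbers from yardstick rather than by hand.

```r
# Made-up confusion-matrix counts for a rare positive class.
tp <- 30; fn <- 20; fp <- 10; tn <- 940

metrics <- c(
  accuracy    = (tp + tn) / (tp + fn + fp + tn),
  sensitivity = tp / (tp + fn),  # recall: share of true positives caught
  specificity = tn / (tn + fp),  # share of negatives correctly dismissed
  precision   = tp / (tp + fp)   # share of positive predictions that are right
)
round(metrics, 3)
# accuracy 0.970, sensitivity 0.600, specificity 0.989, precision 0.750

# ROC-AUC as a ranking statement: the share of (positive, negative) pairs
# in which the positive row gets the higher score (ties count one half).
pos_scores <- c(0.9, 0.8, 0.4)        # made-up scores for true positives
neg_scores <- c(0.7, 0.3, 0.2, 0.1)   # made-up scores for true negatives
diffs <- outer(pos_scores, neg_scores, "-")
roc_auc <- mean((diffs > 0) + 0.5 * (diffs == 0))
roc_auc  # 11 of 12 pairs are ranked correctly
```

Accuracy is 0.97 here even though the model misses 40% of the positives, which is the gap the count-based questions are meant to expose.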


Why imbalance changes the story

On imbalanced data, a majority-class classifier can look impressive under accuracy and still fail the task. That is why Day 4 compares majority baselines, per-class recall, and macro-averaged summaries, which weight each class equally.
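
The effect is easy to reproduce with a small base-R sketch. The labels are made up (95 "no", 5 "yes"), and the baseline always predicts the more common class.

```r
# Made-up labels: 95 "no" and 5 "yes".
truth <- factor(c(rep("no", 95), rep("yes", 5)), levels = c("no", "yes"))

# Majority-class baseline: always predict the most common class.
majority <- names(which.max(table(truth)))
pred <- factor(rep(majority, length(truth)), levels = levels(truth))

accuracy <- mean(pred == truth)

# Per-class recall: of the rows that truly belong to each class,
# what share did the baseline label correctly?
recall_by_class <- sapply(levels(truth), function(cl) {
  mean(pred[truth == cl] == cl)
})

# Macro recall: the unweighted mean of the per-class recalls.
macro_recall <- mean(recall_by_class)

accuracy         # 0.95
recall_by_class  # no = 1, yes = 0
macro_recall     # 0.5
```

The baseline scores 95% accuracy without ever finding a "yes" row. Per-class recall and the macro average show that immediately.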

When you use upsampling, do it inside training folds only. Assessment folds stay untouched so reported performance is still honest.
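
A minimal base-R sketch of that rule follows, on made-up labels. The helper upsample_rows() is hypothetical and written only for this illustration. In a tidymodels workflow the same job is usually done by a recipe step (for example step_upsample() from the themis package), so treat this as a picture of the idea, not as the course's implementation.

```r
set.seed(42)

# Toy imbalanced outcome: 90 negatives, 10 positives (made-up data).
y <- factor(c(rep("neg", 90), rep("pos", 10)), levels = c("neg", "pos"))

# Stratified 5-fold assignment so every fold contains both classes.
fold_id <- ave(seq_along(y), y, FUN = function(i) rep_len(1:5, length(i)))

# Hypothetical helper: resample minority-class rows (with replacement)
# until every class has as many rows as the largest class.
upsample_rows <- function(rows, y) {
  by_class <- split(rows, y[rows], drop = TRUE)
  target <- max(lengths(by_class))
  unlist(lapply(by_class, function(r) {
    extra <- r[sample.int(length(r), target - length(r), replace = TRUE)]
    c(r, extra)
  }), use.names = FALSE)
}

fold_summary <- lapply(1:5, function(k) {
  train_rows  <- which(fold_id != k)
  assess_rows <- which(fold_id == k)
  train_up <- upsample_rows(train_rows, y)  # upsample training rows only
  list(
    train_counts  = table(y[train_up]),     # balanced after upsampling
    assess_counts = table(y[assess_rows]),  # original class balance
    leaked        = any(train_up %in% assess_rows)
  )
})

fold_summary[[1]]$train_counts
fold_summary[[1]]$assess_counts
```

Because resampling draws only from the training rows, no duplicated minority row can appear in the assessment fold, and the assessment fold keeps the class balance you will face in deployment.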

library(yardstick)

# After preds <- augment(fit, test_data), cross-tabulate truth and prediction:
# conf_mat(preds, truth = y, estimate = .pred_class)

VIP and interpretation discipline

Variable importance (VIP) summarizes which features a fitted model relied on most. It does not establish causal effects. Correlated features can share or swap importance, and importance can shift across model fits.

Treat importance plots as model behavior summaries, not intervention policies.
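
The instability with correlated features can be seen in a small simulation. This sketch swaps in permutation importance on a linear model, not the VIP method used in the lecture, and every variable in it is simulated: x2 is nearly a copy of x1, and only x1 and x3 drive the outcome.

```r
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1
x3 <- rnorm(n)
dat <- data.frame(y = 2 * x1 + x3 + rnorm(n), x1, x2, x3)

# Permutation importance: how much does mean squared error grow
# when one column is shuffled and the others are left alone?
perm_importance <- function(fit, data) {
  base_mse <- mean((data$y - predict(fit, data))^2)
  sapply(c("x1", "x2", "x3"), function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])
    mean((data$y - predict(fit, shuffled))^2) - base_mse
  })
}

# Refit on five bootstrap resamples and score each fit on the full data.
imp <- t(sapply(1:5, function(b) {
  boot <- dat[sample.int(n, replace = TRUE), ]
  perm_importance(lm(y ~ x1 + x2 + x3, data = boot), dat)
}))
round(imp, 2)  # one row per refit, one column per feature
```

Compare the x1 and x2 columns across rows: how the credit is split between the two near-duplicates depends on the resample, even though the data-generating process never changed.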


SHAP: Shapley values on penguins (sex)

SHAP gives local explanations: for one row, it decomposes the model output into per-feature contributions relative to a baseline, typically the average prediction over a reference sample. In this course, we apply it to penguin sex classification while omitting species.
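
To see what "contributions relative to a baseline" means, the sketch below computes exact Shapley values by enumerating every feature coalition for a toy prediction function. The function predict_toy(), the background rows, and the explained row are all made up, and the output is a plain numeric score and not a class probability. SHAP packages use approximations or model-specific shortcuts because this enumeration grows exponentially with the number of features.

```r
# Toy "fitted model" (hypothetical): stands in for predict() on a real fit.
predict_toy <- function(d) {
  0.5 * d$bill_depth + 0.2 * d$body_mass + 0.1 * d$bill_depth * d$flipper
}

# Made-up background rows and one row to explain.
background <- data.frame(
  bill_depth = c(15, 17, 19, 18),
  body_mass  = c(3.5, 4.0, 5.0, 4.5),
  flipper    = c(190, 200, 215, 210)
)
x <- data.frame(bill_depth = 20, body_mass = 5.5, flipper = 220)

features <- names(background)
p <- length(features)

# Value of a coalition: the average prediction when the coalition's features
# are fixed to the explained row and the rest come from the background rows.
coalition_value <- function(members) {
  d <- background
  for (f in members) d[[f]] <- x[[f]]
  mean(predict_toy(d))
}

# Shapley value of feature j: weighted average of its marginal contribution
# over all coalitions of the other features.
shapley <- sapply(features, function(j) {
  others <- setdiff(features, j)
  total <- 0
  for (size in 0:length(others)) {
    subsets <- if (size == 0) list(character(0)) else combn(others, size, simplify = FALSE)
    weight <- factorial(size) * factorial(p - size - 1) / factorial(p)
    for (s in subsets) {
      total <- total + weight * (coalition_value(c(s, j)) - coalition_value(s))
    }
  }
  total
})

baseline <- coalition_value(character(0))  # average prediction on the background

shapley
# The contributions add up to the prediction minus the baseline:
all.equal(sum(shapley), predict_toy(x) - baseline)  # TRUE
```

Nothing here refers to how penguins work. The values depend only on the prediction function and the background rows, which is why they describe the model and not the biology.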

The two key statements students should retain:

  1. SHAP explains this fitted model, not biological causality.
  2. SHAP values are useful communication aids, not proof that changing a feature would change the outcome.
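
library(tidymodels)

# prep_penguins_sex() is a helper from the course materials (not shown here).
pg_sex <- prep_penguins_sex()

# Species is deliberately left out of the predictors.
rec_sex <- recipe(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = pg_sex
) |>
  step_zv(all_predictors()) |>              # drop zero-variance columns
  step_dummy(all_nominal_predictors()) |>   # nominal predictors -> indicator columns
  step_normalize(all_numeric_predictors())  # center and scale numeric columns

# Random forest via ranger, set up to return class probabilities.
rf_sex_spec <- rand_forest(trees = 300, mtry = 3, min_n = 2) |>
  set_engine("ranger", probability = TRUE) |>
  set_mode("classification")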

When presenting SHAP in reports, explicitly include one “do not claim” sentence. Example: “This plot does not show that bill depth causes sex; it shows how this trained model used bill depth for prediction.”

Tip

Strong evaluation language is as important as strong metrics. Say what the model does, how you measured it, and what you cannot infer from those numbers or explanations.


Lecture: Day 4 Part E/F
Notebook: penguins-sex-classification.Rmd
Next: Chapter 7 — Choosing what to optimize