Curriculum Outline

Course week (Days 1–4)

Teaching decks on this site: Day 1 (Monday), Day 2 (Tuesday), and Day 4 (Thursday). Day 3 (Wednesday) is a mini symposium and social event — no lecture deck here.

This outline links to afternoon lab exercises (numbered tasks and one solution notebook per teaching day) while weaving a few ideas through the week: two parallel examples (simulated genes where we know the truth + Palmer Penguins as one real table), what a wrong prediction costs (thresholds, prevalence), calibration, “correlation is not causation” when reading importance plots, short data and model summaries, reproducibility, fairness-style subgroup checks, and assignments that reward honest wording (especially around SHAP).

Integrated design principles (built into the week)

Two parallel data tracks — Synthetic genes in the original linear / logistic notebooks: we scripted the biology so we know which coefficients should matter. Palmer Penguins (data card, catalog, dedicated notebooks, Tuesday’s full modeling walk-through in R using tidymodels; reading companion for Chapters 4, 7, and 8): one small real table reused Mon→Thu for plots, regression, classification, and workflows.
Decision costs before a long list of metrics — introduce thresholds, prevalence, and misclassification costs early; return to them on Thursday so ROC / PR / F1 choices feel motivated, not arbitrary.
Association ≠ intervention — embed a short “causal guardrails” beat when introducing importance and SHAP so plots are not read as policy levers by default.
Data cards + model cards — start a one-page data card Monday (where the numbers came from, leakage risks, subgroup fields); finish a model card Thursday (intended use, metrics, limits, monitoring).
Reproducibility spine — Tuesday includes a concrete folder layout + “one command rebuilds figures” goal (quarto project; optional renv / conda lockfile for a reproducibility homework sprint).
Fairness as a metric choice — Thursday explicitly compares overall vs subgroup metrics on Palmer Penguins (e.g. species or island as a subgroup lens) when a sensitive attribute exists (or simulate site/batch as a stand-in).
Assessment that teaches humility — include a short oral or written prompt: “What would you not claim from this SHAP summary?” graded with a published mini-rubric (correctness, diagnostics, communication).
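
The "decision costs" principle above can be made concrete in a few lines of base R. This is a minimal sketch, not course material: the prevalence, the simulated score distributions, and the cost values (a false negative assumed five times as costly as a false positive) are all assumptions chosen for illustration.

```r
# Illustrative sketch: choose a classification threshold by expected cost.
# Every number here (prevalence, costs, score distributions) is an assumption.
set.seed(2024)

n          <- 5000
prevalence <- 0.10                      # assumed share of positive cases
truth      <- rbinom(n, 1, prevalence)  # 1 = positive, 0 = negative

# Simulated predicted probabilities: positives tend to score higher.
score <- ifelse(truth == 1, rbeta(n, 4, 2), rbeta(n, 2, 4))

cost_fn <- 5   # assumed cost of a missed positive (false negative)
cost_fp <- 1   # assumed cost of a false alarm (false positive)

# Average cost per case when predicting "positive" at or above `threshold`.
expected_cost <- function(threshold) {
  pred <- as.integer(score >= threshold)
  fn   <- sum(pred == 0 & truth == 1)
  fp   <- sum(pred == 1 & truth == 0)
  (cost_fn * fn + cost_fp * fp) / n
}

thresholds <- seq(0.05, 0.95, by = 0.05)
costs      <- vapply(thresholds, expected_cost, numeric(1))

# Threshold with the lowest average cost on this simulated sample.
best_threshold <- thresholds[which.min(costs)]

print(data.frame(threshold = thresholds, cost = round(costs, 3)))
print(best_threshold)
```

Changing `prevalence`, `cost_fn`, or `cost_fp` and rerunning shows how the cost-minimizing threshold moves, which is the motivation for returning to ROC / PR / F1 choices on Thursday.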

Day 1 (Monday): Foundations and Regularization

Learning goals

Understand the supervised learning setup for regression and classification on genes (simulation) and Palmer Penguins (real)
Use linear and logistic regression to frame overfitting vs underfitting
Learn why regularization is needed and when to use ridge vs lasso
Draft a data card (sources, leakage, subgroups) and connect it to train/validation design

Sequence

Framing predictive modeling: data split, validation, and generalization (tie each choice to the data card)
Linear regression as baseline and bias–variance discussion (synthetic gene–trait story in notebooks; penguins for real EDA in parallel)
Logistic regression for classification intuition (same patient/scientific framing as Thursday metrics)
Overfitting/underfitting diagnostics (learning curves; train vs validation) on a held-out split (genes or penguins)
Regularization: ridge and lasso; cross-validation (LOO and k-fold figures) to choose the penalty λ; shrinkage paths; preview elastic net (Module 02 + Day 1 CV slides)
Calibration intuition (light) — reliability of predicted probabilities even within linear models; sets up Thursday without heavy math
Transition to algorithmic methods + assign reproducibility mini-task (directory + render script)
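
The regularization step in the sequence above can be sketched on simulated genes where the true coefficients are known. This sketch assumes the glmnet package is installed and calls it directly; the course notebooks may wrap the same idea differently. The sample size, number of genes, and effect sizes are illustrative assumptions.

```r
# Sketch: ridge vs lasso on simulated "genes" with known true coefficients.
# Assumes glmnet is installed; sizes and effect values are illustrative.
library(glmnet)

set.seed(1)
n <- 200   # samples
p <- 50    # genes

X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene_", seq_len(p))))

# Only the first three genes truly affect the trait.
beta_true <- c(2, -1.5, 1, rep(0, p - 3))
y <- as.numeric(X %*% beta_true + rnorm(n))

# 10-fold cross-validation over the penalty (lambda) path.
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)  # alpha = 0: ridge
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # alpha = 1: lasso

# Coefficients at the lambda with the lowest CV error (intercept dropped).
ridge_coef <- as.matrix(coef(cv_ridge, s = "lambda.min"))[-1, 1]
lasso_coef <- as.matrix(coef(cv_lasso, s = "lambda.min"))[-1, 1]

# Ridge shrinks coefficients but keeps every gene in the model.
print(sum(ridge_coef != 0))

# Lasso can set coefficients to exactly zero; list the genes it retains.
print(names(lasso_coef)[lasso_coef != 0])

# Shrinkage path for the lasso fit (coefficients against log lambda).
plot(cv_lasso$glmnet.fit, xvar = "lambda")
```

Setting `nfolds = n` in `cv.glmnet()` gives leave-one-out cross-validation, and an `alpha` strictly between 0 and 1 gives the elastic net previewed at the end of the day.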

Day 2 (Tuesday): Trees, rpart, and the tidymodels Pipeline

Learning goals

Compare mathematical and algorithmic modeling mindsets (two cultures)
Build intuition for decision trees (rpart) on gene simulations and Palmer Penguins
Master the recipe → spec → workflow → tune → fit sequence with decision_tree() on the rpart engine, scored by cross-validated accuracy
Establish reproducible project layout (microbiome lab)

Sequence

Breiman two cultures; gene rpart (classification + regression)
Incremental pipeline walkthrough on penguins (day02-tidymodels-walkthrough.qmd — also Module 04)
tune_grid for tree_depth and min_n; tuning plot, tree plot, light VIP
Microbiome lab: same grammar on OTU data (grouped CV)
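
The recipe → spec → workflow → tune → fit grammar from the sequence above might look like the following on Palmer Penguins. This is a hedged sketch, not the contents of day02-tidymodels-walkthrough.qmd: the choice of predictors, the 80/20 split, the five folds, and the three-level grid are assumptions, and the `rec_base` defined here is only a stand-in for the course's own recipe of that name. It assumes the tidymodels, rpart, and palmerpenguins packages are installed.

```r
# Sketch of recipe -> spec -> workflow -> tune -> fit on Palmer Penguins.
# Predictors, split proportion, fold count, and grid size are assumptions.
library(tidymodels)

set.seed(123)
penguins_df <- palmerpenguins::penguins |>
  select(species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g) |>
  drop_na()

# Hold out a test set, then build CV folds on the training portion only.
split <- initial_split(penguins_df, prop = 0.8, strata = species)
train <- training(split)
folds <- vfold_cv(train, v = 5, strata = species)

# Recipe: outcome and predictors (stand-in for the course's rec_base).
rec_base <- recipe(species ~ ., data = train)

# Spec: a decision tree on the rpart engine with two tunable parameters.
tree_spec <- decision_tree(tree_depth = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

# Workflow: bundle recipe and spec.
tree_wf <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec)

# Tune: regular grid over tree_depth and min_n, scored by CV accuracy.
tree_grid <- grid_regular(tree_depth(), min_n(), levels = 3)
tree_res  <- tune_grid(tree_wf, resamples = folds, grid = tree_grid,
                       metrics = metric_set(accuracy))

# Fit: finalize with the best parameters, refit on train, score on test.
best_params <- select_best(tree_res, metric = "accuracy")
final_fit   <- tree_wf |>
  finalize_workflow(best_params) |>
  last_fit(split)

print(collect_metrics(final_fit))
```

Calling `autoplot(tree_res)` produces a tuning plot of accuracy against the grid. For the grouped CV in the microbiome lab, rsample's `group_vfold_cv()` could replace `vfold_cv()`; the grouping column would be whatever identifies subjects or batches in the OTU data, which this outline does not specify.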

Day 4 (Thursday): Models, Preprocessing, Metrics, and Importance

Deck: day-04-thursday.html (includes: day04-*.qmd)

Learning goals

Swap parsnip specs (forest, boosting, MLP) on Tuesday’s rec_base
PCA, imputation, and resampling as recipe steps; accuracy remains the working metric until class imbalance is introduced
Metrics beyond accuracy, confusion matrices, ROC/PR on imbalanced data
VIP and non-causal interpretation (Module 06)
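
The first two goals above (swapping parsnip specs and adding recipe steps) can be sketched as follows. This is an illustration under stated assumptions rather than the Thursday deck's code: the `rec_base` here is a stand-in for Tuesday's recipe, the median imputation and two principal components are arbitrary choices, and the ranger package is assumed to be installed for the random forest engine.

```r
# Sketch: extend a base recipe with steps, then swap model specs on it.
# rec_base is a stand-in; imputation method and num_comp are assumptions.
library(tidymodels)

set.seed(123)
# Keep rows with missing measurements so the imputation step has work to do.
penguins_df <- palmerpenguins::penguins |>
  select(species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g)

folds <- vfold_cv(penguins_df, v = 5, strata = species)

rec_base <- recipe(species ~ ., data = penguins_df)

# Add preprocessing as recipe steps: impute, scale, then project with PCA.
rec_pca <- rec_base |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 2)

# Two specs that share the same recipe; only the model changes.
specs <- list(
  tree   = decision_tree() |>
    set_engine("rpart") |>
    set_mode("classification"),
  forest = rand_forest(trees = 500) |>
    set_engine("ranger") |>
    set_mode("classification")
)

# Resample each spec with the same folds and collect CV accuracy.
fit_one <- function(spec) {
  workflow() |>
    add_recipe(rec_pca) |>
    add_model(spec) |>
    fit_resamples(resamples = folds, metrics = metric_set(accuracy)) |>
    collect_metrics()
}

results <- bind_rows(lapply(specs, fit_one), .id = "model")
print(results)
```

Because the recipe and folds are shared, differences in the `mean` column reflect the model swap and not a change in preprocessing or resampling.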

Sequence (matches slides)

Part A: recap setup + stressed scenarios
Part B: model catalog; bagging/boosting/MLP; multi-model fit_resamples (accuracy)
Part C–D: step_pca(), step_impute_*() (accuracy)
Part E: imbalance, metrics toolbox, confusion matrices, compare upsampling
Part F: VIP on random forest; SHAP on penguin sex (no species) — game-theory intuition, beeswarm + waterfall, interpretation guardrails (Module 06)
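
Part E's point that accuracy misleads under imbalance can be sketched with yardstick on simulated predictions. The 5% prevalence, the score distributions, and the 0.5 cutoff are assumptions for illustration, not values from the slides; only the yardstick package is assumed to be installed.

```r
# Sketch: accuracy vs other metrics on an imbalanced, simulated problem.
# Prevalence, score distributions, and the 0.5 cutoff are assumptions.
library(yardstick)

set.seed(42)
n <- 2000
lv <- c("pos", "neg")   # yardstick treats the first level as the event

truth    <- factor(ifelse(runif(n) < 0.05, "pos", "neg"), levels = lv)
prob_pos <- ifelse(truth == "pos", rbeta(n, 4, 2), rbeta(n, 2, 4))

preds <- data.frame(
  truth       = truth,
  .pred_pos   = prob_pos,
  .pred_class = factor(ifelse(prob_pos >= 0.5, "pos", "neg"), levels = lv)
)

# Confusion matrix and a small toolbox of class metrics.
print(conf_mat(preds, truth = truth, estimate = .pred_class))

class_metrics <- metric_set(accuracy, sensitivity, specificity, f_meas)
print(class_metrics(preds, truth = truth, estimate = .pred_class))

# Threshold-free summaries computed from the predicted probability.
print(roc_auc(preds, truth, .pred_pos))
print(pr_auc(preds, truth, .pred_pos))

# Baseline that always predicts the majority class.
baseline <- preds
baseline$.pred_class <- factor(rep("neg", n), levels = lv)

baseline_acc  <- accuracy(baseline, truth, .pred_class)$.estimate
baseline_sens <- sensitivity(baseline, truth, .pred_class)$.estimate
print(c(accuracy = baseline_acc, sensitivity = baseline_sens))
```

The always-negative baseline scores high accuracy while detecting no positives, which is why the metrics toolbox and confusion matrices are introduced before comparing upsampling.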

Delivery Notes

Exercises are run separately from this hub but should reference Palmer Penguins (and optionally gene simulations) and reuse the data/model card templates.
Each teaching day has one dedicated revealjs deck; optional live R snippets or board work replace separate app demos.
The same tidymodels pipeline grammar is reused from Day 2 (Tuesday) onward; Day 4 (Thursday) swaps engines and adds recipe steps (PCA, impute, upsample), not project structure.
caret has been removed from the hub; splits, scaling, and metrics use tidymodels / yardstick in slides, modules, and notebooks.
Student .Rmd notebooks live under notebooks/: gene pair unchanged; penguin pair added; align exercises with the same penguin export where possible.

Optional extensions (only if time remains)

Deeper causal half-day (DAGs, backdoor adjustment) as a parallel track for quantitatively mature audiences.
Guest deployment expanded into a full lab with authentication and logging.
Versioned TA grading rubrics published to students on day one for every notebook hand-in.