Curriculum Outline
Course week (Days 1–4)
Teaching decks on this site: Day 1 (Monday), Day 2 (Tuesday), and Day 4 (Thursday). Day 3 (Wednesday) is a mini symposium and social event — no lecture deck here.
This outline links to afternoon lab exercises (numbered tasks and one solution notebook per teaching day) while weaving a few ideas through the week: two parallel examples (simulated genes where we know the truth + Palmer Penguins as one real table), what a wrong prediction costs (thresholds, prevalence), calibration, “correlation is not causation” when reading importance plots, short data and model summaries, reproducibility, fairness-style subgroup checks, and assignments that reward honest wording (especially around SHAP).
Integrated design principles (built into the week)
- Two parallel data tracks: synthetic genes in the original linear / logistic notebooks, where we scripted the biology so we know which coefficients should matter, and Palmer Penguins (data card, catalog, dedicated notebooks, Tuesday’s full modeling walk-through in R using `tidymodels`; reading companion for Chapters 4, 7, and 8), one small real table reused Mon→Thu for plots, regression, classification, and workflows.
- Decision costs before a long list of metrics: introduce thresholds, prevalence, and misclassification costs early; return to them on Thursday so ROC / PR / F1 choices feel motivated, not arbitrary.
- Association ≠ intervention: embed a short “causal guardrails” beat when introducing importance and SHAP so plots are not read as policy levers by default.
- Data cards + model cards: start a one-page data card Monday (where the numbers came from, leakage risks, subgroup fields); finish a model card Thursday (intended use, metrics, limits, monitoring).
- Reproducibility spine: Tuesday includes a concrete folder layout + “one command rebuilds figures” goal (`quarto` project; optional `renv` / `conda` lockfile for a reproducibility homework sprint).
- Fairness as a metric choice: Thursday explicitly compares overall vs subgroup metrics on Palmer Penguins (e.g. `species` or `island` as a subgroup lens) when a sensitive attribute exists (or simulate site/batch as a stand-in).
- Assessment that teaches humility: include a short oral or written prompt: “What would you not claim from this SHAP summary?” graded with a published mini-rubric (correctness, diagnostics, communication).
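The "decision costs" principle above can be made concrete with a small base-R sketch. Everything in it is invented for illustration: the two costs, the predicted probabilities, and the labels are not course data. The rule `cost_fp / (cost_fp + cost_fn)` assumes the predicted probabilities are well calibrated.

```r
# Hypothetical misclassification costs (assumptions, not course values)
cost_fp <- 1  # cost of one false positive
cost_fn <- 5  # cost of one false negative

# Toy predicted probabilities and true labels (1 = positive), made up for this sketch
p_hat <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.90)
truth <- c(0,    0,    1,    0,    0,    1,    0,    1)

# Total cost of classifying as positive whenever p_hat >= threshold
total_cost <- function(threshold) {
  pred <- as.integer(p_hat >= threshold)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  cost_fp * fp + cost_fn * fn
}

# Predict positive when p * cost_fn > (1 - p) * cost_fp,
# i.e. when p exceeds cost_fp / (cost_fp + cost_fn)
cost_aware_threshold <- cost_fp / (cost_fp + cost_fn)

cost_default    <- total_cost(0.5)
cost_cost_aware <- total_cost(cost_aware_threshold)

data.frame(
  threshold = c(0.5, cost_aware_threshold),
  cost      = c(cost_default, cost_cost_aware)
)
```

With a false negative five times as costly as a false positive, the cost-aware threshold (about 0.17) gives a lower total cost on this toy table than the default 0.5, which is the motivation for revisiting thresholds before comparing metrics.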
Day 1 (Monday): Foundations and Regularization
Learning goals
- Understand supervised learning setup for regression and classification on genes (simulation) and Palmer Penguins (real)
- Use linear and logistic regression to frame overfitting vs underfitting
- Learn why regularization is needed and when to use ridge vs lasso
- Draft a data card (sources, leakage, subgroups) and connect it to train/validation design
Sequence
- Framing predictive modeling: data split, validation, and generalization (tie each choice to the data card)
- Linear regression as baseline and bias–variance discussion (synthetic gene–trait story in notebooks; penguins for real EDA in parallel)
- Logistic regression for classification intuition (same patient/scientific framing as Thursday metrics)
- Overfitting/underfitting diagnostics (learning curves; train vs validation) on a held-out split (genes or penguins)
- Regularization: ridge and lasso; cross-validation (LOO and k-fold figures) to choose the penalty λ; shrinkage paths; preview elastic net (Module 02 + Day 1 CV slides)
- Calibration intuition (light) — reliability of predicted probabilities even within linear models; sets up Thursday without heavy math
- Transition to algorithmic methods + assign reproducibility mini-task (directory + render script)
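The ridge vs lasso contrast in this sequence can be sketched on simulated genes where the true coefficients are known. This is a minimal illustration rather than the course notebook: it assumes the `glmnet` package is installed and calls it directly, and the sample size, number of genes, and effect sizes are made up.

```r
library(glmnet)

set.seed(42)

# Simulated "genes": only the first three have a true effect (assumed values)
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("gene", seq_len(p))
beta_true <- c(2, -1.5, 1, rep(0, p - 3))
y <- as.numeric(x %*% beta_true + rnorm(n))

# 10-fold cross-validation over the penalty; alpha = 0 is ridge, alpha = 1 is lasso
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

# Coefficients at the CV-chosen penalty, dropping the intercept (first row)
coef_ridge <- as.matrix(coef(cv_ridge, s = "lambda.min"))[-1, 1]
coef_lasso <- as.matrix(coef(cv_lasso, s = "lambda.1se"))[-1, 1]

# Ridge shrinks every coefficient but keeps it; lasso sets some exactly to zero
n_nonzero_ridge <- sum(coef_ridge != 0)
n_nonzero_lasso <- sum(coef_lasso != 0)
c(ridge = n_nonzero_ridge, lasso = n_nonzero_lasso)

# Shrinkage path: coefficient values as the penalty grows
plot(cv_lasso$glmnet.fit, xvar = "lambda")
```

Because the simulation fixes which genes matter, students can check whether the lasso's surviving coefficients match `beta_true`, something the penguin table cannot offer.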
Day 2 (Tuesday): Trees, rpart, and the tidymodels Pipeline
Learning goals
- Compare mathematical and algorithmic modeling mindsets (two cultures)
- Build intuition for decision trees (`rpart`) on gene simulations and Palmer Penguins
- Master `recipe → spec → workflow → tune → fit` with `decision_tree` + `rpart` and accuracy on CV
- Establish reproducible project layout (microbiome lab)
Sequence
- Breiman two cultures; gene `rpart` (classification + regression)
- Incremental pipeline walkthrough on penguins (`day02-tidymodels-walkthrough.qmd`, also Module 04)
- `tune_grid` for `tree_depth` and `min_n`; tuning plot, tree plot, light VIP
- Microbiome lab: same grammar on OTU data (grouped CV)
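The recipe → spec → workflow → tune → fit grammar named above might look roughly like the sketch below. It is not the contents of the walkthrough file. It assumes the `tidymodels`, `rpart`, and `palmerpenguins` packages are installed; the outcome (`species`), the predictors inside `rec_base`, the split proportions, and the grid ranges are all illustrative choices.

```r
library(tidymodels)

set.seed(123)

# Real table: drop rows with missing values for a simple first pass (assumption)
peng <- tidyr::drop_na(palmerpenguins::penguins)

# Split first, then resample only the training portion
peng_split <- initial_split(peng, prop = 0.8, strata = species)
peng_train <- training(peng_split)
folds <- vfold_cv(peng_train, v = 5, strata = species)

# recipe: declares outcome and predictors (contents assumed for this sketch)
rec_base <- recipe(
  species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
  data = peng_train
)

# spec: a decision tree on the rpart engine with two tunable parameters
tree_spec <- decision_tree(tree_depth = tune(), min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

# workflow: recipe + spec bundled together
tree_wf <- workflow() |>
  add_recipe(rec_base) |>
  add_model(tree_spec)

# tune: 3 x 3 regular grid, scored by accuracy on the CV folds
tree_grid <- grid_regular(
  tree_depth(range = c(1L, 6L)),
  min_n(range = c(2L, 20L)),
  levels = 3
)
tuned <- tune_grid(
  tree_wf,
  resamples = folds,
  grid = tree_grid,
  metrics = metric_set(accuracy)
)

# fit: lock in the best parameters, refit on training data, score on the test set
best_params <- select_best(tuned, metric = "accuracy")
final_fit <- tree_wf |>
  finalize_workflow(best_params) |>
  last_fit(peng_split)

test_metrics <- collect_metrics(final_fit)
test_metrics
```

Keeping the recipe, spec, and workflow as separate objects is what lets a later session replace only the spec or add recipe steps while the rest of the code stays unchanged.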
Day 4 (Thursday): Models, Preprocessing, Metrics, and Importance
Deck: day-04-thursday.html (includes: day04-*.qmd)
Learning goals
- Swap `parsnip` specs (forest, boosting, MLP) on Tuesday’s `rec_base`
- PCA, imputation, resampling as recipe steps; accuracy remains the metric until class imbalance is introduced
- Metrics beyond accuracy, confusion matrices, ROC/PR on imbalanced data
- VIP and non-causal interpretation (Module 06)
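Swapping `parsnip` specs on a shared recipe could be sketched as follows. This is an illustration under assumptions: the contents of `rec_base` are guessed, only a tree and a random forest are shown, and the `ranger` and `palmerpenguins` packages are assumed to be installed. Boosting and MLP specs would slot into the same list with their own engines.

```r
library(tidymodels)

set.seed(2024)

peng <- tidyr::drop_na(palmerpenguins::penguins)
folds <- vfold_cv(peng, v = 5, strata = species)

# Shared recipe (predictors assumed for this sketch)
rec_base <- recipe(
  species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
  data = peng
)

# Only the model spec changes between candidates
specs <- list(
  tree = decision_tree() |>
    set_engine("rpart") |>
    set_mode("classification"),
  forest = rand_forest(trees = 200) |>
    set_engine("ranger") |>
    set_mode("classification")
)

# Same recipe, same folds, same metric for every spec
results <- lapply(names(specs), function(nm) {
  workflow() |>
    add_recipe(rec_base) |>
    add_model(specs[[nm]]) |>
    fit_resamples(resamples = folds, metrics = metric_set(accuracy)) |>
    collect_metrics() |>
    mutate(model = nm)
}) |>
  bind_rows()

results[, c("model", ".metric", "mean", "std_err")]
```

Reusing identical folds for each spec keeps the accuracy comparison paired, so differences reflect the model rather than the resampling draw.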
Sequence (matches slides)
- Part A: recap setup + stressed scenarios
- Part B: model catalog; bagging/boosting/MLP; multi-model `fit_resamples` (accuracy)
- Part C–D: `step_pca()`, `step_impute_*()` (accuracy)
- Part E: imbalance, metrics toolbox, confusion matrices, compare upsampling
- Part F: VIP on random forest; SHAP on penguin sex (no species) — game-theory intuition, beeswarm + waterfall, interpretation guardrails (Module 06)
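Why accuracy stops being enough under imbalance (Part E) can be shown with a tiny `yardstick` sketch. The data are fabricated for illustration: 10 cases, 90 controls, and a hypothetical classifier that always predicts "control".

```r
library(yardstick)

# Fabricated labels: 10% prevalence; "case" is the first level, so it is
# treated as the event by yardstick's default event_level = "first"
truth <- factor(c(rep("case", 10), rep("control", 90)),
                levels = c("case", "control"))

# A degenerate classifier that never predicts the rare class
estimate <- factor(rep("control", 100), levels = c("case", "control"))

toy <- data.frame(truth = truth, estimate = estimate)

# Confusion matrix: every case is missed
conf_mat(toy, truth = truth, estimate = estimate)

# A small metrics toolbox beyond accuracy
imbalance_metrics <- metric_set(accuracy, sensitivity, specificity, bal_accuracy)
res <- imbalance_metrics(toy, truth = truth, estimate = estimate)
res
```

Accuracy comes out at 0.9 while sensitivity is 0 and balanced accuracy is 0.5, which is the gap the confusion matrix and the cost discussion from Monday are meant to expose.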
Delivery Notes
- Exercises are run separately from this hub but should reference Palmer Penguins (and optionally gene simulations) and reuse the data/model card templates.
- Each day has one dedicated revealjs deck; optional live R snippets or board work replace separate app demos.
- The same `tidymodels` pipeline grammar is reused from Day 2 (Tuesday) onward; Day 4 (Thursday) swaps engines and adds recipe steps (PCA, impute, upsample), not project structure.
- `caret` removed from the hub; splits/scaling/metrics use `tidymodels` / `yardstick` in slides, modules, and notebooks.
- Student `.Rmd` notebooks live under `notebooks/`: gene pair unchanged; penguin pair added; align exercises with the same penguin export where possible.
Optional extensions (only if time remains)
- Deeper causal half-day (DAGs, backdoor adjustment) as a parallel track for quantitatively mature audiences.
- Guest deployment expanded into a full lab with authentication and logging.
- Versioned TA grading rubrics published to students on day one for every notebook hand-in.