Day 1 (Monday): Foundations and Regularization

Part A - Setup

Tools we use this week

  • R - language for statistics, graphics, and reproducible scripts
  • RStudio or Positron - console, editor, projects, and Render for .qmd / .Rmd
  • tidyverse - shared data and plot grammar (dplyr, ggplot2, …); you will see the pipe |> in notebooks
  • tidymodels - recipes, workflows, and resampling for modeling pipelines (Tuesday onward); Monday stays mostly familiar lm / glm
  • Install palmerpenguins before Tuesday - we use it in the penguin notebooks, not in today’s gene simulations

Details and installs: Student notebooks

Overview

  • Day 1 (Monday): classical statistics recap; toy linear regression; Gene × trait and Gene × disease (null → full → AIC → lasso)
  • Day 2 (Tuesday): two cultures; decision trees; Palmer Penguins train/test holdout and tidymodels pipeline
  • Day 3 (Wednesday): mini symposium and social event (no lecture deck on this site)
  • Day 4 (Thursday): forests, boosting, neural nets; PCA / impute / upsample; interpretability and metrics

Learning objectives

  • Navigate RStudio / Positron: open a project, load a table, and make a simple ggplot2 figure
  • Know which other tools we use this week (tidyverse, tidymodels from Tuesday onward)
  • Work through two synthetic gene datasets (continuous trait; binary disease) with known ground truth
  • See underfitting (intercept-only) vs overfitting risk (all predictors) on the same data
  • Use forward and backward AIC selection, then lasso as a continuous alternative
  • Explain cross-validation (leave-one-out and k-fold) for choosing tuning parameters such as lasso \lambda (cv.glmnet) - not a final test-set lockbox today

RStudio / Positron intro for Monday slides. Packages: tidyverse (base R faithful for a quick demo).

R and RStudio

  • R - the language: objects, functions, statistics, graphics
  • RStudio Desktop - a workbench around R (editor, console, plots, files)
  • Positron - Posit’s newer IDE; same habits (script → console → plots)
  • R Project - open the course folder as a project so paths and history stay in one place

RStudio layout (four panes)

Source / Editor - write .R scripts; run lines with shortcuts Environment - data frames and objects you have created
Console - type commands; see output and errors Files / Plots / Packages / Help - browse files; view figures; install packages

Figure 1

Install R, then RStudio

  1. R from CRAN (or use your lab’s managed install)
  2. RStudio Desktop from Posit - RStudio download
  3. Open RStudio → Console → check it works:
R.version.string

Scripts: save your work

  • File → New File → R Script (or Ctrl/Cmd + Shift + N)
  • Put the cursor on a line → Run sends it to the Console:
    • Windows / Linux: Ctrl + Enter
    • macOS: Cmd + Enter
  • Lines starting with # are comments (notes for humans; R ignores them)
  • Save the .R file and re-run from the top when you change something - that is reproducibility in miniature

Old Faithful - our ggplot demo data

The built-in faithful table records eruption duration and waiting time between eruptions at Yellowstone (minutes). Data card

Photo: Old Faithful geyser, Yellowstone (Wikimedia Commons, CC BY-SA 4.0).

Load data: built-in table

Any rectangular table works. Here we use faithful - no extra package required:

library(tidyverse)

faithful_df <- as_tibble(faithful)

Also common in projects: a file in your project folder, read with read_csv() from readr (part of tidyverse):

# my_table <- read_csv("data/my_measurements.csv")

Paths are relative to the project root when you use an RStudio Project.

First look: glimpse()

glimpse(faithful_df)
Rows: 272
Columns: 2
$ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
$ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …

ggplot2 Grammar

Data and aesthetic mappings (aes: x, y, colour, …) +

geometric layers (geom_point, geom_smooth, …) +

labs() +

theme_*(),

Combine different steps with + at end of line, new command can start in next line.

Example plot - scatter

ggplot(faithful_df, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.7, size = 2, color = "steelblue") +
  labs(
    title = "Old Faithful: eruption length vs waiting time",
    x = "Eruption duration (min)",
    y = "Waiting time (min)"
  ) +
  theme_minimal()

Example plot - add a layer (geom_smooth)

Each + adds another layer or adjustment:

ggplot(faithful_df, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.6, size = 1.8, color = "gray40") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
  labs(
    title = "Same data with a linear smooth",
    x = "Eruption duration (min)",
    y = "Waiting time (min)"
  ) +
  theme_minimal()

Notebook exercise

Open any notebook from the notebook index and try Render in RStudio.

Bridge: from tooling to modeling

You now have the R/tidyverse toolkit for loading data and plotting. Part B switches to classical linear models: what we estimate, how we interpret coefficients, and when plain OLS struggles.

Afternoon lab — Task 1.1

Tip

Load and prepare the data — merge metadata and OTU table on sample_id; transpose to samples × OTUs; prevalence filter; inspect Label, Sex, and Individual.

Lab exercises · Solution · Download Rmd

Part B - Classical linear models

Classical statistics

  • Data are noisy; models are simplifying mathematical statements about patterns, not automatic truths about biology

  • We estimate unknown quantities and (when the design supports it) quantify uncertainty (standard errors, intervals)

  • Association \neq causation - a significant coefficient is still conditional on everything else in the model and on what was actually measured

Classical vs predictive wording

  • Classical question: Given a regression model, what do we learn about the slope \beta_j, and how fragile is it?

  • Answer is usually an estimate plus a confidence interval.

  • Prediction question: If we freeze this fitted equation, how well does it score on new data points?

  • Answer is usually a model and its performance on new unseen data.

  • Today: all models are fit and compared on the same simulated samples; Tuesday introduces held-out evaluation (i.e., on new, unseen data)

Multiple linear regression

  • Outcome y;
  • predictors x_1, \ldots, x_p
  • Linear model: \mathbb{E}(y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

additivity in the chosen predictor columns

Key assumptions

  • Mean structure: if the truth is a non-linear pattern, a single plane leaves patterned residuals - check the residuals vs fitted slide above
  • Independent measurements: violated by repeated measures, batches, or spatial proximity - then naive confidence intervals become inaccurate
  • Comparable error spread across fitted values (heteroscedasticity \Rightarrow cautious intervals)
  • Roughly normal errors - check the Q-Q plot; matters most for small-n confidence intervals
  • Influence: a handful of extreme points can pull slopes - always look at diagnostics before you tell a story

One explanatory variable: line + scatter

Each vertical “slice” is a Normal density centered on the fitted line - spread of y at a given x.

Fit the model

Fitting is done by minimizing sum of squared residuals (oridnary least squares, OLS).

Figure 2

Q-Q plot of residuals

Checks whether residuals look roughly Normal (one assumption behind classical intervals).

Residuals vs fitted values

Look for pattern (curve, funnel shape) - not just random scatter around zero.

Two predictors: a plane in 3D

Multiple regression with two predictors is a plane in (x_1, x_2, y) space (more predictors → hyperplane).

The meaning of regression coefficients \hat{\beta}_j

  • Each slope is “holding other predictors constant” - interpretable only in that mathematical sense, not as a controlled experiment unless design matches
  • Correlated predictors inflate standard errors and swap signs of coefficients easily - regularization (lasso, later today) is one way to handle such data
  • Many predictors, modest n: plain ordinary least sqaures (OLS) is unstable or non-unique

The data (MASS::cement)

Data card

Hald cement (first 5 of 13 rows)
x1 x2 x3 x4 y
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
  • Outcome y: heat evolved (calories per gram cement) - left on the original scale
  • Predictors x1-x4: oxide composition - we standardize them (scale()) so slopes are comparable and units do not dominate the fit

Standardize predictors first

hald_scaled[, c("x1", "x2", "x3", "x4")] <- scale(hald[, c("x1", "x2", "x3", "x4")])
# All regressions below use hald_scaled (z-scored x1-x4, original y)
  • Each \hat\beta_j is the change in y (heat) per 1 SD increase in x_j, holding other predictors fixed
  • Correlations among predictors are unchanged by scaling - but coefficient tables are easier to read

One predictor at a time vs all four together

Simple vs four-predictor model (scaled predictors)
predictor simple_estimate simple_p full_estimate full_p simple_sig full_sig
x1 10.992711 0.004552 9.124198 0.070822 yes no
x2 12.279477 0.000665 7.938657 0.500901 yes no
x3 -8.043437 0.059762 0.652743 0.895923 no no
x4 -12.355485 0.000576 -2.411319 0.844071 yes no
  • Simple fits: x1, x2, x4 significant; x3 borderline
  • All four together: no predictor significant at \alpha = 0.05 (x1 borderline) - joint inference breaks down when columns overlap (see full_estimate / full_p in the table above)

Correlations among x1-x4

Strong negative pairs: x2-x4 (r \approx -0.97), x1-x3 (r \approx -0.82). x3-x4 are nearly uncorrelated (r \approx 0.03); x1-x4 weakly negative (r \approx -0.25).

Takeaway

  • Multiple linear regression is not the same as multiple simple linear regressions

  • Signs and p-values can flip when you add or drop correlated columns

  • Which is the “correct” model?

  • Later today: lasso and AIC help navigate this kind of data

Bridge: from toy regression to high-dimensional genes

Part C keeps a continuous trait but adds many correlated predictors — the setting where AIC stepwise and lasso matter.

Part C - Gene Expression (continuous response variable)

A synthetic dataset

  • We use a synthetic dataset for illustrative purposes. This has the advantage that we know exactly what’s going on.
  • In this simulated dataset, GeneA and GeneB are genes that directly influence a specific biological trait.
  • GeneC and GeneD are noise genes that are correlated with the trait but do not contribute causally.
  • GeneJ is in strongly correlated to GeneA but not causally linked to the trait.
  • Other genes act as irrelevant noise variables or have correlations among themselves.
  • trait = -0.5 * GeneA + 0.8 * GeneB + N(0, 0.3)
  • more details: Linear regression + LASSO (genes)

Gene × trait: the dataset

GeneA × GeneB - ground-truth surface

Pairwise Plots

Distribution of explanatory variables (gene expression levels) and response variable (quantitative trait)

Figure 3

Overfitting vs underfitting

  • Underfitting: model too simple - high error even on data used to fit (e.g. intercept-only, ignores genes)

  • Overfitting risk: model very flexible - can fit noise; unstable coefficients when predictors overlap

  • Null model (trait ~ 1): one parameter, ignores all genes → underfitting

  • Full model (trait ~ .): all genes → lower training error, but many correlated predictors → shaky \hat{\beta}_j

  • Goal: find a complexity level that balances fit and stability

Null model (intercept only)

Too simple: ignores genes → underfitting.

Null model: intercept only (scaled trait)
term estimate std.error statistic p.value
(Intercept) 0.0352 0.0742 0.4747 0.6357

Single intercept; no gene effects in the model.

Full model (all genes)

Every gene in the model - flexible, but many correlated predictors → unstable coefficients (overfitting risk).

Null vs full: in-sample fit

AIC: backward and forward selection

Backward AIC formula: trait ~ GeneA + GeneB + GeneC + GeneD

Forward AIC formula: trait ~ GeneB + GeneA + GeneD + GeneC

AIC vs full OLS: coefficients (+/- 2 SE)

Dashed lines: known effects for GeneB (0.8) and GeneA (-0.5).

Bridge: from AIC to lasso

AIC searches over discrete subsets of predictors. Lasso keeps one smooth penalty knob (\lambda) and often sets many coefficients to exactly zero.

Lasso as an alternative to discrete search

  • Idea: penalize large coefficients: lasso adds \lambda \sum_j |\beta_j| penalty to OLS
  • \lambda controls strength: larger \lambda \Rightarrow simpler model (more shrinkage toward zero)
  • Discrete AIC picks variables in or out; lasso shrinks many coefficients toward exactly zero as \lambda grows
  • How to choose \lambda? Cross-validation on the next slides - then cv.glmnet and the CV error curve

Why cross-validation?

  • Lasso has a tuning knob \lambda: too small → overfits; too large → underfits.
  • Fitting and scoring on the same rows makes every \lambda look artificially good.
  • Cross-validation (CV) repeatedly holds out part of the data, fits on the rest, scores on the holdout, then averages those scores.

The CV loop (in words)

  1. Split rows into training and validation (one row for LOO, one fold for k-fold).
  2. fit(model, train) on training rows only (for lasso: at a fixed \lambda).
  3. predict(model, validation) on held-out rows; score error (MSE or deviance).
  4. Repeat for each holdout split; average errors across rounds.
  5. Compare averages across a grid of \lambda values; pick the winner (or use the 1-SE rule in cv.glmnet).
  • This is not the same as a final test set lockbox (Tuesday): today we use CV only to choose \lambda on the same n rows we already have.

Leave-one-out CV (LOO-CV)

  • n rounds for n rows: each round holds out exactly one row as validation, trains on the other n - 1.
  • Pros: uses almost all data each round; standard for cv.glmnet(..., nfold = n).
  • Cons: n fits per \lambda — slow when n is large.
  • Red = training rows; light blue = held-out row (fitpredictscore); rotate through all n rounds, then average errors.

LOO-CV animation

Animation: LOOCV.gif by MBanuelos22, Wikimedia Commons, CC BY-SA 4.0.

k-fold CV

  • Partition rows into k folds (often k = 5 or 10). Each round, one fold is validation and the other k - 1 folds are training.
  • Pros: only k fits per \lambda — faster than LOO for large n.
  • Cons: slightly more variable than LOO because each fit sees fewer rows.
  • Tuesday: vfold_cv() in tidymodels automates this for trees and other models.
  • Each column below is one CV round: fit on training folds (dark), predict held-out fold (light), score error, then average across rounds.

k-fold CV schematic

Figure: K-fold cross validation EN.svg by Gufosowa, Wikimedia Commons, CC BY-SA 4.0.

Lasso: pick \lambda with LOO-CV

  • Try many \lambda values; for each, cv.glmnet averages LOO prediction error.
  • lambda.min: lowest CV error; lambda.1se: simpler model within one standard error of the minimum (our default).
  • Lab 1D (optional): manual 5-fold loop vs cv.glmnet — same idea, different k. Archived write-up: manual k-fold CV.

Lasso: CV error vs \lambda (from cv.glmnet)

Lasso coefficient paths

Summary: all approaches (in-sample, same data)

In-sample comparison on all n samples.

Model n_coef AIC RMSE_in R2_in
Null (intercept only) 1 399.889 0.905 0.000
OLS full 11 82.346 0.294 0.895
AIC backward 5 73.024 0.297 0.893
AIC forward 5 73.024 0.297 0.893
Lasso (lambda.1se, LOO-CV) 3 NA 0.314 0.890

Afternoon lab — Task 1.2

Tip

Transform counts — apply log1p to the OTU matrix; optionally plot library size by Label.

Lab exercises · Solution · Download Rmd

Part D - Gene × disease (binary)

Another synthetic data set

  • Binary response variable, indicating the presence or absence of a disease trait
  • Number of Samples (n): 200
  • Number of Features (p): 10 genes (named Gene1 to Gene10)
  • Gene Expression Data: Simulated as normally distributed random values.

Causal Genes:

  • Gene1: Positive effect with a coefficient of 0.5, plus interaction with Gene3.
  • Gene3: Positive effect with a coefficient of 0.8, plus interaction with Gene1.
  • Gene4: Negative effect with a coefficient of -0.3.

Another synthetic data set (cont.)

Response variable generated based on a linear combination of Gene1, Gene3, and Gene4 with some added noise:

disease_presence <- ifelse(
0.5 * Gene1 +  0.8 * Gene3 - 0.3 * Gene4 - 
 2  * Gene1 * Gene3 * + N(0,0.5) > 0, 
1, 0)

Logistic regression

We can predict binary outcomes using logistic regression

  • Outcome y \in \{0, 1\} (here: disease absent / present)
  • Model log-odds as linear in predictors: \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots
  • Gives probabilities in (0, 1)
  • Classify with a threshold probability (often 0.5)
  • Same modeling workflow as in the qualitative-trait notebook: null → full → AIC → lasso on all samples

Notebook: logistic-regression-gene-disease.Rmd

Afternoon lab — Task 1.3

Tip

Explore with PCAprcomp on scaled log1p OTUs; plot PC1 vs PC2 coloured by Label.

Lab exercises · Solution · Download Rmd

Full logistic model: what it misses

  • Cannot capture the Gene1 × Gene3 interaction (misses Gene1’s role)
  • Picks some signal genes but also a noise gene (Gene8)

Null model (intercept only)

Predicts the overall disease rate - ignores role of genes → underfitting.

Null GLM: intercept only (overall disease rate)
term estimate std.error statistic p.value
(Intercept) 0.2819 0.1428 1.9734 0.0485

Single intercept; no gene effects in the model.

Full logistic model (all genes)

Logistic decision boundary (Gene1 × Gene3)

Smooth P(disease) regions - compare to axis-aligned tree partitions on Tuesday (white line = P = 0.5).

Afternoon lab — Task 1.4

Tip

Logistic regression on PCs — Early vs Lateglm(Label ~ PC1 + …, family = binomial) on a few principal components (not raw OTUs).

Lab exercises · Solution · Download Rmd

Null vs full: in-sample misclassification

AIC: backward and forward selection

Backward AIC formula: disease_presence ~ Gene1 + Gene3 + Gene4 + Gene8

Forward AIC formula: disease_presence ~ Gene3 + Gene4 + Gene8 + Gene1

Afternoon lab — Task 1.5

Tip

Stepwise AIC — Early vs LateMASS::stepAIC on the PC logistic model; compare to the full PC formula.

Lab exercises · Solution · Download Rmd

Lasso for classification

  • Same \lambda and LOO-CV story as the trait example (cross-validation) — glmnet(..., family = "binomial", nfold = n)
  • Then compare null, full glm, AIC models, and lasso in one in-sample summary table

Lasso logistic: cross-validated choice of \lambda

Afternoon lab — Task 1.6

Tip

Lasso — Early vs Lateglmnet (binomial) on all log1p OTUs; compare sparse OTUs to the AIC model (no PCA needed for lasso).

Lab exercises · Solution · Download Rmd

Summary: all approaches (in-sample, same data)

In-sample classification on all n samples.
Model n_coef AIC Accuracy_in Error_rate_in
Null (intercept only) 1 275.326 0.570 0.430
GLM full 11 249.190 0.740 0.260
AIC backward 5 241.608 0.750 0.250
AIC forward 5 241.608 0.750 0.250
Lasso (lambda.1se, LOO-CV) 3 NA 0.675 0.325

Afternoon lab — Task 1.7

Tip

Repeat for Sex — run tasks 1.4–1.6 with outcome Sex (F vs M).

Lab exercises · Solution · Download Rmd

Elastic net (bridge between ridge and lasso)

  • glmnet uses mixing parameter \alpha \in [0, 1]: \alpha = 0 is pure ridge, \alpha = 1 is pure lasso, values in between blend L_1 and L_2 penalties.
  • Useful when you want some grouping/shrinkage of correlated predictors (ridge-like) but still allow some exact zeros (lasso-like).

Part E - Wrap-up

Afternoon lab — Task 1.8

Tip

Compare approaches — summary table: method × outcome × # predictors × in-sample accuracy; one sentence on PCA+glm vs lasso.

Lab exercises · Solution · Download Rmd

Student notebooks (Day 1 focus)

Have a look at notebooks. Open one and try to render in RStudio.

End-of-day checklist

  • Can you explain null vs full on the same data - which underfits, which risks overfitting?
  • Can you contrast discrete model search (AIC) with continuous shrinkage (lasso)?
  • What does \lambda do? Contrast leave-one-out vs k-fold CV for choosing it.
  • Did you open at least one student .Rmd and run all chunks?
  • Does the notebook make sense to you?
  • Do you understand what is happening in the notebooks?