Day 1 (Monday): Foundations and Regularization

Part A - Setup

Tools we use this week

R - language for statistics, graphics, and reproducible scripts
RStudio or Positron - console, editor, projects, and Render for .qmd / .Rmd
tidyverse - shared data and plot grammar (dplyr, ggplot2, …); you will see the pipe |> in notebooks
tidymodels - recipes, workflows, and resampling for modeling pipelines (Tuesday onward); Monday stays mostly familiar lm / glm
Install palmerpenguins before Tuesday - we use it in the penguin notebooks, not in today’s gene simulations

Details and installs: Student notebooks

Overview

Day 1 (Monday): classical statistics recap; toy linear regression; Gene × trait and Gene × disease (null → full → AIC → lasso)
Day 2 (Tuesday): two cultures; decision trees; Palmer Penguins train/test holdout and tidymodels pipeline
Day 3 (Wednesday): mini symposium and social event (no lecture deck on this site)
Day 4 (Thursday): forests, boosting, neural nets; PCA / impute / upsample; interpretability and metrics

Learning objectives

Navigate RStudio / Positron: open a project, load a table, and make a simple ggplot2 figure
Know which other tools we use this week (tidyverse, tidymodels from Tuesday onward)
Work through two synthetic gene datasets (continuous trait; binary disease) with known ground truth
See underfitting (intercept-only) vs overfitting risk (all predictors) on the same data
Use forward and backward AIC selection, then lasso as a continuous alternative
Explain cross-validation (leave-one-out and k-fold) for choosing tuning parameters such as lasso \lambda (cv.glmnet) - not a final test-set lockbox today

RStudio / Positron intro for Monday slides. Packages: tidyverse (base R faithful for a quick demo).

R and RStudio

R - the language: objects, functions, statistics, graphics
RStudio Desktop - a workbench around R (editor, console, plots, files)
Positron - Posit’s newer IDE; same habits (script → console → plots)
R Project - open the course folder as a project so paths and history stay in one place

RStudio layout (four panes)

Source / Editor - write `.R` scripts; run lines with shortcuts	Environment - data frames and objects you have created
Console - type commands; see output and errors	Files / Plots / Packages / Help - browse files; view figures; install packages

Figure 1

Install R, then RStudio

R from CRAN (or use your lab’s managed install)
RStudio Desktop from Posit - RStudio download
Open RStudio → Console → check it works:

R.version.string

Scripts: save your work

File → New File → R Script (or Ctrl/Cmd + Shift + N)
Put the cursor on a line → Run sends it to the Console:
- Windows / Linux: Ctrl + Enter
- macOS: Cmd + Enter
Lines starting with # are comments (notes for humans; R ignores them)
Save the .R file and re-run from the top when you change something - that is reproducibility in miniature

Old Faithful - our ggplot demo data

The built-in faithful table records eruption duration and waiting time between eruptions at Yellowstone (minutes). Data card

Photo: Old Faithful geyser, Yellowstone (Wikimedia Commons, CC BY-SA 4.0).

Load data: built-in table

Any rectangular table works. Here we use faithful - no extra package required:

library(tidyverse)

faithful_df <- as_tibble(faithful)

Also common in projects: a file in your project folder, read with read_csv() from readr (part of tidyverse):

# my_table <- read_csv("data/my_measurements.csv")

Paths are relative to the project root when you use an RStudio Project.

First look: `glimpse()`

glimpse(faithful_df)

Rows: 272
Columns: 2
$ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.95…
$ waiting   <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83, …

ggplot2 Grammar

Data and aesthetic mappings (aes: x, y, colour, …) +

geometric layers (geom_point, geom_smooth, …) +

labs() +

theme_*(),

Combine different steps with + at end of line, new command can start in next line.

Example plot - scatter

ggplot(faithful_df, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.7, size = 2, color = "steelblue") +
  labs(
    title = "Old Faithful: eruption length vs waiting time",
    x = "Eruption duration (min)",
    y = "Waiting time (min)"
  ) +
  theme_minimal()

Example plot - add a layer (`geom_smooth`)

Each + adds another layer or adjustment:

ggplot(faithful_df, aes(x = eruptions, y = waiting)) +
  geom_point(alpha = 0.6, size = 1.8, color = "gray40") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
  labs(
    title = "Same data with a linear smooth",
    x = "Eruption duration (min)",
    y = "Waiting time (min)"
  ) +
  theme_minimal()

Notebook exercise

Open any notebook from the notebook index and try Render in RStudio.

Bridge: from tooling to modeling

You now have the R/tidyverse toolkit for loading data and plotting. Part B switches to classical linear models: what we estimate, how we interpret coefficients, and when plain OLS struggles.

Afternoon lab — Task 1.1

Tip

Load and prepare the data — merge metadata and OTU table on sample_id; transpose to samples × OTUs; prevalence filter; inspect Label, Sex, and Individual.

Lab exercises · Solution · Download Rmd

Part B - Classical linear models

Classical statistics

Data are noisy; models are simplifying mathematical statements about patterns, not automatic truths about biology
We estimate unknown quantities and (when the design supports it) quantify uncertainty (standard errors, intervals)
Association \neq causation - a significant coefficient is still conditional on everything else in the model and on what was actually measured

Classical vs predictive wording

Classical question: Given a regression model, what do we learn about the slope \beta_j, and how fragile is it?
Answer is usually an estimate plus a confidence interval.
Prediction question: If we freeze this fitted equation, how well does it score on new data points?
Answer is usually a model and its performance on new unseen data.
Today: all models are fit and compared on the same simulated samples; Tuesday introduces held-out evaluation (i.e., on new, unseen data)

Multiple linear regression

Outcome y;
predictors x_1, \ldots, x_p
Linear model: \mathbb{E}(y \mid X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

additivity in the chosen predictor columns

Key assumptions

Mean structure: if the truth is a non-linear pattern, a single plane leaves patterned residuals - check the residuals vs fitted slide above
Independent measurements: violated by repeated measures, batches, or spatial proximity - then naive confidence intervals become inaccurate
Comparable error spread across fitted values (heteroscedasticity \Rightarrow cautious intervals)
Roughly normal errors - check the Q-Q plot; matters most for small-n confidence intervals
Influence: a handful of extreme points can pull slopes - always look at diagnostics before you tell a story

One explanatory variable: line + scatter

Each vertical “slice” is a Normal density centered on the fitted line - spread of y at a given x.

Fit the model

Fitting is done by minimizing sum of squared residuals (oridnary least squares, OLS).

Figure 2

Q-Q plot of residuals

Checks whether residuals look roughly Normal (one assumption behind classical intervals).

Residuals vs fitted values

Look for pattern (curve, funnel shape) - not just random scatter around zero.

Two predictors: a plane in 3D

Multiple regression with two predictors is a plane in (x_1, x_2, y) space (more predictors → hyperplane).

The meaning of regression coefficients \hat{\beta}_j

Each slope is “holding other predictors constant” - interpretable only in that mathematical sense, not as a controlled experiment unless design matches
Correlated predictors inflate standard errors and swap signs of coefficients easily - regularization (lasso, later today) is one way to handle such data
Many predictors, modest n: plain ordinary least sqaures (OLS) is unstable or non-unique

The data (`MASS::cement`)

Data card

Hald cement (first 5 of 13 rows)
x1	x2	x3	x4	y
7	26	6	60	78.5
1	29	15	52	74.3
11	56	8	20	104.3
11	31	8	47	87.6
7	52	6	33	95.9

Outcome y: heat evolved (calories per gram cement) - left on the original scale
Predictors x1-x4: oxide composition - we standardize them (scale()) so slopes are comparable and units do not dominate the fit

Standardize predictors first

hald_scaled[, c("x1", "x2", "x3", "x4")] <- scale(hald[, c("x1", "x2", "x3", "x4")])
# All regressions below use hald_scaled (z-scored x1-x4, original y)

Each \hat\beta_j is the change in y (heat) per 1 SD increase in x_j, holding other predictors fixed
Correlations among predictors are unchanged by scaling - but coefficient tables are easier to read

One predictor at a time vs all four together

Simple vs four-predictor model (scaled predictors)
predictor	simple_estimate	simple_p	full_estimate	full_p	simple_sig	full_sig
x1	10.992711	0.004552	9.124198	0.070822	yes	no
x2	12.279477	0.000665	7.938657	0.500901	yes	no
x3	-8.043437	0.059762	0.652743	0.895923	no	no
x4	-12.355485	0.000576	-2.411319	0.844071	yes	no

Simple fits: x1, x2, x4 significant; x3 borderline
All four together: no predictor significant at \alpha = 0.05 (x1 borderline) - joint inference breaks down when columns overlap (see full_estimate / full_p in the table above)

Correlations among `x1`-`x4`

Strong negative pairs: x2-x4 (r \approx -0.97), x1-x3 (r \approx -0.82). x3-x4 are nearly uncorrelated (r \approx 0.03); x1-x4 weakly negative (r \approx -0.25).

Takeaway

Multiple linear regression is not the same as multiple simple linear regressions
Signs and p-values can flip when you add or drop correlated columns
Which is the “correct” model?
Later today: lasso and AIC help navigate this kind of data

Bridge: from toy regression to high-dimensional genes

Part C keeps a continuous trait but adds many correlated predictors — the setting where AIC stepwise and lasso matter.

Part C - Gene Expression (continuous response variable)

A synthetic dataset

We use a synthetic dataset for illustrative purposes. This has the advantage that we know exactly what’s going on.
In this simulated dataset, GeneA and GeneB are genes that directly influence a specific biological trait.
GeneC and GeneD are noise genes that are correlated with the trait but do not contribute causally.
GeneJ is in strongly correlated to GeneA but not causally linked to the trait.
Other genes act as irrelevant noise variables or have correlations among themselves.
trait = -0.5 * GeneA + 0.8 * GeneB + N(0, 0.3)
more details: Linear regression + LASSO (genes)

Gene × trait: the dataset

GeneA × GeneB - ground-truth surface

Pairwise Plots

Distribution of explanatory variables (gene expression levels) and response variable (quantitative trait)

Figure 3

Overfitting vs underfitting

Underfitting: model too simple - high error even on data used to fit (e.g. intercept-only, ignores genes)
Overfitting risk: model very flexible - can fit noise; unstable coefficients when predictors overlap
Null model (trait ~ 1): one parameter, ignores all genes → underfitting
Full model (trait ~ .): all genes → lower training error, but many correlated predictors → shaky \hat{\beta}_j
Goal: find a complexity level that balances fit and stability

Null model (intercept only)

Too simple: ignores genes → underfitting.

Null model: intercept only (scaled trait)
term	estimate	std.error	statistic	p.value
(Intercept)	0.0352	0.0742	0.4747	0.6357

Single intercept; no gene effects in the model.

Full model (all genes)

Every gene in the model - flexible, but many correlated predictors → unstable coefficients (overfitting risk).

Discrete complexity control: AIC stepwise search

AIC trades log-likelihood against model size: \mathrm{AIC} = 2k - 2\log L. Lower is better on the same data.
Backward starts at the full model and drops terms if AIC improves.
Forward starts at the intercept-only model and adds terms greedily up to the full model.
Linear regression + LASSO (genes)

Null vs full: in-sample fit

AIC: backward and forward selection

Backward AIC formula: trait ~ GeneA + GeneB + GeneC + GeneD

Forward AIC formula: trait ~ GeneB + GeneA + GeneD + GeneC

AIC vs full OLS: coefficients (+/- 2 SE)

Dashed lines: known effects for GeneB (0.8) and GeneA (-0.5).

Bridge: from AIC to lasso

AIC searches over discrete subsets of predictors. Lasso keeps one smooth penalty knob (\lambda) and often sets many coefficients to exactly zero.

Lasso as an alternative to discrete search

Idea: penalize large coefficients: lasso adds \lambda \sum_j |\beta_j| penalty to OLS
\lambda controls strength: larger \lambda \Rightarrow simpler model (more shrinkage toward zero)
Discrete AIC picks variables in or out; lasso shrinks many coefficients toward exactly zero as \lambda grows
How to choose \lambda? Cross-validation on the next slides - then cv.glmnet and the CV error curve

Why cross-validation?

Lasso has a tuning knob \lambda: too small → overfits; too large → underfits.
Fitting and scoring on the same rows makes every \lambda look artificially good.
Cross-validation (CV) repeatedly holds out part of the data, fits on the rest, scores on the holdout, then averages those scores.

The CV loop (in words)

Split rows into training and validation (one row for LOO, one fold for k-fold).
fit(model, train) on training rows only (for lasso: at a fixed \lambda).
predict(model, validation) on held-out rows; score error (MSE or deviance).
Repeat for each holdout split; average errors across rounds.
Compare averages across a grid of \lambda values; pick the winner (or use the 1-SE rule in cv.glmnet).

This is not the same as a final test set lockbox (Tuesday): today we use CV only to choose \lambda on the same n rows we already have.

Leave-one-out CV (LOO-CV)

n rounds for n rows: each round holds out exactly one row as validation, trains on the other n - 1.
Pros: uses almost all data each round; standard for cv.glmnet(..., nfold = n).
Cons: n fits per \lambda — slow when n is large.
Red = training rows; light blue = held-out row (fit → predict → score); rotate through all n rounds, then average errors.

LOO-CV animation

Animation: LOOCV.gif by MBanuelos22, Wikimedia Commons, CC BY-SA 4.0.

k-fold CV

Partition rows into k folds (often k = 5 or 10). Each round, one fold is validation and the other k - 1 folds are training.
Pros: only k fits per \lambda — faster than LOO for large n.
Cons: slightly more variable than LOO because each fit sees fewer rows.
Tuesday: vfold_cv() in tidymodels automates this for trees and other models.
Each column below is one CV round: fit on training folds (dark), predict held-out fold (light), score error, then average across rounds.

k-fold CV schematic

Figure: K-fold cross validation EN.svg by Gufosowa, Wikimedia Commons, CC BY-SA 4.0.

Lasso: pick \lambda with LOO-CV

Try many \lambda values; for each, cv.glmnet averages LOO prediction error.
lambda.min: lowest CV error; lambda.1se: simpler model within one standard error of the minimum (our default).
Lab 1D (optional): manual 5-fold loop vs cv.glmnet — same idea, different k. Archived write-up: manual k-fold CV.

Lasso: CV error vs \lambda (from `cv.glmnet`)

Lasso coefficient paths

Summary: all approaches (in-sample, same data)

In-sample comparison on all n samples.

Model	n_coef	AIC	RMSE_in	R2_in
Null (intercept only)	1	399.889	0.905	0.000
OLS full	11	82.346	0.294	0.895
AIC backward	5	73.024	0.297	0.893
AIC forward	5	73.024	0.297	0.893
Lasso (lambda.1se, LOO-CV)	3	NA	0.314	0.890

Afternoon lab — Task 1.2

Tip

Transform counts — apply log1p to the OTU matrix; optionally plot library size by Label.

Lab exercises · Solution · Download Rmd

Part D - Gene × disease (binary)

Another synthetic data set

Binary response variable, indicating the presence or absence of a disease trait
Number of Samples (n): 200
Number of Features (p): 10 genes (named Gene1 to Gene10)
Gene Expression Data: Simulated as normally distributed random values.

Causal Genes:

Gene1: Positive effect with a coefficient of 0.5, plus interaction with Gene3.
Gene3: Positive effect with a coefficient of 0.8, plus interaction with Gene1.
Gene4: Negative effect with a coefficient of -0.3.

Another synthetic data set (cont.)

Response variable generated based on a linear combination of Gene1, Gene3, and Gene4 with some added noise:

disease_presence <- ifelse(
0.5 * Gene1 +  0.8 * Gene3 - 0.3 * Gene4 - 
 2  * Gene1 * Gene3 * + N(0,0.5) > 0, 
1, 0)

Logistic regression

We can predict binary outcomes using logistic regression

Outcome y \in \{0, 1\} (here: disease absent / present)
Model log-odds as linear in predictors: \log\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots
Gives probabilities in (0, 1)
Classify with a threshold probability (often 0.5)
Same modeling workflow as in the qualitative-trait notebook: null → full → AIC → lasso on all samples

Notebook: logistic-regression-gene-disease.Rmd

Afternoon lab — Task 1.3

Tip

Explore with PCA — prcomp on scaled log1p OTUs; plot PC1 vs PC2 coloured by Label.

Lab exercises · Solution · Download Rmd

Full logistic model: what it misses

Cannot capture the Gene1 × Gene3 interaction (misses Gene1’s role)
Picks some signal genes but also a noise gene (Gene8)

Null model (intercept only)

Predicts the overall disease rate - ignores role of genes → underfitting.

Null GLM: intercept only (overall disease rate)
term	estimate	std.error	statistic	p.value
(Intercept)	0.2819	0.1428	1.9734	0.0485

Single intercept; no gene effects in the model.

Full logistic model (all genes)

Logistic decision boundary (Gene1 × Gene3)

Smooth P(disease) regions - compare to axis-aligned tree partitions on Tuesday (white line = P = 0.5).

Afternoon lab — Task 1.4

Tip

Logistic regression on PCs — Early vs Late — glm(Label ~ PC1 + …, family = binomial) on a few principal components (not raw OTUs).

Lab exercises · Solution · Download Rmd

Null vs full: in-sample misclassification

AIC: backward and forward selection

Backward AIC formula: disease_presence ~ Gene1 + Gene3 + Gene4 + Gene8

Forward AIC formula: disease_presence ~ Gene3 + Gene4 + Gene8 + Gene1

Afternoon lab — Task 1.5

Tip

Stepwise AIC — Early vs Late — MASS::stepAIC on the PC logistic model; compare to the full PC formula.

Lab exercises · Solution · Download Rmd

Lasso for classification

Same \lambda and LOO-CV story as the trait example (cross-validation) — glmnet(..., family = "binomial", nfold = n)
Then compare null, full glm, AIC models, and lasso in one in-sample summary table

Lasso logistic: cross-validated choice of \lambda

Afternoon lab — Task 1.6

Tip

Lasso — Early vs Late — glmnet (binomial) on all log1p OTUs; compare sparse OTUs to the AIC model (no PCA needed for lasso).

Lab exercises · Solution · Download Rmd

Summary: all approaches (in-sample, same data)

In-sample classification on all n samples.
Model	n_coef	AIC	Accuracy_in	Error_rate_in
Null (intercept only)	1	275.326	0.570	0.430
GLM full	11	249.190	0.740	0.260
AIC backward	5	241.608	0.750	0.250
AIC forward	5	241.608	0.750	0.250
Lasso (lambda.1se, LOO-CV)	3	NA	0.675	0.325

Afternoon lab — Task 1.7

Tip

Repeat for Sex — run tasks 1.4–1.6 with outcome Sex (F vs M).

Lab exercises · Solution · Download Rmd

Elastic net (bridge between ridge and lasso)

glmnet uses mixing parameter \alpha \in [0, 1]: \alpha = 0 is pure ridge, \alpha = 1 is pure lasso, values in between blend L_1 and L_2 penalties.
Useful when you want some grouping/shrinkage of correlated predictors (ridge-like) but still allow some exact zeros (lasso-like).

Part E - Wrap-up

Afternoon lab — Task 1.8

Tip

Compare approaches — summary table: method × outcome × # predictors × in-sample accuracy; one sentence on PCA+glm vs lasso.

Lab exercises · Solution · Download Rmd

Student notebooks (Day 1 focus)

Have a look at notebooks. Open one and try to render in RStudio.

End-of-day checklist

Can you explain null vs full on the same data - which underfits, which risks overfitting?
Can you contrast discrete model search (AIC) with continuous shrinkage (lasso)?
What does \lambda do? Contrast leave-one-out vs k-fold CV for choosing it.
Did you open at least one student .Rmd and run all chunks?
Does the notebook make sense to you?
Do you understand what is happening in the notebooks?

Day 1 (Monday): Foundations and Regularization

Navigation

Part A - Setup

Tools we use this week

Overview

Learning objectives

R and RStudio

RStudio layout (four panes)

Install R, then RStudio

Scripts: save your work

Old Faithful - our ggplot demo data

Load data: built-in table

First look: glimpse()

ggplot2 Grammar

Example plot - scatter

Example plot - add a layer (geom_smooth)

Notebook exercise

Bridge: from tooling to modeling

Afternoon lab — Task 1.1

Part B - Classical linear models

Classical statistics

Classical vs predictive wording

Multiple linear regression

Key assumptions

One explanatory variable: line + scatter

Fit the model

Q-Q plot of residuals

Residuals vs fitted values

Two predictors: a plane in 3D

The meaning of regression coefficients \hat{\beta}_j

The data (MASS::cement)

Standardize predictors first

One predictor at a time vs all four together

Correlations among x1-x4

Takeaway

Bridge: from toy regression to high-dimensional genes

Part C - Gene Expression (continuous response variable)

A synthetic dataset

Gene × trait: the dataset

GeneA × GeneB - ground-truth surface

Pairwise Plots

Overfitting vs underfitting

Null model (intercept only)

Full model (all genes)

Discrete complexity control: AIC stepwise search

Null vs full: in-sample fit

AIC: backward and forward selection

AIC vs full OLS: coefficients (+/- 2 SE)

Bridge: from AIC to lasso

Lasso as an alternative to discrete search

Why cross-validation?

The CV loop (in words)

Leave-one-out CV (LOO-CV)

LOO-CV animation

k-fold CV

k-fold CV schematic

Lasso: pick \lambda with LOO-CV

Lasso: CV error vs \lambda (from cv.glmnet)

Lasso coefficient paths

Summary: all approaches (in-sample, same data)

Afternoon lab — Task 1.2

Part D - Gene × disease (binary)

Another synthetic data set

Another synthetic data set (cont.)

Logistic regression

Afternoon lab — Task 1.3

Full logistic model: what it misses

Null model (intercept only)

Full logistic model (all genes)

Logistic decision boundary (Gene1 × Gene3)

Afternoon lab — Task 1.4

Null vs full: in-sample misclassification

AIC: backward and forward selection

Afternoon lab — Task 1.5

Lasso for classification

Lasso logistic: cross-validated choice of \lambda

Afternoon lab — Task 1.6

Summary: all approaches (in-sample, same data)

Afternoon lab — Task 1.7

Elastic net (bridge between ridge and lasso)

First look: `glimpse()`

Example plot - add a layer (`geom_smooth`)

The data (`MASS::cement`)

Correlations among `x1`-`x4`

Lasso: CV error vs \lambda (from `cv.glmnet`)