.qmd / .Rmddplyr, ggplot2, …); you will see the pipe |> in notebookslm / glmpalmerpenguins before Tuesday - we use it in the penguin notebooks, not in today’s gene simulationsDetails and installs: Student notebooks
tidymodels pipelineggplot2 figurecv.glmnet) - not a final test-set lockbox todayRStudio / Positron intro for Monday slides. Packages: tidyverse (base R faithful for a quick demo).
Source / Editor - write .R scripts; run lines with shortcuts |
Environment - data frames and objects you have created |
| Console - type commands; see output and errors | Files / Plots / Packages / Help - browse files; view figures; install packages |
Figure 1
# are comments (notes for humans; R ignores them).R file and re-run from the top when you change something - that is reproducibility in miniatureThe built-in faithful table records eruption duration and waiting time between eruptions at Yellowstone (minutes). Data card
Photo: Old Faithful geyser, Yellowstone (Wikimedia Commons, CC BY-SA 4.0).
Any rectangular table works. Here we use faithful - no extra package required:
Also common in projects: a file in your project folder, read with read_csv() from readr (part of tidyverse):
Paths are relative to the project root when you use an RStudio Project.
glimpse()Data and aesthetic mappings (aes: x, y, colour, …) +
geometric layers (geom_point, geom_smooth, …) +
labs() +
theme_*(),
Combine different steps with + at end of line, new command can start in next line.
geom_smooth)Each + adds another layer or adjustment:
ggplot(faithful_df, aes(x = eruptions, y = waiting)) +
geom_point(alpha = 0.6, size = 1.8, color = "gray40") +
geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 0.8) +
labs(
title = "Same data with a linear smooth",
x = "Eruption duration (min)",
y = "Waiting time (min)"
) +
theme_minimal()Open any notebook from the notebook index and try Render in RStudio.
You now have the R/tidyverse toolkit for loading data and plotting. Part B switches to classical linear models: what we estimate, how we interpret coefficients, and when plain OLS struggles.
Tip
Load and prepare the data — merge metadata and OTU table on sample_id; transpose to samples × OTUs; prevalence filter; inspect Label, Sex, and Individual.
Data are noisy; models are simplifying mathematical statements about patterns, not automatic truths about biology
We estimate unknown quantities and (when the design supports it) quantify uncertainty (standard errors, intervals)
Association \neq causation - a significant coefficient is still conditional on everything else in the model and on what was actually measured
Classical question: Given a regression model, what do we learn about the slope \beta_j, and how fragile is it?
Answer is usually an estimate plus a confidence interval.
Prediction question: If we freeze this fitted equation, how well does it score on new data points?
Answer is usually a model and its performance on new unseen data.
Today: all models are fit and compared on the same simulated samples; Tuesday introduces held-out evaluation (i.e., on new, unseen data)
additivity in the chosen predictor columns
Each vertical “slice” is a Normal density centered on the fitted line - spread of y at a given x.
Fitting is done by minimizing sum of squared residuals (oridnary least squares, OLS).
Figure 2
Checks whether residuals look roughly Normal (one assumption behind classical intervals).
Look for pattern (curve, funnel shape) - not just random scatter around zero.
Multiple regression with two predictors is a plane in (x_1, x_2, y) space (more predictors → hyperplane).
MASS::cement)| x1 | x2 | x3 | x4 | y |
|---|---|---|---|---|
| 7 | 26 | 6 | 60 | 78.5 |
| 1 | 29 | 15 | 52 | 74.3 |
| 11 | 56 | 8 | 20 | 104.3 |
| 11 | 31 | 8 | 47 | 87.6 |
| 7 | 52 | 6 | 33 | 95.9 |
y: heat evolved (calories per gram cement) - left on the original scalex1-x4: oxide composition - we standardize them (scale()) so slopes are comparable and units do not dominate the fit| predictor | simple_estimate | simple_p | full_estimate | full_p | simple_sig | full_sig |
|---|---|---|---|---|---|---|
| x1 | 10.992711 | 0.004552 | 9.124198 | 0.070822 | yes | no |
| x2 | 12.279477 | 0.000665 | 7.938657 | 0.500901 | yes | no |
| x3 | -8.043437 | 0.059762 | 0.652743 | 0.895923 | no | no |
| x4 | -12.355485 | 0.000576 | -2.411319 | 0.844071 | yes | no |
x1-x4Strong negative pairs: x2-x4 (r \approx -0.97), x1-x3 (r \approx -0.82). x3-x4 are nearly uncorrelated (r \approx 0.03); x1-x4 weakly negative (r \approx -0.25).
Multiple linear regression is not the same as multiple simple linear regressions
Signs and p-values can flip when you add or drop correlated columns
Which is the “correct” model?
Later today: lasso and AIC help navigate this kind of data
Part C keeps a continuous trait but adds many correlated predictors — the setting where AIC stepwise and lasso matter.
trait = -0.5 * GeneA + 0.8 * GeneB + N(0, 0.3)Distribution of explanatory variables (gene expression levels) and response variable (quantitative trait)
Figure 3
Underfitting: model too simple - high error even on data used to fit (e.g. intercept-only, ignores genes)
Overfitting risk: model very flexible - can fit noise; unstable coefficients when predictors overlap
Null model (trait ~ 1): one parameter, ignores all genes → underfitting
Full model (trait ~ .): all genes → lower training error, but many correlated predictors → shaky \hat{\beta}_j
Goal: find a complexity level that balances fit and stability
Too simple: ignores genes → underfitting.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.0352 | 0.0742 | 0.4747 | 0.6357 |
Single intercept; no gene effects in the model.
Every gene in the model - flexible, but many correlated predictors → unstable coefficients (overfitting risk).
AIC trades log-likelihood against model size: \mathrm{AIC} = 2k - 2\log L. Lower is better on the same data.
Backward starts at the full model and drops terms if AIC improves.
Forward starts at the intercept-only model and adds terms greedily up to the full model.
Backward AIC formula: trait ~ GeneA + GeneB + GeneC + GeneD
Forward AIC formula: trait ~ GeneB + GeneA + GeneD + GeneC
Dashed lines: known effects for GeneB (0.8) and GeneA (-0.5).
AIC searches over discrete subsets of predictors. Lasso keeps one smooth penalty knob (\lambda) and often sets many coefficients to exactly zero.
cv.glmnet and the CV error curvefit(model, train) on training rows only (for lasso: at a fixed \lambda).predict(model, validation) on held-out rows; score error (MSE or deviance).cv.glmnet).cv.glmnet(..., nfold = n).Animation: LOOCV.gif by MBanuelos22, Wikimedia Commons, CC BY-SA 4.0.
vfold_cv() in tidymodels automates this for trees and other models.Figure: K-fold cross validation EN.svg by Gufosowa, Wikimedia Commons, CC BY-SA 4.0.
cv.glmnet averages LOO prediction error.lambda.min: lowest CV error; lambda.1se: simpler model within one standard error of the minimum (our default).cv.glmnet — same idea, different k. Archived write-up: manual k-fold CV.cv.glmnet)In-sample comparison on all n samples.
| Model | n_coef | AIC | RMSE_in | R2_in |
|---|---|---|---|---|
| Null (intercept only) | 1 | 399.889 | 0.905 | 0.000 |
| OLS full | 11 | 82.346 | 0.294 | 0.895 |
| AIC backward | 5 | 73.024 | 0.297 | 0.893 |
| AIC forward | 5 | 73.024 | 0.297 | 0.893 |
| Lasso (lambda.1se, LOO-CV) | 3 | NA | 0.314 | 0.890 |
Tip
Transform counts — apply log1p to the OTU matrix; optionally plot library size by Label.
Causal Genes:
Response variable generated based on a linear combination of Gene1, Gene3, and Gene4 with some added noise:
We can predict binary outcomes using logistic regression
Notebook: logistic-regression-gene-disease.Rmd
Tip
Explore with PCA — prcomp on scaled log1p OTUs; plot PC1 vs PC2 coloured by Label.
Predicts the overall disease rate - ignores role of genes → underfitting.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.2819 | 0.1428 | 1.9734 | 0.0485 |
Single intercept; no gene effects in the model.
Smooth P(disease) regions - compare to axis-aligned tree partitions on Tuesday (white line = P = 0.5).
Tip
Logistic regression on PCs — Early vs Late — glm(Label ~ PC1 + …, family = binomial) on a few principal components (not raw OTUs).
Backward AIC formula: disease_presence ~ Gene1 + Gene3 + Gene4 + Gene8
Forward AIC formula: disease_presence ~ Gene3 + Gene4 + Gene8 + Gene1
Tip
Stepwise AIC — Early vs Late — MASS::stepAIC on the PC logistic model; compare to the full PC formula.
glmnet(..., family = "binomial", nfold = n)glm, AIC models, and lasso in one in-sample summary tableTip
Lasso — Early vs Late — glmnet (binomial) on all log1p OTUs; compare sparse OTUs to the AIC model (no PCA needed for lasso).
| Model | n_coef | AIC | Accuracy_in | Error_rate_in |
|---|---|---|---|---|
| Null (intercept only) | 1 | 275.326 | 0.570 | 0.430 |
| GLM full | 11 | 249.190 | 0.740 | 0.260 |
| AIC backward | 5 | 241.608 | 0.750 | 0.250 |
| AIC forward | 5 | 241.608 | 0.750 | 0.250 |
| Lasso (lambda.1se, LOO-CV) | 3 | NA | 0.675 | 0.325 |
Tip
Repeat for Sex — run tasks 1.4–1.6 with outcome Sex (F vs M).
glmnet uses mixing parameter \alpha \in [0, 1]: \alpha = 0 is pure ridge, \alpha = 1 is pure lasso, values in between blend L_1 and L_2 penalties.Tip
Compare approaches — summary table: method × outcome × # predictors × in-sample accuracy; one sentence on PCA+glm vs lasso.
Have a look at notebooks. Open one and try to render in RStudio.
.Rmd and run all chunks?