Student notebooks
Run these documents in RStudio or Positron (or render with Quarto). The site uses two parallel tracks: simulated genes (we designed the data so we know the truth) and Palmer Penguins (one real dataset all week). See the data catalog and Palmer Penguins data card for context.
- These .Rmd files are long labs: lots of chunks, plots, and “change this and re-run.”
- The reading companion gives prose-first chapter explanations; Chapters 4, 7, and 8 include the canonical tidymodels pipeline path.
Notebook inventory
Tip: use Render in the IDE, or from the project root run quarto render notebooks/your-file.Rmd.
Synthetic genes (we know the answers)
Ideal for LASSO, ridge, AIC, and logistic regression when we know which simulated features should matter.
- linear-regression-lasso.Rmd — gene–trait simulation, plots, penalized regression (glmnet package), forward/backward model choice by AIC (MASS::stepAIC in the code).
- logistic-regression-gene-disease.Rmd — simulated gene expression and disease status, glm logistic regression, rpart tree, confusion table, decision-boundary plot.
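The "we know the answers" idea can be sketched in a few lines. This is a minimal illustration, not the notebook's actual simulation: the sample size, the number of genes, the effect sizes, and the gene1, gene2, ... names are all assumptions made for the example. It simulates a trait that depends on three genes only, then checks which genes LASSO keeps.

```r
library(glmnet)

set.seed(1)
n <- 200  # number of samples (assumed for illustration)
p <- 50   # number of simulated genes (assumed for illustration)

# Expression matrix with hypothetical gene names
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Ground truth: only the first three genes affect the trait
true_beta <- c(2, -1.5, 1, rep(0, p - 3))
y <- as.numeric(x %*% true_beta + rnorm(n))

# LASSO (alpha = 1) with the penalty chosen by cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Genes with non-zero coefficients at the more conservative penalty
coefs <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
print(selected)
```

Because the truth is known, the selected genes can be compared directly with gene1 to gene3. Setting alpha = 0 in the same call gives ridge, which shrinks coefficients but does not set them to zero.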
Palmer Penguins (continuity)
Same real table as on the Palmer Penguins data card. Models: lm / glm / rpart / glmnet. Splits, scaling, and metrics: tidymodels (rsample, recipes, yardstick).
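As a rough sketch of how those pieces fit together (not the code from any one notebook), the example below uses the female vs male task: rsample makes the split, recipes does the scaling, base glm fits the model, and yardstick computes the metrics. The seed, the 75/25 split, the 0.5 threshold, and the object names are assumptions chosen for illustration.

```r
library(palmerpenguins)
library(dplyr)
library(rsample)
library(recipes)
library(yardstick)

set.seed(42)

# Size measures only; species is left out, rows with missing values dropped
size_data <- penguins %>%
  select(sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  tidyr::drop_na()

# Stratified train/test split (rsample)
peng_split <- initial_split(size_data, prop = 0.75, strata = sex)
train <- training(peng_split)
test  <- testing(peng_split)

# Scaling learned on the training set only (recipes)
rec <- recipe(sex ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()
train_scaled <- bake(rec, new_data = NULL)
test_scaled  <- bake(rec, new_data = test)

# Logistic regression; glm models the second factor level ("male")
glm_fit <- glm(sex ~ ., data = train_scaled, family = binomial)

results <- test_scaled %>%
  mutate(
    prob_male = predict(glm_fit, newdata = test_scaled, type = "response"),
    pred = factor(ifelse(prob_male > 0.5, "male", "female"),
                  levels = levels(sex))
  )

# Metrics on the held-out rows (yardstick)
acc <- accuracy(results, truth = sex, estimate = pred)
print(acc)
print(conf_mat(results, truth = sex, estimate = pred))
```

Preparing the recipe on the training rows and then baking the test rows keeps the test set out of the scaling step.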
For the Day 2 (Tuesday) tidymodels pipeline (recipe + workflow + cross-validation / tuning on the same Adelie vs Gentoo task), use Chapter 4 (starts with train/test + glm) and Chapter 4 — canonical pipeline (shared tuned-tree walk-through with Day 2 slides), then Chapter 7 (accuracy vs kappa vs AUC while tuning) and Chapter 8 (RF vs XGBoost vs small neural net).
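For orientation, a recipe + workflow + cross-validation pipeline on the Adelie vs Gentoo task might look roughly like the sketch below. It is not the chapters' canonical code: the predictors, the five folds, the untuned rpart tree, and the object names are assumptions made for the example.

```r
library(tidymodels)
library(palmerpenguins)

set.seed(123)

# Two-species subset with complete size measurements
two_species <- penguins %>%
  filter(species %in% c("Adelie", "Gentoo")) %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  tidyr::drop_na() %>%
  mutate(species = droplevels(species))

# Stratified 5-fold cross-validation
folds <- vfold_cv(two_species, v = 5, strata = species)

# Recipe: preprocessing is re-estimated inside each fold
rec <- recipe(species ~ ., data = two_species) %>%
  step_normalize(all_numeric_predictors())

# Model specification: a classification tree via rpart
tree_spec <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Workflow bundles the recipe and the model
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tree_spec)

# Resampled estimates of accuracy, kappa, and ROC AUC
cv_res <- fit_resamples(
  wf,
  resamples = folds,
  metrics = metric_set(accuracy, kap, roc_auc)
)
cv_metrics <- collect_metrics(cv_res)
print(cv_metrics)
```

Tuning follows the same shape: mark a tree parameter with tune() and replace fit_resamples() with tune_grid().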
- penguins-body-mass.Rmd — plots and correlations, then ridge/LASSO to predict body mass (code uses glmnet like the gene notebook).
- penguins-canonical-pipeline.Rmd — minimal tidymodels pipeline: one recipe, three models (tree / logistic / random forest), three CV metrics, one comparison figure.
- penguins-classification.Rmd — Adelie vs Gentoo: logistic regression and a tree (rpart), who-was-misclassified table, simple 2D decision picture.
- penguins-sex-classification.Rmd — female vs male from size measures (species left out on purpose), train/test split, same two model types.
- penguins-species-multiclass.Rmd — three species: multinomial logistic regression (nnet::multinom) and a multiclass tree, confusion tables and a heatmap.
- penguins-multiclass-upsampling.Rmd — rare Chinstrap + one tree model: compare confusion matrices with and without step_upsample().
- penguins-shap.Rmd — SHAP notebook: beeswarm with individual points, mean |SHAP| ranking, and waterfall plots for individual penguins.
- penguins-pca.Rmd — PCA notebook: PCA biplot/loadings and compact with-vs-without PCA model comparison.
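To make the step_upsample() entry above more concrete, here is a minimal sketch of what the step does to class counts. It uses the full penguins table, where Chinstrap is the smallest class; the upsampling notebook may set up its data differently, and over_ratio = 1 and the seed are assumptions for the example.

```r
library(palmerpenguins)
library(dplyr)
library(recipes)
library(themis)

peng <- penguins %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  tidyr::drop_na()

# Class counts before upsampling
print(count(peng, species))

set.seed(7)

# Resample minority classes (with replacement) up to the majority class size
rec_up <- recipe(species ~ ., data = peng) %>%
  step_upsample(species, over_ratio = 1)

# bake(new_data = NULL) returns the processed training data
balanced <- rec_up %>%
  prep() %>%
  bake(new_data = NULL)

# Class counts after upsampling
print(count(balanced, species))
```

By default step_upsample() is skipped when the recipe is applied to new data, so only the training rows are rebalanced and the test set keeps its original class mix.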
Requirements
Install as needed (copy–paste into R): palmerpenguins, tidyr, nnet, glmnet, tidymodels, dplyr, GGally, ggplot2, knitr, MASS, rpart, rpart.plot, ranger, xgboost, themis (Day 4 Thursday slides), rcartocolor (optional colours in the gene linear notebook).
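One possible way to install only what is missing is sketched below. The package names are copied from the list above; the helper names pkgs and missing_pkgs are just illustrative, and the install step runs only in an interactive session so that rendering a document never triggers downloads.

```r
# Packages named in the requirements list above
pkgs <- c(
  "palmerpenguins", "tidyr", "nnet", "glmnet", "tidymodels", "dplyr",
  "GGally", "ggplot2", "knitr", "MASS", "rpart", "rpart.plot",
  "ranger", "xgboost", "themis", "rcartocolor"
)

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All listed packages are already installed.")
} else if (interactive()) {
  # Install from the configured CRAN mirror
  install.packages(missing_pkgs)
} else {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```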