Student notebooks

Run these documents in RStudio or Positron (or render with Quarto). The site uses two parallel tracks: simulated genes (we designed the data so we know the truth) and Palmer Penguins (one real dataset all week). See the data catalog and Palmer Penguins data card for context.

Tip: use Render in the IDE, or from the project root run quarto render notebooks/your-file.Rmd.

Synthetic genes (we know the answers)

Ideal for practicing LASSO, ridge, AIC-based selection, and logistic regression, because we know which simulated features should matter.

  • linear-regression-lasso.Rmd — gene–trait simulation, plots, penalized regression (glmnet package), forward/backward model choice by AIC (MASS::stepAIC in the code).
  • logistic-regression-gene-disease.Rmd — simulated gene expression and disease status, glm logistic regression, rpart tree, confusion table, decision-boundary plot.
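
The idea behind this track (design the data so the true features are known, then check whether a method recovers them) can be sketched in a few lines. This is a minimal illustration, not the code from the notebooks: the sample size, number of genes, effect sizes, and seed are all assumptions chosen for the example, and it uses only backward selection with MASS::stepAIC so that it runs without glmnet.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible
n <- 200      # hypothetical number of samples
p <- 10       # hypothetical number of simulated genes

# Simulated expression values in columns gene1 ... gene10
genes <- as.data.frame(matrix(rnorm(n * p), nrow = n))
names(genes) <- paste0("gene", seq_len(p))

# The designed "truth": only gene1 and gene2 affect the trait
genes$trait <- 2 * genes$gene1 - 1.5 * genes$gene2 + rnorm(n)

# Backward selection by AIC, starting from the model with every gene
full_fit <- lm(trait ~ ., data = genes)
aic_fit <- MASS::stepAIC(full_fit, direction = "backward", trace = FALSE)

# Names of the genes that survive selection
selected <- setdiff(names(coef(aic_fit)), "(Intercept)")
print(selected)
```

Because the truth is known, the result can be checked directly: gene1 and gene2 should be kept, while AIC may still retain one or two noise genes, which is itself a useful point for discussion.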

Palmer Penguins (continuity)

Same real table as on the Palmer Penguins data card. Models: lm / glm / rpart / glmnet. Splits, scaling, and metrics: tidymodels (rsample, recipes, yardstick).

For the Day 2 (Tuesday) tidymodels pipeline (recipe + workflow + cross-validation / tuning on the same Adelie vs Gentoo task), start with Chapter 4 (begins with train/test + glm) and Chapter 4 — canonical pipeline (the tuned-tree walk-through shared with the Day 2 slides). Then continue with Chapter 7 (accuracy vs kappa vs AUC while tuning) and Chapter 8 (RF vs XGBoost vs small neural net).

  • penguins-body-mass.Rmd — plots and correlations, then ridge/LASSO to predict body mass (code uses glmnet like the gene notebook).
  • penguins-canonical-pipeline.Rmd — minimal tidymodels pipeline: one recipe, three models (tree / logistic / random forest), three CV metrics, one comparison figure.
  • penguins-classification.Rmd — Adelie vs Gentoo: logistic regression and a tree (rpart), who-was-misclassified table, simple 2D decision picture.
  • penguins-sex-classification.Rmd — female vs male from size measures (species left out on purpose), train/test split, same two model types.
  • penguins-species-multiclass.Rmd — three species: multinomial logistic regression (nnet::multinom) and a multiclass tree, confusion tables and a heatmap.
  • penguins-multiclass-upsampling.Rmd — rare Chinstrap + one tree model: compare confusion matrices with and without step_upsample().
  • penguins-shap.Rmd — SHAP notebook: beeswarm with individual points, mean |SHAP| ranking, and waterfall plots for individual penguins.
  • penguins-pca.Rmd — PCA notebook: PCA biplot and loadings, plus a compact model comparison with and without PCA.

Requirements

Install any of these packages you do not already have (for example with install.packages() in the R console): palmerpenguins, tidyr, nnet, glmnet, tidymodels, dplyr, GGally, ggplot2, knitr, MASS, rpart, rpart.plot, ranger, xgboost, themis (provides step_upsample(); Day 4 Thursday slides), rcartocolor (optional colours in the gene linear notebook).
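
One way to install only what is missing is sketched below. The package names are exactly those listed above; the interactive() guard is an assumption of this sketch, added so that rendering a notebook non-interactively never triggers downloads.

```r
# Packages named in the requirements list above
pkgs <- c(
  "palmerpenguins", "tidyr", "nnet", "glmnet", "tidymodels", "dplyr",
  "GGally", "ggplot2", "knitr", "MASS", "rpart", "rpart.plot",
  "ranger", "xgboost", "themis",
  "rcartocolor"  # optional: colours in the gene linear notebook
)

# Keep only the packages that are not installed yet
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Install from the console only; skipped when run non-interactively
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Running it a second time should find nothing left to install, so it is safe to paste at the start of any session.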