Datasets and data cards

Each dataset used in the course has a data card with the same six sections that students draft for the microbiome data card: source, outcome, predictors, sample size, leakage risks, and subgroups. Use these pages as instructor references or as starting points for your own one-pager.

Adding a dataset later

Add an entry to datasets.yml.
Create cards/your-dataset.qmd with the six sections listed in the template below; any existing card can serve as a model.
Add a row to the catalog table on this page.
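Before publishing a new card, it can help to confirm that all six sections are present. The following is a minimal sketch, not part of the course tooling: the function name missing_sections is hypothetical, and it assumes each section name starts a line in the card, optionally behind Markdown heading markers.

```python
# The six sections every data card is expected to contain.
SECTIONS = [
    "Source",
    "Outcome",
    "Predictors",
    "Sample size",
    "Leakage risks",
    "Subgroups",
]


def missing_sections(card_text):
    """Return the section names that no line of the card starts with.

    Assumption: a section is 'present' if some line begins with its name,
    ignoring case and any leading '#' heading markers or spaces.
    """
    starts = [line.lstrip("# ").lower() for line in card_text.splitlines()]
    return [
        name
        for name in SECTIONS
        if not any(line.startswith(name.lower()) for line in starts)
    ]


# Hypothetical draft card with two sections still to be written.
draft = """## Source
TODO
## Outcome
TODO
## Predictors
TODO
## Sample size
TODO
"""

print(missing_sections(draft))
```

An empty list means the card covers every section; anything else names what is still missing.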

Catalog

Dataset	Kind	Card	Used for
Old Faithful eruptions	Real (base R)	Open card	Day 1 ggplot intro
Hald cement	Real (classic)	Open card	Day 1 correlated predictors
Synthetic gene x trait	Simulated	Open card	Day 1-2 regression, LASSO, trees
Synthetic gene x disease	Simulated	Open card	Day 1-2 logistic, trees
Palmer Penguins	Real	Open card	Notebooks, modules, Day 2-4 slides
Mouse 16S microbiome	Real	Open card	Day 2 and Day 4 afternoon labs

Real data

Palmer Penguins

Our main real-data example, used all week for regression, classification, pipelines, and metrics.

Open data card

Simulated

Gene simulations

We wrote the generating rules ourselves, so we know which genes should matter in both the trait and disease tracks.

Trait card · Disease card

Labs

Microbiome (mouse 16S)

Each mouse contributes repeated measures, so grouped CV is required: all samples from one mouse must stay in the same fold.
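To make the grouped CV requirement concrete, here is a minimal standard-library sketch of a grouped split. The helper grouped_folds and the mouse IDs are made up for illustration; this is not the lab code. It assigns whole mice to folds so that no mouse appears in both training and test rows.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Yield (train, test) index lists with every group kept in one fold."""
    # Collect the row indices belonging to each group (here, each mouse).
    rows_by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        rows_by_group[group].append(idx)
    if n_folds > len(rows_by_group):
        raise ValueError("n_folds cannot exceed the number of groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_folds)]
    for group in sorted(rows_by_group, key=lambda g: -len(rows_by_group[g])):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(rows_by_group[group])

    # Each fold serves as the test set once; the rest form the training set.
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j in range(n_folds) if j != k for i in folds[j])
        yield train, test


# Hypothetical sample-to-mouse mapping: one entry per sample row.
mouse_ids = ["m1", "m1", "m1", "m2", "m2", "m3", "m3", "m4"]
splits = list(grouped_folds(mouse_ids, n_folds=2))

for train, test in splits:
    train_mice = {mouse_ids[i] for i in train}
    test_mice = {mouse_ids[i] for i in test}
    print(sorted(test_mice), "held out; shared mice:", train_mice & test_mice)
```

In a scikit-learn workflow, GroupKFold from sklearn.model_selection plays the same role when the mouse ID is passed as the groups argument.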

Open data card

Data card template (student version)

Source - where did rows come from?
Outcome - what are you predicting? units / levels?
Predictors - what measurements? how many?
Sample size - n; any missing values?
Leakage risks - repeated measures? normalization on full table? selection bias?
Subgroups - sex, batch, island, individual ID - for later fairness checks
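The "normalization on full table" risk in the template can be illustrated with a small sketch. The numbers and helper names below are made up for illustration: the point is that centering and scaling constants should be estimated from training rows only and then reused on held-out rows.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Estimate the centering and scaling constants from the given rows."""
    return mean(values), stdev(values)


def apply_scaler(values, center, scale):
    """Standardize values with constants estimated elsewhere."""
    return [(v - center) / scale for v in values]


train = [2.0, 4.0, 6.0]  # made-up training measurements
test = [10.0]            # made-up held-out measurement

# Leakage-safe: constants come from the training rows only.
center, scale = fit_scaler(train)
train_z = apply_scaler(train, center, scale)
test_z = apply_scaler(test, center, scale)

# Leaky: constants computed on the full table let the held-out row
# influence how the training rows are transformed.
leaky_center, leaky_scale = fit_scaler(train + test)

print("train-only constants:", center, scale)
print("full-table constants:", leaky_center, leaky_scale)
```

The two sets of constants differ, which is exactly the information a held-out row should not be allowed to contribute.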

Example student answer: Day 1 gene simulation data card (archived).