Datasets and data cards
Each dataset used in the course has a data card with the same six sections that students draft for the microbiome data card (source, outcome, predictors, sample size, leakage risks, subgroups). Use these pages as instructor references or as starting points for your own one-pager.
Note: Adding a dataset later
- Add an entry to `datasets.yml`.
- Create `cards/your-dataset.qmd` using the six sections below; any other card can serve as a model.
- Add a row to the catalog table on this page.
Catalog
| Dataset | Kind | Card | Used for |
|---|---|---|---|
| Old Faithful eruptions | Real (base R) | Open card | Day 1 ggplot intro |
| Hald cement | Real (classic) | Open card | Day 1 correlated predictors |
| Synthetic gene x trait | Simulated | Open card | Day 1-2 regression, LASSO, trees |
| Synthetic gene x disease | Simulated | Open card | Day 1-2 logistic, trees |
| Palmer Penguins | Real | Open card | Notebooks, modules, Day 2-4 slides |
| Mouse 16S microbiome | Real | Open card | Day 2 and Day 4 afternoon labs |
Real data
Palmer Penguins
The main real-data running example for the whole week: regression, classification, pipelines, metrics.
Simulated
Gene simulations
We wrote the simulation rules ourselves, so we know which genes should matter (trait and disease tracks).
Labs
Microbiome (mouse 16S)
Repeated measures per mouse; grouped CV is required.
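Why grouping matters: if samples from one mouse land in both the training and the test fold, the model is scored on an animal it has already seen. The sketch below shows one way to split by group, using only the Python standard library. The function name `grouped_kfold` and the mouse IDs are hypothetical and are not taken from the course materials or the real dataset.

```python
def grouped_kfold(groups, n_folds):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides."""
    unique = sorted(set(groups))
    if n_folds > len(unique):
        raise ValueError("n_folds cannot exceed the number of groups")
    # Assign whole groups (here: mice) to folds in round-robin order.
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    for fold in range(n_folds):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        yield train_idx, test_idx


# Hypothetical layout: 4 mice with 3 samples each (not the course data).
mouse_ids = ["m1"] * 3 + ["m2"] * 3 + ["m3"] * 3 + ["m4"] * 3

for train_idx, test_idx in grouped_kfold(mouse_ids, n_folds=2):
    train_mice = {mouse_ids[i] for i in train_idx}
    test_mice = {mouse_ids[i] for i in test_idx}
    # Every mouse is either fully in training or fully in test.
    assert train_mice.isdisjoint(test_mice)
    print("test mice:", sorted(test_mice))
```

In practice a library splitter such as scikit-learn's `GroupKFold` does the same job; the hand-rolled version is here only to make the rule visible.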
Data card template (student version)
- Source - where did rows come from?
- Outcome - what are you predicting? units / levels?
- Predictors - what measurements? how many?
- Sample size - n; any missing values?
- Leakage risks - repeated measures? normalization on full table? selection bias?
- Subgroups - sex, batch, island, individual ID - for later fairness checks
Example student answer: Day 1 gene simulation data card (archived).
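The "normalization on full table" question under leakage risks can be made concrete with a small sketch. It assumes a single numeric feature and made-up values. The helper names `fit_scaler` and `apply_scaler` are hypothetical. The point is only that the centering and scaling statistics come from the training rows and are then reused, unchanged, on the test rows.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Return the (center, scale) estimated from the given rows only."""
    return mean(values), stdev(values)


def apply_scaler(values, center, scale):
    """Standardize values with previously fitted statistics."""
    return [(v - center) / scale for v in values]


# Made-up feature values for illustration (not course data).
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

# Leakage-safe: fit on training rows, then apply to both splits.
center, scale = fit_scaler(train)
train_z = apply_scaler(train, center, scale)
test_z = apply_scaler(test, center, scale)

# Leaky variant for comparison: statistics computed on the full table.
leaky_center, leaky_scale = fit_scaler(train + test)

print("train-only center:", center)
print("full-table center:", leaky_center)
```

The two centers differ because the full-table version has already looked at the test rows, which is exactly what the data card question is meant to flag.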