Datasets and data cards

Each dataset used in the course has a data card with the same six sections that students draft for the microbiome data card: source, outcome, predictors, sample size, leakage risks, and subgroups. Use these pages as instructor references or as starting points for your own one-pager.

Note: Adding a dataset later
  1. Add an entry to datasets.yml.
  2. Create cards/your-dataset.qmd using the six sections in the template below; any existing card can serve as a model.
  3. Add a row to the catalog table on this page.

Catalog

| Dataset                  | Kind           | Card      | Used for                            |
|--------------------------|----------------|-----------|-------------------------------------|
| Old Faithful eruptions   | Real (base R)  | Open card | Day 1 ggplot intro                  |
| Hald cement              | Real (classic) | Open card | Day 1 correlated predictors         |
| Synthetic gene x trait   | Simulated      | Open card | Day 1-2 regression, LASSO, trees    |
| Synthetic gene x disease | Simulated      | Open card | Day 1-2 logistic, trees             |
| Palmer Penguins          | Real           | Open card | Notebooks, modules, Day 2-4 slides  |
| Mouse 16S microbiome     | Real           | Open card | Day 2 and Day 4 afternoon labs      |

Real data

Palmer Penguins

The main real-data running example all week: regression, classification, pipelines, and metrics.

Open data card

Simulated

Gene simulations

We wrote the data-generating rules ourselves, so we know which genes should matter (trait and disease tracks).

Trait card · Disease card

Labs

Microbiome (mouse 16S)

Repeated measures per mouse; grouped CV is required.

Open data card
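
To make the grouped CV requirement concrete, here is a minimal sketch of a grouped k-fold split that keeps every sample from one mouse on the same side of the split. It is a hand-rolled illustration using only the standard library, not the course's lab code; the function name `grouped_kfold` and the mouse IDs are hypothetical. Libraries such as scikit-learn offer the same idea as `GroupKFold`.

```python
def grouped_kfold(groups, n_folds):
    """Yield (train_idx, test_idx) pairs in which no group is split across both."""
    unique = sorted(set(groups))
    if n_folds > len(unique):
        raise ValueError("need at least as many groups as folds")
    # Assign whole groups (not individual rows) to folds, round-robin.
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    for k in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test


# Hypothetical IDs: four mice, three repeated samples each.
mouse_ids = ["m1"] * 3 + ["m2"] * 3 + ["m3"] * 3 + ["m4"] * 3

splits = list(grouped_kfold(mouse_ids, n_folds=2))
for train, test in splits:
    train_mice = sorted({mouse_ids[i] for i in train})
    test_mice = sorted({mouse_ids[i] for i in test})
    print("train mice:", train_mice, "| held-out mice:", test_mice)
```

With these IDs, the first fold holds out mice m1 and m3 and the second holds out m2 and m4. A row-level split would instead put samples from the same mouse in both training and test sets, which is the leakage that grouping prevents.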

Data card template (student version)

  1. Source - where did rows come from?
  2. Outcome - what are you predicting? units / levels?
  3. Predictors - what measurements? how many?
  4. Sample size - n; any missing values?
  5. Leakage risks - repeated measures? normalization on full table? selection bias?
  6. Subgroups - sex, batch, island, individual ID - for later fairness checks

Example student answer: Day 1 gene simulation data card (archived).
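
As an illustration of the "normalization on full table" leakage risk in item 5 of the template, the sketch below compares scaling parameters estimated from the training rows only with parameters estimated from all rows. The numbers and the helper names `fit_scaler` and `apply_scaler` are made up for this example; it is a toy under those assumptions, not a course pipeline.

```python
from statistics import mean, stdev


def fit_scaler(values):
    """Estimate centering and scaling parameters from the given rows only."""
    return mean(values), stdev(values)


def apply_scaler(values, center, scale):
    """Standardize values with parameters estimated elsewhere."""
    return [(v - center) / scale for v in values]


# Made-up measurements: three training rows and one held-out row.
train = [1.0, 2.0, 3.0]
test = [10.0]

# Leakage-free: parameters come from the training rows alone.
center, scale = fit_scaler(train)
train_z = apply_scaler(train, center, scale)
test_z = apply_scaler(test, center, scale)

# Leaky: parameters computed on the full table, so the held-out row
# has already influenced how the training data are scaled.
leaky_center, leaky_scale = fit_scaler(train + test)

print("train-only center/scale:", center, scale)
print("full-table center/scale:", leaky_center, round(leaky_scale, 3))
```

The two sets of parameters differ because the held-out row shifts the mean and inflates the spread. When a card answers "yes" to this question, the usual remedy is to fit any normalization inside each training fold and only apply it to the held-out rows.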