Chapter 1: Learning from measurements

When people first learn machine learning, they often focus on algorithms. In practice, the bigger shift is learning how to ask and answer predictive questions honestly.

This chapter introduces that habit. The goal is simple: learn from one part of your data, then test whether the same pattern survives on data the model has not seen.

We use the same two tracks as the rest of the course: simulated genes (where we know what is true) and Palmer Penguins (where we have to reason from real measurements).



What “generalization” means

A dataset has rows (birds, patients, samples) and columns (measurements). Supervised learning means predicting one column (the outcome) from the others (the predictors).

The model can always look good on the rows it trained on. The real question is whether it stays useful on new rows from the same process. That is generalization.
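
To make the gap between seen and unseen rows concrete, here is a minimal sketch in plain Python (the course itself works in R with tidymodels; Python is used here only for illustration). Everything in it is an assumption made up for the example: the simulated predictor `x`, the outcome probabilities, and the two toy "models". One model memorizes the training rows as a lookup table; the other applies a single simple rule.

```python
import random

random.seed(0)

def make_rows(n):
    """Simulate rows: one predictor x in [0, 1) and a noisy binary outcome."""
    rows = []
    for _ in range(n):
        x = random.random()
        # Assumed signal: the outcome is more likely to be 1 when x is large.
        y = 1 if random.random() < (0.2 + 0.6 * x) else 0
        rows.append((x, y))
    return rows

train = make_rows(200)  # rows the model learns from
new = make_rows(200)    # new rows from the same process

# Toy model A: memorize every training row exactly (a lookup table).
lookup = {x: y for x, y in train}
majority = round(sum(y for _, y in train) / len(train))

def predict_memorizer(x):
    # Falls back to the majority class for rows it has never seen.
    return lookup.get(x, majority)

# Toy model B: one simple rule that ignores individual rows.
def predict_rule(x):
    return 1 if x > 0.5 else 0

def accuracy(predict, rows):
    """Share of rows where the prediction matches the outcome."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

for name, model in [("memorizer", predict_memorizer), ("rule", predict_rule)]:
    print(name,
          "train:", accuracy(model, train),
          "new:", accuracy(model, new))
```

The memorizer is perfect on the rows it trained on because it simply stores them, and that score says nothing about new rows. The comparison on `new` is the one that speaks to generalization.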

For this reason, we separate the work into three stages: training, validation, and a final test. Tuesday’s workflow chapter shows this in code with tidymodels, but the principle starts here.
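
The three-way split can be sketched in a few lines of plain Python. This is not the tidymodels code from the workflow chapter; the function name `three_way_split`, the 60/20/20 proportions, and the `row_id` field are assumptions chosen for the example.

```python
import random

def three_way_split(rows, valid_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle row indices once, then cut them into three disjoint sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_test = round(len(rows) * test_frac)
    n_valid = round(len(rows) * valid_frac)

    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_idx), pick(valid_idx), pick(test_idx)

# Placeholder rows; a real dataset would carry measurements as well.
rows = [{"row_id": i} for i in range(100)]
train, valid, test = three_way_split(rows)

print(len(train), len(valid), len(test))
```

Training rows are for fitting, validation rows are for comparing choices, and test rows are set aside until the very end. Shuffling once and slicing guarantees that no row lands in two sets.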


A quick leakage check

Before trusting any metric, ask four questions:

  1. Did I split before learning preprocessing parameters?
  2. Did I tune only on training/cross-validation (CV) rows?
  3. Is repeated structure handled correctly (for example, grouped CV when needed)?
  4. Is the final score from untouched test rows?
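
Questions 1 and 3 can be illustrated with a small plain-Python sketch. The helper names (`fit_scaler`, `apply_scaler`, `grouped_folds`), the toy numbers, and the patient ids are all hypothetical and made up for this example; they are not the course's tidymodels functions.

```python
from statistics import mean, stdev

def fit_scaler(values):
    """Learn centering and scaling parameters from the given rows only."""
    return {"mean": mean(values), "sd": stdev(values)}

def apply_scaler(values, params):
    """Apply already-learned parameters; nothing is re-estimated here."""
    return [(v - params["mean"]) / params["sd"] for v in values]

# Toy predictor values, made up for illustration.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
test_x = [10.0, 12.0]

# Question 1: split first, then learn parameters from training rows only.
params = fit_scaler(train_x)
train_scaled = apply_scaler(train_x, params)
test_scaled = apply_scaler(test_x, params)

# The leaky alternative: parameters learned from all rows, so the test
# rows influence how the training rows are transformed.
leaky_params = fit_scaler(train_x + test_x)
print(params["mean"], leaky_params["mean"])

def grouped_folds(group_ids, n_folds=3):
    """Question 3: assign folds so rows sharing a group stay together."""
    unique_groups = sorted(set(group_ids))
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    return [fold_of_group[g] for g in group_ids]

# Hypothetical repeated structure: several rows per patient.
groups = ["p1", "p1", "p2", "p3", "p3", "p4"]
folds = grouped_folds(groups)
print(folds)
```

The difference between `params` and `leaky_params` is the leak: the second set of numbers already contains information about rows that were supposed to stay unseen. The fold assignment shows the same idea for repeated structure, since no patient appears in more than one fold.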

Too simple, too wiggly, or useful

Students often hear “overfitting” as jargon. A practical picture is easier: a model can be too simple to capture the signal, so flexible that it chases noise, or somewhere in between.

On Monday, the null model underfits the gene tasks, while fully flexible fits can become unstable. On Tuesday, deep trees can memorize tiny quirks in the penguin data unless they are tuned with cross-validation.

The point is not to find a perfect model. The point is to keep a defensible tradeoff between fit quality and stability.
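
The three cases can be sketched with one adjustable model. In this plain-Python example a nearest-neighbor averager stands in for the course's models (it is not the null model or the trees from the lectures), and the simulated sine-shaped signal, the noise level, and the values of `k` are assumptions chosen for illustration. With `k = 1` the model copies individual training rows; with `k` equal to the number of training rows it predicts the overall mean, which behaves like a null model.

```python
import math
import random

random.seed(1)

def simulate(n):
    """Rows with a smooth signal, sin(2*pi*x), plus Gaussian noise."""
    rows = []
    for _ in range(n):
        x = random.random()
        y = math.sin(2 * math.pi * x) + random.gauss(0, 0.5)
        rows.append((x, y))
    return rows

def knn_predict(train, x, k):
    """Average the outcomes of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, rows, k):
    """Mean squared error of k-nearest-neighbor predictions on rows."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in rows) / len(rows)

train = simulate(150)
valid = simulate(150)

# k = 1: very flexible; k = 15: in between; k = n: the overall mean.
for k in (1, 15, len(train)):
    print(f"k={k:3d}  train MSE={mse(train, train, k):.3f}  "
          f"validation MSE={mse(train, valid, k):.3f}")
```

The most flexible setting has zero training error because each training row is its own nearest neighbor, so the training column alone would always favor it. The validation column is the one to compare when choosing between settings.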


Data card habit (one page)

A lightweight data card can prevent many downstream mistakes. Write, in plain sentences:

  • where rows came from,
  • what the outcome means,
  • where missing values appear,
  • what leakage risks exist,
  • which subgroup fields you can audit.
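
If it helps to keep the card next to the analysis code, the five items above can be held in a small structure that reports which sections are still empty. This is only one possible sketch in plain Python; the class name `DataCard`, its field names, and the helper `unfilled` are hypothetical, and the example text is a placeholder rather than a real description.

```python
from dataclasses import dataclass, fields

@dataclass
class DataCard:
    """One field per item in the list above; plain sentences go in each."""
    row_source: str = ""
    outcome_meaning: str = ""
    missing_values: str = ""
    leakage_risks: str = ""
    auditable_subgroups: str = ""

def unfilled(card):
    """Return the names of sections that are still empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name).strip()]

# Placeholder text only; replace with sentences about your own data.
card = DataCard(outcome_meaning="(describe the outcome column here)")
print(unfilled(card))
```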

Use the data catalog as a reference and carry the same discipline into your Thursday model-card discussion.

Tip

Generalization is a workflow promise, not a metric trick. If you separate training from evaluation, document data limits, and state what you cannot claim, your model discussion becomes stronger even before you optimize anything.


Lecture: Day 1 (Monday) slides
Notebook: logistic-regression-gene-disease.Rmd
Next: Chapter 2 — When coefficients need restraint