Data card — Hald cement

How to load

hald <- MASS::cement   # 13 rows; columns x1, x2, x3, x4, y (the data ship with MASS, not datasets)
# Slides often use z-scored predictors:
hald_scaled <- hald
hald_scaled[, c("x1", "x2", "x3", "x4")] <- scale(hald[, c("x1", "x2", "x3", "x4")])

Where we use this

Day 1 (Monday) slides — correlated predictors, comparing one-predictor vs full lm models
Module 02 — Regularization (collinearity motivation)
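The one-predictor vs full-model comparison used on the Day 1 slides can be sketched roughly as follows. This is a minimal illustration, not the slide code itself: it assumes the data are read from MASS::cement, and the object names (single_fits, full_fit, slopes, r2) are made up for this example.

```r
# Assumption: the Hald data are taken from the MASS package.
hald <- MASS::cement
predictors <- c("x1", "x2", "x3", "x4")

# One-predictor models: regress y on each predictor separately.
single_fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "y"), data = hald)
})
names(single_fits) <- predictors

# Full model: all four predictors at once.
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = hald)

# Slope of each predictor when fitted alone vs. inside the full model.
slopes <- data.frame(
  alone = sapply(predictors, function(p) coef(single_fits[[p]])[[p]]),
  full  = unname(coef(full_fit)[predictors])
)
rownames(slopes) <- predictors
print(slopes)

# R-squared of each one-predictor model and of the full model.
r2 <- c(
  sapply(single_fits, function(f) summary(f)$r.squared),
  full = summary(full_fit)$r.squared
)
print(round(r2, 3))
```

Comparing the two columns of slopes shows how much a coefficient can move once the other correlated predictors enter the model.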

Caveats

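Tiny n (13): coefficients and R² are unstable; use for ideas, not inference.
Classic teaching set: the predictors are highly correlated, so individual coefficients in the full OLS fit can change size or sign under small changes to the data, even when the overall fit looks good.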
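To see how correlated the predictors are, one option is to inspect the correlation matrix and the variance inflation factors (VIFs). The sketch below assumes the data come from MASS::cement and computes VIFs by hand as the diagonal of the inverse correlation matrix, so no extra package is needed. The names X, cor_mat, and vif are illustrative.

```r
# Assumption: the Hald data are taken from the MASS package.
hald <- MASS::cement
X <- hald[, c("x1", "x2", "x3", "x4")]

# Pairwise Pearson correlations among the four predictors.
cor_mat <- cor(X)
print(round(cor_mat, 2))

# Variance inflation factors: the diagonal of the inverse correlation matrix.
# This equals 1 / (1 - R2_j), where R2_j comes from regressing predictor j
# on the remaining predictors.
vif <- diag(solve(cor_mat))
print(round(vif, 1))
```

A VIF well above 10 is a common rule-of-thumb signal that a coefficient's standard error is heavily inflated by collinearity.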

Data card

1. Source

Hald cement data, available in R as MASS::cement (sometimes called the Hald dataset). A historical regression textbook example relating the heat evolved during cement hardening to four chemical composition variables.

2. Outcome

y: heat evolved (calories per gram of cement), numeric.

3. Predictors

Four numeric composition variables (percentages of active ingredients), stored on their original scale in cement. In slides we often z-score x1–x4 and keep y on the original scale:

Column	Role in class
`x1`, `x2`, `x3`, `x4`	Predictors (correlated)
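As a rough check of what z-scoring does and does not change, the sketch below standardizes x1–x4, confirms each has mean 0 and standard deviation 1, and shows that the fitted values of the full model are the same on either scale. It assumes the data come from MASS::cement; preds, fit_raw, and fit_scaled are illustrative names.

```r
# Assumption: the Hald data are taken from the MASS package.
hald <- MASS::cement
preds <- c("x1", "x2", "x3", "x4")

# z-score the predictors only; y stays on its original scale.
hald_scaled <- hald
hald_scaled[, preds] <- scale(hald[, preds])

# After z-scoring, each predictor has mean 0 and standard deviation 1.
print(round(colMeans(hald_scaled[, preds]), 10))
print(sapply(hald_scaled[, preds], sd))

# Centering and scaling changes the units of the coefficients,
# but not the fitted values of the linear model.
fit_raw    <- lm(y ~ ., data = hald)
fit_scaled <- lm(y ~ ., data = hald_scaled)
print(all.equal(fitted(fit_raw), fitted(fit_scaled)))
```

Scaling matters for penalized methods such as ridge or lasso, where the penalty depends on coefficient size. For plain OLS it only rescales the coefficients.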

4. Sample size

13 observations. Complete cases only (no NA in the built-in table).
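A quick way to confirm the row count and the absence of missing values before a session, assuming the data are read from MASS::cement:

```r
# Assumption: the Hald data are taken from the MASS package.
hald <- MASS::cement

n_rows <- nrow(hald)      # number of observations
has_na <- anyNA(hald)     # TRUE if any cell is missing

# Stop early if the table does not match what this card describes.
stopifnot(n_rows == 13, !has_na)
str(hald)
```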

5. Leakage risks

With n = 13, any in-sample comparison (AIC on the same rows, etc.) is extremely optimistic if reported as “generalization.”
Correlated predictors: which individual predictors look “significant” can flip under small perturbations of the data, so avoid selecting on p-values without regularization or held-out checks.
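The gap between in-sample and out-of-sample error can be illustrated with leave-one-out cross-validation. The sketch below is one possible way to do it, not a prescribed course workflow. It assumes the data come from MASS::cement, and rmse_in, loo_err, and rmse_loo are illustrative names.

```r
# Assumption: the Hald data are taken from the MASS package.
hald <- MASS::cement

# In-sample RMSE: the full model evaluated on the same 13 rows used to fit it.
full_fit <- lm(y ~ x1 + x2 + x3 + x4, data = hald)
rmse_in <- sqrt(mean(residuals(full_fit)^2))

# Leave-one-out: refit without row i, then predict the held-out row i.
loo_err <- vapply(seq_len(nrow(hald)), function(i) {
  fit_i <- lm(y ~ x1 + x2 + x3 + x4, data = hald[-i, ])
  pred_i <- predict(fit_i, newdata = hald[i, , drop = FALSE])
  as.numeric(hald$y[i] - pred_i)
}, numeric(1))
rmse_loo <- sqrt(mean(loo_err^2))

print(c(in_sample = rmse_in, leave_one_out = rmse_loo))
```

For OLS the leave-one-out error is never smaller than the in-sample error, so the difference between the two numbers gives a sense of how optimistic the in-sample figure is at n = 13.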

6. Subgroups

Not applicable (no batch or site fields in this table).
