Data card — synthetic gene x trait

How to load

Generated in R (same logic as the notebook). Minimal version:

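set.seed(420)  # fixed seed so the simulation is reproducible
n <- 150       # number of samples
p <- 10        # number of genes
gene_names <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE",
                "GeneF", "GeneG", "GeneH", "GeneI", "GeneJ")
# Start from independent standard-normal expression values
gene_data <- matrix(rnorm(n * p), nrow = n)
colnames(gene_data) <- gene_names
# trait depends only on GeneA and GeneB (the causal genes)
trait <- -0.5 * gene_data[, "GeneA"] + 0.8 * gene_data[, "GeneB"] + rnorm(n, sd = 0.3)
# Overwrite GeneC, GeneD, GeneJ with noisy functions of a causal gene;
# trait is already computed, so these columns stay non-causal
gene_data[, "GeneC"] <- gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
gene_data[, "GeneD"] <- -0.3 * gene_data[, "GeneB"] + rnorm(n, sd = 0.2)
gene_data[, "GeneJ"] <- 0.7 * gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
# One data frame: ten gene columns plus the trait column
data_trait <- as.data.frame(cbind(gene_data, trait))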

Where we use this

Caveats

  • Fully simulated — no human or animal measurements.
  • Ground truth known — GeneA and GeneB are causal for trait; GeneC, GeneD, GeneJ are correlated noise; others are irrelevant.
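Because the ground truth is known, it can be checked directly. The sketch below regenerates the data with the same steps as the minimal version, then prints the correlations imposed by construction and an OLS fit on the two causal genes. The choice of checks and the names proxy_cor and fit_causal are illustrative, not taken from the notebook.

```r
set.seed(420)
n <- 150
p <- 10
gene_names <- paste0("Gene", LETTERS[1:p])  # GeneA ... GeneJ
gene_data <- matrix(rnorm(n * p), nrow = n)
colnames(gene_data) <- gene_names
trait <- -0.5 * gene_data[, "GeneA"] + 0.8 * gene_data[, "GeneB"] + rnorm(n, sd = 0.3)
gene_data[, "GeneC"] <- gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
gene_data[, "GeneD"] <- -0.3 * gene_data[, "GeneB"] + rnorm(n, sd = 0.2)
gene_data[, "GeneJ"] <- 0.7 * gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
data_trait <- as.data.frame(cbind(gene_data, trait))

# Correlations between each causal gene and its non-causal proxies
proxy_cor <- c(
  A_C = cor(data_trait$GeneA, data_trait$GeneC),
  B_D = cor(data_trait$GeneB, data_trait$GeneD),
  A_J = cor(data_trait$GeneA, data_trait$GeneJ)
)
print(round(proxy_cor, 2))

# OLS on the causal genes only; estimates should land near -0.5 and 0.8
fit_causal <- lm(trait ~ GeneA + GeneB, data = data_trait)
print(round(coef(fit_causal), 2))
```

GeneD is built with a negative coefficient on GeneB, so its correlation with GeneB is negative.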

Data card

1. Source

Simulated in R (set.seed(420)). Script matches linear-regression-lasso.Rmd. No external file download.

2. Outcome

  • trait: continuous (arbitrary units), numeric.

3. Predictors

Ten gene expression columns GeneA–GeneJ, standard normal before correlations are imposed:

Gene            Role in data-generating process
GeneA, GeneB    Causal for trait
GeneC           Correlated with GeneA (not causal)
GeneD           Correlated with GeneB (not causal)
GeneJ           Strongly correlated with GeneA (not causal)
GeneE–GeneI     Mostly noise; some pairwise correlation among E–H

Some models standardize the predictors before fitting (LASSO in particular; see the slides).
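Standardization matters here because the overwritten columns are no longer on unit scale. A minimal sketch, assuming base R's scale() is an acceptable way to standardize; the names x and x_std are illustrative:

```r
set.seed(420)
n <- 150
p <- 10
gene_names <- paste0("Gene", LETTERS[1:p])  # GeneA ... GeneJ
gene_data <- matrix(rnorm(n * p), nrow = n)
colnames(gene_data) <- gene_names
trait <- -0.5 * gene_data[, "GeneA"] + 0.8 * gene_data[, "GeneB"] + rnorm(n, sd = 0.3)
gene_data[, "GeneC"] <- gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
gene_data[, "GeneD"] <- -0.3 * gene_data[, "GeneB"] + rnorm(n, sd = 0.2)
gene_data[, "GeneJ"] <- 0.7 * gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
data_trait <- as.data.frame(cbind(gene_data, trait))

x <- as.matrix(data_trait[, gene_names])

# Raw column SDs: by construction GeneD and GeneJ fall below 1
print(round(apply(x, 2, sd), 2))

# Center each column to mean 0 and scale it to SD 1
x_std <- scale(x)
print(round(apply(x_std, 2, sd), 2))
```

If the LASSO is fit with the glmnet package (an assumption; this card does not name the package), glmnet() already standardizes internally by default through standardize = TRUE, so a manual scale() step is mainly useful for other models or for inspection.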

4. Sample size

n = 150. No missing values in the simulation.
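These properties can be asserted right after generating the data. The sketch below repeats the minimal simulation and stops with an error if the shape, types, or completeness differ from what this card states; the particular set of checks is a suggestion, not part of the original script.

```r
set.seed(420)
n <- 150
p <- 10
gene_names <- paste0("Gene", LETTERS[1:p])  # GeneA ... GeneJ
gene_data <- matrix(rnorm(n * p), nrow = n)
colnames(gene_data) <- gene_names
trait <- -0.5 * gene_data[, "GeneA"] + 0.8 * gene_data[, "GeneB"] + rnorm(n, sd = 0.3)
gene_data[, "GeneC"] <- gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
gene_data[, "GeneD"] <- -0.3 * gene_data[, "GeneB"] + rnorm(n, sd = 0.2)
gene_data[, "GeneJ"] <- 0.7 * gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
data_trait <- as.data.frame(cbind(gene_data, trait))

stopifnot(
  nrow(data_trait) == 150,                          # n = 150
  ncol(data_trait) == 11,                           # ten genes plus trait
  !anyNA(data_trait),                               # no missing values
  all(vapply(data_trait, is.numeric, logical(1)))   # every column numeric
)
str(data_trait)
```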

5. Leakage risks

  • Reporting in-sample R² or RMSE on the same rows used to fit or tune (LASSO λ, AIC) overstates performance.
  • Train/test split and CV inside training (Day 2 refactor lab) are the fix.
  • Correlated noise genes can enter OLS or trees without causal meaning — discuss interpretation.
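The split-then-tune workflow described above can be sketched in base R. Everything beyond the simulation itself is an assumption made for illustration: the 70/30 split, the second seed, the 5 folds, the two candidate formulas, and the use of plain OLS instead of LASSO so that no extra package is needed. The Day 2 lab may differ in all of these.

```r
set.seed(420)
n <- 150
p <- 10
gene_names <- paste0("Gene", LETTERS[1:p])  # GeneA ... GeneJ
gene_data <- matrix(rnorm(n * p), nrow = n)
colnames(gene_data) <- gene_names
trait <- -0.5 * gene_data[, "GeneA"] + 0.8 * gene_data[, "GeneB"] + rnorm(n, sd = 0.3)
gene_data[, "GeneC"] <- gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
gene_data[, "GeneD"] <- -0.3 * gene_data[, "GeneB"] + rnorm(n, sd = 0.2)
gene_data[, "GeneJ"] <- 0.7 * gene_data[, "GeneA"] + rnorm(n, sd = 0.2)
data_trait <- as.data.frame(cbind(gene_data, trait))

# Hold out 30% of rows; ratio and seed are illustrative choices
set.seed(1)
test_idx <- sample(n, size = round(0.3 * n))
train <- data_trait[-test_idx, ]
test  <- data_trait[test_idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# 5-fold CV using training rows only, to choose between candidate models
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(train)))
cv_rmse <- function(f) {
  mean(vapply(seq_len(k), function(i) {
    fit_i <- lm(f, data = train[fold != i, ])
    held  <- train[fold == i, ]
    rmse(held$trait, predict(fit_i, newdata = held))
  }, numeric(1)))
}
candidates <- list(all_genes = trait ~ ., causal_only = trait ~ GeneA + GeneB)
cv_scores <- vapply(candidates, cv_rmse, numeric(1))
print(cv_scores)

# Refit the CV-selected model on all training rows, then score the test set once
best <- names(which.min(cv_scores))
final_fit <- lm(candidates[[best]], data = train)
rmse_test <- rmse(test$trait, predict(final_fit, newdata = test))
print(c(selected = best, test_rmse = round(rmse_test, 3)))
```

The causal_only candidate uses knowledge of the ground truth, which is only available because the data are simulated; with real data the candidates would come from the modelling procedure itself.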

6. Subgroups

Not applicable in simulation. In real omics you would record batch, sex, and site.

