Palmer Penguins — predicting sex (binary)

Author

Aparna Pandey and Stephan Peischl

Overview

This notebook predicts sex (female vs male) from morphometrics and island / year only. We deliberately omit species as a predictor so the task is not trivially solved by species–sex composition differences alone (in exercises you can add species and compare accuracy — then discuss leakage and what you are willing to claim scientifically).

Models: glm and rpart. Train/test split and confusion tables: tidymodels (initial_split, yardstick::conf_mat). For a full Adelie-vs-Gentoo pipeline, see Module 04.

See Palmer Penguins data card and the related notebooks (mass, species, multiclass). For SHAP on this task (without species), see Module 06 and Day 4 slides.

Prepare data (complete cases on `sex`)

data("penguins", package = "palmerpenguins")
pg <- penguins |>
  tidyr::drop_na(sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island, year) |>
  mutate(
    sex = droplevels(sex),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species)

table(pg$sex)


female   male 
   165    168

nrow(pg)

[1] 333

Pair plot (measurements coloured by sex)

GGally::ggpairs(
  pg,
  columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
  aes(color = sex, alpha = 0.35)
) +
  theme_minimal()

Train / test split (stratified on `sex`)

set.seed(11)
split <- initial_split(pg, prop = 0.75, strata = sex)
train <- training(split)
test <- testing(split)

Logistic regression (no `species` in formula)

log_fit <- glm(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  family = binomial()
)
summary(log_fit)


Call:
glm(formula = sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
    body_mass_g + island + year, family = binomial(), data = train)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -5.308e+02  6.001e+02  -0.885   0.3764    
bill_length_mm     1.186e-01  6.293e-02   1.885   0.0594 .  
bill_depth_mm      2.082e+00  3.025e-01   6.885 5.78e-12 ***
flipper_length_mm -1.680e-02  4.251e-02  -0.395   0.6928    
body_mass_g        5.056e-03  9.864e-04   5.125 2.97e-07 ***
islandDream       -2.113e-01  7.524e-01  -0.281   0.7788    
islandTorgersen   -8.101e-02  8.358e-01  -0.097   0.9228    
year               2.351e-01  2.995e-01   0.785   0.4325    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 345.15  on 248  degrees of freedom
Residual deviance: 120.79  on 241  degrees of freedom
AIC: 136.79

Number of Fisher Scoring iterations: 7

p_test <- predict(log_fit, newdata = test, type = "response")
pred_sex <- factor(ifelse(p_test > 0.5, levels(train$sex)[2], levels(train$sex)[1]), levels = levels(train$sex))
tibble(truth = test$sex, .pred_class = pred_sex) |>
  conf_mat(truth = truth, estimate = .pred_class)

          Truth
Prediction female male
    female     41    6
    male        1   36

Classification tree

tree_fit <- rpart(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  method = "class"
)
rpart.plot(tree_fit, type = 4, extra = 104, main = "Sex from morphology (training data)")

pred_t <- predict(tree_fit, test, type = "class") |> factor(levels = levels(test$sex))
tibble(truth = test$sex, .pred_class = pred_t) |>
  conf_mat(truth = truth, estimate = .pred_class)

          Truth
Prediction female male
    female     39    5
    male        3   37

2D view (bill length vs flipper; other predictors at training medians / modes)

plot_boundary <- function(model, data, f1, f2) {
  r1 <- range(data[[f1]])
  r2 <- range(data[[f2]])
  grid <- expand.grid(
    seq(r1[1], r1[2], length.out = 100),
    seq(r2[1], r2[2], length.out = 100)
  )
  names(grid) <- c(f1, f2)
  for (nm in setdiff(names(data), c(f1, f2, "sex"))) {
    v <- data[[nm]]
    grid[[nm]] <- if (is.numeric(v)) stats::median(v, na.rm = TRUE) else {
      tab <- table(v)
      names(tab)[which.max(tab)]
    }
  }
  grid$p_male <- predict(model, newdata = grid, type = "response")
  lv <- levels(data$sex)
  grid$cls <- factor(ifelse(grid$p_male > 0.5, lv[2], lv[1]), levels = lv)
  ggplot(data, aes(!!sym(f1), !!sym(f2))) +
    geom_raster(data = grid, aes(!!sym(f1), !!sym(f2), fill = cls), alpha = 0.22, inherit.aes = FALSE) +
    geom_point(aes(shape = sex, fill = sex), size = 2.2, color = "gray25") +
    scale_fill_brewer(palette = "Set1") +
    theme_minimal() +
    labs(
      title = "Logistic regions for sex (bill vs flipper)",
      subtitle = "Other predictors fixed; species not used as input",
      fill = "Region", shape = "Sex"
    )
}
print(plot_boundary(log_fit, train, "bill_length_mm", "flipper_length_mm"))

Takeaways

Missing sex shrinks usable n — discuss NA reporting and whether imputation is defensible.
Omitting species makes errors more interesting; adding it back is a one-line ablation for discussion.
Compare difficulty here to Adelie vs Gentoo in penguins-classification.Rmd.

--- title: "Palmer Penguins — predicting sex (binary)" author: "Aparna Pandey and Stephan Peischl" format: html: toc: true code-tools: true engine: knitr --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) library(palmerpenguins) library(dplyr) library(ggplot2) library(GGally) library(rpart) library(rpart.plot) library(tidymodels) library(tidyr) library(rlang) theme_set(theme_classic()) ``` # Overview This notebook predicts **`sex`** (female vs male) from **morphometrics and island / year only**. We **deliberately omit `species`** as a predictor so the task is not trivially solved by species–sex composition differences alone (in exercises you can add `species` and compare accuracy — then discuss **leakage** and what you are willing to claim scientifically). **Models:** `glm` and `rpart`. **Train/test split and confusion tables:** `tidymodels` (`initial_split`, `yardstick::conf_mat`). For a full Adelie-vs-Gentoo pipeline, see [Module 04](../modules/module-04-pipeline.qmd#train-test-last-fit). See **[Palmer Penguins data card](../data/cards/palmer-penguins.qmd)** and the related notebooks (mass, species, multiclass). For **SHAP** on this task (without `species`), see [Module 06](../modules/module-06-evaluation-and-interpretability.qmd#shap-shapley-values-on-penguins-sex) and [Day 4 slides](../slides/day-04-thursday.html). ## Prepare data (complete cases on `sex`) ```{r} data("penguins", package = "palmerpenguins") pg <- penguins |> tidyr::drop_na(sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island, year) |> mutate( sex = droplevels(sex), year = as.numeric(year) ) |> dplyr::select(-species) table(pg$sex) nrow(pg) ``` ## Pair plot (measurements coloured by sex) ```{r fig.width=8, fig.height=5} GGally::ggpairs( pg, columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"), aes(color = sex, alpha = 0.35) ) + theme_minimal() ``` ## Train / test split (stratified on `sex`) ```{r} set.seed(11) split <- initial_split(pg, prop = 0.75, strata = sex) train <- training(split) test <- testing(split) ``` ## Logistic regression (no `species` in formula) ```{r} log_fit <- glm( sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year, data = train, family = binomial() ) summary(log_fit) ``` ```{r} p_test <- predict(log_fit, newdata = test, type = "response") pred_sex <- factor(ifelse(p_test > 0.5, levels(train$sex)[2], levels(train$sex)[1]), levels = levels(train$sex)) tibble(truth = test$sex, .pred_class = pred_sex) |> conf_mat(truth = truth, estimate = .pred_class) ``` ## Classification tree ```{r fig.width=8, fig.height=5.5} tree_fit <- rpart( sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year, data = train, method = "class" ) rpart.plot(tree_fit, type = 4, extra = 104, main = "Sex from morphology (training data)") ``` ```{r} pred_t <- predict(tree_fit, test, type = "class") |> factor(levels = levels(test$sex)) tibble(truth = test$sex, .pred_class = pred_t) |> conf_mat(truth = truth, estimate = .pred_class) ``` ## 2D view (bill length vs flipper; other predictors at training medians / modes) ```{r fig.width=7, fig.height=4.5} plot_boundary <- function(model, data, f1, f2) { r1 <- range(data[[f1]]) r2 <- range(data[[f2]]) grid <- expand.grid( seq(r1[1], r1[2], length.out = 100), seq(r2[1], r2[2], length.out = 100) ) names(grid) <- c(f1, f2) for (nm in setdiff(names(data), c(f1, f2, "sex"))) { v <- data[[nm]] grid[[nm]] <- if (is.numeric(v)) stats::median(v, na.rm = TRUE) else { tab <- table(v) names(tab)[which.max(tab)] } } grid$p_male <- predict(model, newdata = grid, type = "response") lv <- levels(data$sex) grid$cls <- factor(ifelse(grid$p_male > 0.5, lv[2], lv[1]), levels = lv) ggplot(data, aes(!!sym(f1), !!sym(f2))) + geom_raster(data = grid, aes(!!sym(f1), !!sym(f2), fill = cls), alpha = 0.22, inherit.aes = FALSE) + geom_point(aes(shape = sex, fill = sex), size = 2.2, color = "gray25") + scale_fill_brewer(palette = "Set1") + theme_minimal() + labs( title = "Logistic regions for sex (bill vs flipper)", subtitle = "Other predictors fixed; species not used as input", fill = "Region", shape = "Sex" ) } print(plot_boundary(log_fit, train, "bill_length_mm", "flipper_length_mm")) ``` ## Takeaways - **Missing `sex`** shrinks usable *n* — discuss NA reporting and whether imputation is defensible. - Omitting **`species`** makes errors more interesting; adding it back is a one-line **ablation** for discussion. - Compare difficulty here to **Adelie vs Gentoo** in `penguins-classification.Rmd`.