Palmer Penguins — predicting sex (binary)

Author

Aparna Pandey and Stephan Peischl

Overview

This notebook predicts sex (female vs male) from morphometrics and island / year only. We deliberately omit species as a predictor so the task is not trivially solved by species–sex composition differences alone (in exercises you can add species and compare accuracy — then discuss leakage and what you are willing to claim scientifically).

Models: glm and rpart. Train/test split and confusion tables: tidymodels (initial_split, yardstick::conf_mat). For a full Adelie-vs-Gentoo pipeline, see Module 04.

See Palmer Penguins data card and the related notebooks (mass, species, multiclass). For SHAP on this task (without species), see Module 06 and Day 4 slides.

Prepare data (complete cases on sex)

data("penguins", package = "palmerpenguins")
pg <- penguins |>
  tidyr::drop_na(sex, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, island, year) |>
  mutate(
    sex = droplevels(sex),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species)

table(pg$sex)

female   male 
   165    168 
nrow(pg)
[1] 333

Pair plot (measurements coloured by sex)

GGally::ggpairs(
  pg,
  columns = c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
  aes(color = sex, alpha = 0.35)
) +
  theme_minimal()

Train / test split (stratified on sex)

set.seed(11)
split <- initial_split(pg, prop = 0.75, strata = sex)
train <- training(split)
test <- testing(split)

Logistic regression (no species in formula)

log_fit <- glm(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  family = binomial()
)
summary(log_fit)

Call:
glm(formula = sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
    body_mass_g + island + year, family = binomial(), data = train)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -5.308e+02  6.001e+02  -0.885   0.3764    
bill_length_mm     1.186e-01  6.293e-02   1.885   0.0594 .  
bill_depth_mm      2.082e+00  3.025e-01   6.885 5.78e-12 ***
flipper_length_mm -1.680e-02  4.251e-02  -0.395   0.6928    
body_mass_g        5.056e-03  9.864e-04   5.125 2.97e-07 ***
islandDream       -2.113e-01  7.524e-01  -0.281   0.7788    
islandTorgersen   -8.101e-02  8.358e-01  -0.097   0.9228    
year               2.351e-01  2.995e-01   0.785   0.4325    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 345.15  on 248  degrees of freedom
Residual deviance: 120.79  on 241  degrees of freedom
AIC: 136.79

Number of Fisher Scoring iterations: 7
p_test <- predict(log_fit, newdata = test, type = "response")
pred_sex <- factor(ifelse(p_test > 0.5, levels(train$sex)[2], levels(train$sex)[1]), levels = levels(train$sex))
tibble(truth = test$sex, .pred_class = pred_sex) |>
  conf_mat(truth = truth, estimate = .pred_class)
          Truth
Prediction female male
    female     41    6
    male        1   36

Classification tree

tree_fit <- rpart(
  sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g + island + year,
  data = train,
  method = "class"
)
rpart.plot(tree_fit, type = 4, extra = 104, main = "Sex from morphology (training data)")

pred_t <- predict(tree_fit, test, type = "class") |> factor(levels = levels(test$sex))
tibble(truth = test$sex, .pred_class = pred_t) |>
  conf_mat(truth = truth, estimate = .pred_class)
          Truth
Prediction female male
    female     39    5
    male        3   37

2D view (bill length vs flipper; other predictors at training medians / modes)

plot_boundary <- function(model, data, f1, f2) {
  r1 <- range(data[[f1]])
  r2 <- range(data[[f2]])
  grid <- expand.grid(
    seq(r1[1], r1[2], length.out = 100),
    seq(r2[1], r2[2], length.out = 100)
  )
  names(grid) <- c(f1, f2)
  for (nm in setdiff(names(data), c(f1, f2, "sex"))) {
    v <- data[[nm]]
    grid[[nm]] <- if (is.numeric(v)) stats::median(v, na.rm = TRUE) else {
      tab <- table(v)
      names(tab)[which.max(tab)]
    }
  }
  grid$p_male <- predict(model, newdata = grid, type = "response")
  lv <- levels(data$sex)
  grid$cls <- factor(ifelse(grid$p_male > 0.5, lv[2], lv[1]), levels = lv)
  ggplot(data, aes(!!sym(f1), !!sym(f2))) +
    geom_raster(data = grid, aes(!!sym(f1), !!sym(f2), fill = cls), alpha = 0.22, inherit.aes = FALSE) +
    geom_point(aes(shape = sex, fill = sex), size = 2.2, color = "gray25") +
    scale_fill_brewer(palette = "Set1") +
    theme_minimal() +
    labs(
      title = "Logistic regions for sex (bill vs flipper)",
      subtitle = "Other predictors fixed; species not used as input",
      fill = "Region", shape = "Sex"
    )
}
print(plot_boundary(log_fit, train, "bill_length_mm", "flipper_length_mm"))

Takeaways

  • Missing sex shrinks usable n — discuss NA reporting and whether imputation is defensible.
  • Omitting species makes errors more interesting; adding it back is a one-line ablation for discussion.
  • Compare difficulty here to Adelie vs Gentoo in penguins-classification.Rmd.