Chapter 8: Comparing learners fairly (random forest, XGBoost, small neural net)

This chapter focuses on fair comparison design. We keep folds, recipe, and metrics fixed, then change only the model family.

This is the natural follow-up to Chapter 4 (tuned tree) and Chapter 7 (which metric to optimize). The Thursday lecture runs the same comparison on slides with metric_set(accuracy) (Part B); this page also reports ROC AUC for side-by-side reading.

Here we fix reasonable hyperparameters so the page stays fast. In coursework you could wrap each workflow in tune_grid(), as in Chapter 4.

Packages: tidymodels, palmerpenguins, dplyr, tidyr, purrr, plus ranger, xgboost, and nnet for the engines used below (nnet normally ships with R; install the others from CRAN if fit_resamples() asks for them).


library(tidymodels)
library(palmerpenguins)
library(dplyr)
library(tidyr)
library(purrr)

peng_cls <- palmerpenguins::penguins |>
  dplyr::filter(species %in% c("Adelie", "Gentoo")) |>
  dplyr::mutate(
    y = factor(species, levels = c("Adelie", "Gentoo")),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species, -flipper_length_mm, -body_mass_g) |>
  tidyr::drop_na()

rec <- recipe(y ~ ., data = peng_cls) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

set.seed(7)  # fold assignment is random; seed it so every model is scored on reproducible splits
folds <- vfold_cv(peng_cls, v = 5, strata = y)
met <- metric_set(roc_auc, accuracy)
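
To see what the three models will actually receive, it can help to estimate the recipe once on the full data and look at the processed columns. The sketch below rebuilds the setup objects so that it runs on its own; the name baked is introduced here for illustration only. Note that step_normalize() also rescales the dummy columns, because they are numeric by the time that step runs.

```r
library(tidymodels)
library(palmerpenguins)

# Same two-species data and recipe as in the setup code
peng_cls <- palmerpenguins::penguins |>
  dplyr::filter(species %in% c("Adelie", "Gentoo")) |>
  dplyr::mutate(
    y = factor(species, levels = c("Adelie", "Gentoo")),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species, -flipper_length_mm, -body_mass_g) |>
  tidyr::drop_na()

rec <- recipe(y ~ ., data = peng_cls) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Estimate the recipe on all rows and return the processed training data
# (baked is a hypothetical name, used only for this inspection)
baked <- rec |> prep() |> bake(new_data = NULL)

dplyr::glimpse(baked)      # predictor columns after dummy coding and scaling
dplyr::count(peng_cls, y)  # class balance before resampling
```

This full-data prep is for inspection only. Inside fit_resamples() the recipe is re-estimated on each analysis set, so the scaling constants differ slightly from fold to fold.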

Three model specifications (no tuning — fixed settings)

These settings are intentionally reasonable-but-not-optimized so students can focus on fair comparison design (same folds, same recipe) before running larger tuning grids.

  • RF: many trees, modest mtry for robust baseline behavior.
  • XGBoost: shallow trees + small learning rate for smoother boosting.
  • MLP: small hidden layer and weight decay to limit overfitting on small n.

rf_spec <- rand_forest(
  mtry = 4,
  trees = 400,
  min_n = 2
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

xgb_spec <- boost_tree(
  trees = 200,
  tree_depth = 3,
  learn_rate = 0.03
) |>
  set_engine("xgboost") |>
  set_mode("classification")

mlp_spec <- mlp(
  hidden_units = 8,
  penalty = 0.1,
  epochs = 150
) |>
  set_engine("nnet", trace = FALSE) |>
  set_mode("classification")
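
The argument names above (trees, min_n, learn_rate, penalty, and so on) are parsnip's engine-neutral names. As a quick check of what each engine will be asked to do, translate() prints the underlying call template. The sketch below repeats the three specifications so it runs on its own; in those templates, trees should appear as num.trees for ranger and nrounds for xgboost, learn_rate as eta, tree_depth as max_depth, and hidden_units, penalty, and epochs as size, decay, and maxit for nnet.

```r
library(tidymodels)

rf_spec <- rand_forest(mtry = 4, trees = 400, min_n = 2) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

xgb_spec <- boost_tree(trees = 200, tree_depth = 3, learn_rate = 0.03) |>
  set_engine("xgboost") |>
  set_mode("classification")

mlp_spec <- mlp(hidden_units = 8, penalty = 0.1, epochs = 150) |>
  set_engine("nnet", trace = FALSE) |>
  set_mode("classification")

# Print the engine-level call template for each specification
translate(rf_spec)
translate(xgb_spec)
translate(mlp_spec)
```

Any argument not set in the specification keeps the engine's own default, which is part of what "fixed settings" means on this page.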

Three workflows — same recipe, different engine

wf_rf <- workflow() |> add_recipe(rec) |> add_model(rf_spec)
wf_xgb <- workflow() |> add_recipe(rec) |> add_model(xgb_spec)
wf_mlp <- workflow() |> add_recipe(rec) |> add_model(mlp_spec)

Cross-validated resampling (fit_resamples)

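Each call re-estimates the recipe on the analysis portion of every fold and scores the fitted model on the held-out assessment portion. No preprocessing statistic is learned from assessment rows, and all three models are scored on identical splits, which is what makes the side-by-side comparison fair.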

What is held constant: data rows, preprocessing steps, folds, and metrics. What changes: only the model family and hyperparameters.

set.seed(7)
rs_rf <- fit_resamples(wf_rf, folds, metrics = met)
rs_xgb <- fit_resamples(wf_xgb, folds, metrics = met)
rs_mlp <- fit_resamples(wf_mlp, folds, metrics = met)
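
For coursework, the fixed-settings call can be swapped for a tuned one. The following is a minimal sketch for the random forest only, assuming the same data, recipe, folds, and metrics as above (rebuilt here so the block runs on its own). The names rf_tune_spec, wf_rf_tune, rf_grid, and rf_tuned, and the grid values themselves, are illustrative choices and are not taken from the course materials.

```r
library(tidymodels)
library(palmerpenguins)

peng_cls <- palmerpenguins::penguins |>
  dplyr::filter(species %in% c("Adelie", "Gentoo")) |>
  dplyr::mutate(
    y = factor(species, levels = c("Adelie", "Gentoo")),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species, -flipper_length_mm, -body_mass_g) |>
  tidyr::drop_na()

rec <- recipe(y ~ ., data = peng_cls) |>
  step_zv(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

set.seed(7)
folds <- vfold_cv(peng_cls, v = 5, strata = y)
met <- metric_set(roc_auc, accuracy)

# Random forest with two hyperparameters left open for tuning
rf_tune_spec <- rand_forest(mtry = tune(), trees = 400, min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

wf_rf_tune <- workflow() |> add_recipe(rec) |> add_model(rf_tune_spec)

# Small illustrative grid: 2 x 2 = 4 candidate settings (assumed values)
rf_grid <- tidyr::crossing(mtry = c(2, 4), min_n = c(2, 10))

set.seed(7)
rf_tuned <- tune_grid(wf_rf_tune, resamples = folds, grid = rf_grid, metrics = met)

# Best candidates by cross-validated ROC AUC
show_best(rf_tuned, metric = "roc_auc", n = 3)
```

If one learner is tuned, the others should get a comparable tuning budget on the same folds; otherwise the comparison measures tuning effort rather than model family.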

Compare metrics (mean across folds)

Interpretation tip: compare both mean and std_err. A tiny difference in mean (e.g. 0.01 AUC) with overlapping uncertainty is often not a meaningful winner on small data.

cmp <- purrr::imap_dfr(
  list(random_forest = rs_rf, xgboost = rs_xgb, mlp_nnet = rs_mlp),
  \(rs, name) collect_metrics(rs) |> dplyr::mutate(model = name)
)

cmp |>
  dplyr::select(model, .metric, mean, std_err) |>
  dplyr::arrange(.metric, dplyr::desc(mean))
# A tibble: 6 × 4
  model         .metric   mean  std_err
  <chr>         <chr>    <dbl>    <dbl>
1 random_forest accuracy 1     0       
2 mlp_nnet      accuracy 1     0       
3 xgboost       accuracy 0.989 0.00753 
4 random_forest roc_auc  1     0       
5 mlp_nnet      roc_auc  1     0       
6 xgboost       roc_auc  0.999 0.000572
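
The std_err column is the standard deviation of the fold-level estimates divided by the square root of the number of folds. As a rough illustration in base R, the sketch below recomputes it from the five XGBoost accuracies printed in the fold table on this page (rounded to three decimals, so the result only approximately matches 0.00753). It then looks at per-fold differences against the random forest, which must have scored 1 in every fold given its mean of 1 and std_err of 0. The helper se() and the vector names are introduced here for illustration.

```r
# Fold-level XGBoost accuracies, rounded as printed in the fold table on this page
xgb_acc <- c(0.981, 1, 1, 0.962, 1)

# Random forest accuracy was 1 in every fold (mean 1, std_err 0)
rf_acc <- rep(1, 5)

# Standard error of the mean across folds: sd / sqrt(number of folds)
se <- function(x) sd(x) / sqrt(length(x))

mean(xgb_acc)  # cross-validated mean accuracy
se(xgb_acc)    # its standard error

# Paired per-fold differences: valid because both models saw the same folds
d <- rf_acc - xgb_acc
mean(d)          # average accuracy gap
mean(d) / se(d)  # gap expressed in standard errors
```

With these rounded inputs the gap is about 0.011 accuracy, or roughly 1.5 standard errors. Fold estimates are not independent, so treat this as a rough guide rather than a formal test.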

Optional — which fold was hardest?

collect_metrics(rs_xgb, summarize = FALSE) |>
  dplyr::select(id, .metric, .estimate) |>
  tidyr::pivot_wider(names_from = .metric, values_from = .estimate)
# A tibble: 5 × 3
  id    accuracy roc_auc
  <chr>    <dbl>   <dbl>
1 Fold1    0.981   0.999
2 Fold2    1       1    
3 Fold3    1       1    
4 Fold4    0.962   0.997
5 Fold5    1       1    

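The gaps here are small in absolute terms: with roughly 53 rows per assessment fold, 0.981 and 0.962 correspond to one and two misclassified penguins. If one fold is consistently harder across models, inspect whether it has unusual class balance, covariate mix, or missingness patterns before concluding a model is worse.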
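
One way to check class balance per fold is to count the outcome in each assessment set. This is a minimal sketch that rebuilds the data and folds so it runs on its own; fold_balance is a name introduced here, and the seed is an assumption, so fold membership may not match the run whose output is printed above.

```r
library(tidymodels)
library(palmerpenguins)

peng_cls <- palmerpenguins::penguins |>
  dplyr::filter(species %in% c("Adelie", "Gentoo")) |>
  dplyr::mutate(
    y = factor(species, levels = c("Adelie", "Gentoo")),
    year = as.numeric(year)
  ) |>
  dplyr::select(-species, -flipper_length_mm, -body_mass_g) |>
  tidyr::drop_na()

set.seed(7)  # assumed seed, for a reproducible illustration
folds <- vfold_cv(peng_cls, v = 5, strata = y)

# Count each class in the assessment set of every fold, one row per fold
fold_balance <- purrr::map2_dfr(
  folds$splits, folds$id,
  \(split, id) assessment(split) |>
    dplyr::count(y) |>
    dplyr::mutate(id = id)
) |>
  tidyr::pivot_wider(names_from = y, values_from = n)

fold_balance
```

Because the folds are stratified on y, the counts should be nearly identical across folds. A hard fold is then more likely explained by a few unusual rows than by class imbalance.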

Takeaways

  • Same folds + same recipe means differences come mainly from model family and your chosen settings, not from a sneaky preprocessing shortcut.
  • On small datasets, simpler models sometimes win by noise; report uncertainty (std_err from collect_metrics) and avoid over-interpreting tiny AUC gaps.
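  • A small MLP can underperform on tabular data because neural nets are sensitive to feature representation and tuning; with limited n, tree ensembles often offer a better bias-variance tradeoff out of the box. That did not happen in this run: the MLP tied the random forest at 1.0 on both metrics and XGBoost trailed by about 0.01 accuracy. Every model reaches a ROC AUC of at least 0.999, so the two species are almost perfectly separable with these predictors and the ranking says little about harder problems.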
  • To go deeper on metrics and explainability, continue with Chapter 6 and Chapter 5.