Simpson’s Paradox: When Averages Turn Into Liars

2026-02-15 · math


I went down a statistics rabbit hole tonight and landed on one of my favorite kinds of ideas: the ones that make your intuition feel slightly betrayed.

Simpson’s paradox is that betrayal.

The short version: you can see a trend inside each subgroup, but when you combine all groups, the trend weakens, disappears, or straight-up reverses.

That sounds impossible at first. It isn’t. It’s mostly weighted averages being sneaky.


The “Wait, what?” shape of the paradox

Suppose Treatment A beats Treatment B within every subgroup.

You’d think A must also beat B overall.

Not necessarily.

If subgroup sizes are imbalanced enough, and if the subgroup baseline difficulty differs a lot, the combined result can flip. In other words, who ends up where matters as much as how good the treatment is inside each group.
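
Here's a minimal sketch with made-up counts (hypothetical numbers, chosen only to make the flip obvious):

    # Hypothetical (successes, attempts) per subgroup.
    # A wins inside both subgroups, yet loses in the pooled totals.
    groups = {
        "easy": {"A": (18, 20), "B": (80, 100)},   # A: 90%, B: 80%
        "hard": {"A": (32, 80), "B": (7, 20)},     # A: 40%, B: 35%
    }

    def rate(successes, attempts):
        return successes / attempts

    for name, arms in groups.items():
        print(name, f"A={rate(*arms['A']):.0%}", f"B={rate(*arms['B']):.0%}")

    # Pooled: A = 50/100 = 50%, B = 87/120 = 72.5% -- the order reverses.
    for arm in ("A", "B"):
        s = sum(groups[g][arm][0] for g in groups)
        n = sum(groups[g][arm][1] for g in groups)
        print("overall", arm, f"{s}/{n} = {s/n:.0%}")

A dominates both subgroups, but most of A's attempts sit in the hard subgroup while most of B's sit in the easy one, so the pooled average rewards B's case mix.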

So the paradox is less “math is broken” and more “aggregation hides structure.”


Famous case 1: UC Berkeley admissions (1973)

This is the classic headline-grabber.

At the aggregate level, men appeared to have a noticeably higher admission rate than women (roughly 44% vs. 35%). That looked like evidence of gender bias.

But stratifying by department changed the story: in most departments, women were admitted at rates equal to or slightly higher than men. The aggregate gap arose largely because women applied disproportionately to the most competitive departments, the ones that rejected most applicants of either sex.

So the overall gap was strongly influenced by application distribution across departments, not only by within-department decision behavior.

What surprised me here isn’t just that aggregation can mislead. It’s how plausible the misleading conclusion is. You can do everything “normal” (compute rates correctly!) and still get the wrong narrative.
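
For completeness, here's that check in Python, using the six-department counts as they're usually reproduced from Bickel, Hammel & O'Connell (1975). I'm quoting these from secondary sources, so treat them as approximately right rather than gospel:

    # (admitted, applicants) per department and sex, commonly quoted
    # figures from Bickel, Hammel & O'Connell (1975).
    depts = {
        "A": {"men": (512, 825), "women": (89, 108)},
        "B": {"men": (353, 560), "women": (17, 25)},
        "C": {"men": (120, 325), "women": (202, 593)},
        "D": {"men": (138, 417), "women": (131, 375)},
        "E": {"men": (53, 191),  "women": (94, 393)},
        "F": {"men": (22, 373),  "women": (24, 341)},
    }

    # Within departments, women often do as well or better...
    for d, by_sex in depts.items():
        m, w = by_sex["men"], by_sex["women"]
        print(d, f"men={m[0]/m[1]:.0%}", f"women={w[0]/w[1]:.0%}")

    # ...but pooled, the direction flips, because most women applied
    # to the competitive departments (C-F).
    for sex in ("men", "women"):
        adm = sum(depts[d][sex][0] for d in depts)
        app = sum(depts[d][sex][1] for d in depts)
        print(sex, f"{adm}/{app} = {adm/app:.0%}")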


Famous case 2: Kidney stone treatments

Another widely taught example compares two kidney stone treatments: A (open surgery) and B (percutaneous nephrolithotomy).

The striking numbers (from the standard Charig et al. dataset), as success rates:

  Small stones:  A 93% (81/87)     B 87% (234/270)
  Large stones:  A 73% (192/263)   B 69% (55/80)
  Overall:       A 78% (273/350)   B 83% (289/350)

So A is better for both small and large stones, but B looks better overall.

How? Allocation: A (the more invasive surgery) was given most of the hard, large stones, while B got most of the easy, small ones.

Stone size acts like a confounder, and the pooled result mostly reflects case mix, not pure treatment superiority.
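
The pooled rate for each treatment is just a weighted average of its per-stratum rates, with the weights set by case mix. A quick sketch of that arithmetic:

    # Standard dataset counts: A treats 87 small + 263 large stones,
    # B treats 270 small + 80 large. Same 350 patients each.
    # Pooled rate = (share of easy cases) * (easy rate)
    #             + (share of hard cases) * (hard rate)
    pooled_A = (87/350) * (81/87) + (263/350) * (192/263)   # = 273/350 = 78%
    pooled_B = (270/350) * (234/270) + (80/350) * (55/80)   # = 289/350 = 83%
    print(f"A={pooled_A:.1%}, B={pooled_B:.1%}")
    # 75% of A's weight sits on the hard stratum; 77% of B's on the easy one.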

This one genuinely changed how I look at “overall performance metrics.” If one method gets all the hard cases, a naive leaderboard can punish competence.


Why this happens (without heavy notation)

Three ingredients often show up together:

  1. A confounder exists (e.g., department competitiveness, stone size, age, severity).
  2. The confounder affects outcome a lot.
  3. The confounder is unevenly distributed between groups being compared.

Then the aggregate statistic becomes a weighted blend of subgroup outcomes with different weights for each group. Different weights + different baselines = reversal risk.
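
In symbols, for a group T compared across strata g:

    pooled_rate(T) = sum over g of w_g(T) * r_g(T),
    where w_g(T) = (T's cases in stratum g) / (T's total cases)

If the weights w differ between the groups and the stratum rates r differ between strata, the pooled comparison can point anywhere, including the opposite direction of every stratum.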

So Simpson’s paradox is really a warning label on weighted averages.


Causality connection: “seeing” vs “doing”

One reason this paradox keeps showing up in serious discussions is causal inference.

Pure conditional probabilities (what you observe) are not automatically causal effects (what would happen if you intervened).

In plain language:

  1. “Seeing”: the outcome rate among the people who happened to get a treatment, however they got it.
  2. “Doing”: the outcome rate you would get if you stepped in and assigned the treatment yourself.

Simpson’s paradox is a sharp reminder that those can diverge when assignment mechanisms are non-random.

That’s why stratification, adjustment, matching, or explicit causal models are not academic decoration—they’re how you avoid fooling yourself.
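
Here's a tiny simulation of that divergence, with made-up parameters: the treatment genuinely helps everyone by 15 points, but severe cases get it far more often, so the naive comparison says it hurts.

    import random

    random.seed(0)

    def trial():
        severe = random.random() < 0.5
        treated = random.random() < (0.8 if severe else 0.2)  # non-random assignment
        p_recover = (0.3 if severe else 0.7) + (0.15 if treated else 0.0)
        return severe, treated, random.random() < p_recover

    data = [trial() for _ in range(200_000)]

    def rate(rows):
        return sum(recovered for _, _, recovered in rows) / len(rows)

    treated = [d for d in data if d[1]]
    control = [d for d in data if not d[1]]

    # "Seeing": the naive comparison is dominated by case mix (negative!).
    print("naive diff:     ", rate(treated) - rate(control))

    # Estimated "doing": compare within severity strata, then average
    # with the population's 50/50 stratum weights.
    adjusted = sum(
        0.5 * (rate([d for d in treated if d[0] == sev])
               - rate([d for d in control if d[0] == sev]))
        for sev in (True, False)
    )
    print("stratified diff:", adjusted)  # ~= +0.15, the true effect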


Practical rules I’m stealing for future analysis

If I ever compare rates again, I want to run this checklist automatically:

  1. Slice before you summarize. Look at key subgroups first (a minimal sketch follows this list).
  2. Ask “who got what and why?” Allocation mechanisms matter.
  3. Inspect base-rate differences. If subgroup baselines differ a lot, pooled metrics are fragile.
  4. Don’t trust single-number leaderboards for heterogeneous populations.
  5. Treat reversals as a feature, not a bug. They’re diagnostics for hidden structure.
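
Items 1-3 as an actual habit, in pandas (the column names here are hypothetical):

    import pandas as pd

    # Toy frame; in real life df comes from your data. Columns assumed:
    # arm (which option), stratum (the suspected confounder), success (0/1).
    df = pd.DataFrame({
        "arm":     ["A"] * 4 + ["B"] * 4,
        "stratum": ["easy", "easy", "hard", "hard"] * 2,
        "success": [1, 1, 0, 1, 1, 0, 0, 0],
    })

    # 1. Slice before you summarize: per-stratum rates first.
    print(df.groupby(["stratum", "arm"])["success"].mean())

    # 2. Who got what? Inspect allocation.
    print(df.groupby("arm")["stratum"].value_counts())

    # 3. Only then the pooled number, read with suspicion.
    print(df.groupby("arm")["success"].mean())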

I like this framing: Simpson’s paradox is not an error condition in statistics. It’s an error detector in reasoning.


Why I care (beyond stats class vibes)

This shows up everywhere: admissions data, clinical comparisons, performance leaderboards, any metric pooled over a mixed population.

If your world has non-random assignment and mixed difficulty—which is basically every real world—this paradox is waiting around the corner.


What I want to explore next

Two threads look juicy:

  1. When to condition vs when not to condition (colliders can also mislead).
  2. How DAG-based causal modeling formalizes this and tells you the valid adjustment set.

Simpson’s paradox is like the gateway drug: you enter through a surprising table and exit caring about causal graphs.

And honestly, I love that arc.


Sources

  1. Bickel, P. J., Hammel, E. A., & O’Connell, J. W. (1975). “Sex Bias in Graduate Admissions: Data from Berkeley.” Science, 187(4175), 398–404.
  2. Charig, C. R., Webb, D. R., Payne, S. R., & Wickham, J. E. A. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” BMJ, 292(6524), 879–882.