JVM GC Selection for Low-Latency Services: G1 vs ZGC vs Shenandoah (Practical Playbook)

Date: 2026-03-14
Category: knowledge
Audience: Java backend / execution infra engineers running latency-sensitive services

1) Why this matters

In low-latency systems, GC rarely drives the average; it amplifies tail latency.

If your p99.9 SLA is tight, one surprise pause can consume most of your latency budget, trigger retries, and start a feedback loop (queue growth → more allocations → more GC pressure).

The practical question is not "Which collector is best in general?" but:

Which collector gives the best tail-latency reliability per CPU and memory dollar for this workload?

2) Mental model: each collector optimizes a different frontier

G1 (the default collector since JDK 9 on server-class machines)

Goal: balanced throughput + reasonable pauses
Works well for many general server workloads
Pause goals are targets, not hard real-time guarantees
Usually best first baseline when latency constraints are moderate

ZGC (especially Generational ZGC in JDK 21+)

Goal: extremely low pauses with concurrent GC work
Strong fit when pause outliers dominate incidents
Trades additional concurrent CPU/memory overhead for pause stability
Particularly attractive for large heaps and strict tail SLAs

Shenandoah

Goal: low-pause operation via concurrent compaction
Similar strategic niche to ZGC (low-latency first)
Viability depends on your JDK vendor and version; not every distribution ships it (Oracle's JDK builds, for example, do not)

3) Quick decision table

Situation	Start with	Why
p99 SLA is moderate, throughput matters most	G1	Lowest migration risk; strong general-purpose behavior
p99.9/p99.99 is business-critical and pause spikes hurt	ZGC (Gen ZGC on JDK 21+)	Better tail behavior under allocation churn
Need low pauses and your platform already standardizes Shenandoah	Shenandoah	Good low-pause option if support/tooling are mature in your org
Heap > several GB and traffic burstiness causes long pause outliers	ZGC or Shenandoah trial	Concurrent collectors handle burst pressure more gracefully

Rule of thumb:

If your postmortems say "rare pause spike killed us" → evaluate ZGC/Shenandoah first.
If your postmortems say "CPU saturation and throughput collapse" → optimize allocation rate and G1 tuning first.

4) What to measure before touching flags

Do not start with random JVM options. Start with a measurement pack:

Latency SLO view
- p50 / p95 / p99 / p99.9 / p99.99
- Error-rate during GC-heavy windows
Allocation pressure
- MB/s allocated by endpoint or code path
- Object lifetime shape (short-lived burst vs long-lived cache)
GC log signals (-Xlog:gc*)
- Pause distribution (not just average)
- Concurrent cycle frequency
- Allocation stalls (ZGC), to-space exhaustion (G1), or degenerated/full GCs (Shenandoah)
System coupling signals
- CPU headroom during peak
- Queue depth / backpressure behavior
- Retry burst amplification

Without this, collector changes become superstition.
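
As a rough illustration of why the pause distribution matters more than the average, the sketch below computes nearest-rank percentiles over a set of samples. The class name TailPercentiles and the sample data are hypothetical; in practice the samples would come from your own GC logs or latency histograms.

```java
import java.util.Arrays;

/** Sketch: nearest-rank percentiles over pause or latency samples (milliseconds). */
final class TailPercentiles {

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static double percentile(double[] samplesMs, double pct) {
        if (samplesMs.length == 0 || pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < pct <= 100");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        // Clamp to a valid 1-based rank to guard against floating-point rounding.
        rank = Math.min(Math.max(rank, 1), sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical data: 9,999 pauses of 2 ms plus a single 80 ms outlier.
        double[] pausesMs = new double[10_000];
        Arrays.fill(pausesMs, 2.0);
        pausesMs[1234] = 80.0;

        double mean = Arrays.stream(pausesMs).average().orElse(0.0);
        System.out.printf("mean=%.3f ms%n", mean);
        for (double pct : new double[] {50.0, 99.0, 100.0}) {
            System.out.printf("p%s=%.1f ms%n", pct, percentile(pausesMs, pct));
        }
    }
}
```

With data shaped like this, the mean barely moves while the maximum shows the outlier, which is the case an average-only dashboard hides.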

5) Safe rollout sequence (works in production)

Phase A — Baseline hardening

Pin JDK version first (no moving target)
Fix obvious allocation leaks/churn hotspots
Lock a reproducible load profile (same traffic replay window)

Phase B — Controlled A/B

Keep app code identical
Change only GC mode + minimal required flags
Compare for at least one full traffic cycle (including peak)

Phase C — Tail-first acceptance gate

Adopt candidate only if:

p99.9 improves by a margin set in advance and larger than run-to-run noise
Error bursts during peak are reduced
CPU/memory overhead remains within explicit budget
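
One way to keep this gate from drifting into opinion is to write it down as a predicate before the A/B run. The sketch below is only an illustration: the record names, fields, and thresholds are hypothetical, and reading "reduced" as a strictly lower peak error rate is an assumption.

```java
/** Sketch of a tail-first acceptance gate; all names and thresholds are illustrative. */
final class TailFirstGate {

    /** Metrics observed over one full traffic cycle, including peak. */
    record Metrics(double p999Ms, double peakErrorRate, double peakCpuUtil, double memoryGb) {}

    /** Explicit budget agreed before the A/B run. */
    record Budget(double minP999Improvement, double maxCpuUtil, double maxMemoryGb) {}

    static boolean accept(Metrics baseline, Metrics candidate, Budget budget) {
        // p99.9 must improve by at least the agreed fraction (0.20 means 20%).
        boolean tailImproved =
                candidate.p999Ms() <= baseline.p999Ms() * (1.0 - budget.minP999Improvement());
        // Assumption: "reduced" is read as a strictly lower error rate at peak.
        boolean errorsReduced = candidate.peakErrorRate() < baseline.peakErrorRate();
        // Overhead must stay inside the explicit CPU and memory budget.
        boolean withinBudget = candidate.peakCpuUtil() <= budget.maxCpuUtil()
                && candidate.memoryGb() <= budget.maxMemoryGb();
        return tailImproved && errorsReduced && withinBudget;
    }

    public static void main(String[] args) {
        // Hypothetical numbers for a baseline and a candidate collector.
        Metrics baseline = new Metrics(38.0, 0.004, 0.55, 12.0);
        Metrics candidate = new Metrics(21.0, 0.001, 0.68, 15.0);
        Budget budget = new Budget(0.20, 0.75, 16.0);
        System.out.println("adopt candidate: " + accept(baseline, candidate, budget));
    }
}
```

All three conditions must hold together; a large tail win that blows the CPU budget is still a rejection under this reading.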

Phase D — Guardrailed rollout

Canary by shard/tenant/region
Auto-rollback on tail/error regression
Keep old profile as one-command rollback preset
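
The auto-rollback condition can be sketched as a small state machine. This is a minimal illustration, assuming the canary emits one p99.9 value per observation window; the class name RollbackTrigger and the "N consecutive breaches" policy are hypothetical choices, not a prescribed method.

```java
/** Sketch: fire a rollback after N consecutive observation windows breach the p99.9 limit. */
final class RollbackTrigger {
    private final double p999LimitMs;
    private final int requiredBreaches;
    private int consecutiveBreaches;

    RollbackTrigger(double p999LimitMs, int requiredBreaches) {
        if (requiredBreaches < 1) {
            throw new IllegalArgumentException("requiredBreaches must be >= 1");
        }
        this.p999LimitMs = p999LimitMs;
        this.requiredBreaches = requiredBreaches;
    }

    /** Feeds one window's p99.9; returns true when the rollback should fire. */
    boolean observe(double windowP999Ms) {
        // A window at or below the limit resets the streak.
        consecutiveBreaches = windowP999Ms > p999LimitMs ? consecutiveBreaches + 1 : 0;
        return consecutiveBreaches >= requiredBreaches;
    }
}
```

Requiring several consecutive breaches trades a slower reaction for fewer rollbacks caused by a single noisy window; the right count depends on window length and how costly a false rollback is.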

6) Starter configs (minimal, not over-tuned)

These are starting points, not final truth.

G1 baseline

-XX:+UseG1GC
-XX:MaxGCPauseMillis=100   # choose from your SLA budget, not folklore
-Xlog:gc*:file=gc.log:time,level,tags

ZGC baseline (JDK 21+)

-XX:+UseZGC
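-XX:+ZGenerational   # needed on JDK 21 and 22 only; generational mode is the default from JDK 23, and JDK 24 removed the non-generational mode, so drop this flag there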
-Xlog:gc*:file=gc.log:time,level,tags

Shenandoah baseline

-XX:+UseShenandoahGC
-Xlog:gc*:file=gc.log:time,level,tags

What to avoid initially:

Copy-pasting huge flag bundles from blogs
Mixing many tuning knobs before you establish a baseline delta
Comparing collectors across different JDK builds or different heap sizes
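
Before comparing numbers, it helps to confirm which collector and flags each A/B arm is actually running, since a wrapper script or container default can silently override them. The sketch below uses the standard java.lang.management beans; the class name ActiveGcReport is hypothetical, and the bean names it prints vary by collector and JDK version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Sketch: report the JVM arguments and the GC beans the running JVM exposes. */
final class ActiveGcReport {

    static String describe() {
        StringBuilder sb = new StringBuilder();
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        sb.append("JVM args: ").append(jvmArgs).append(System.lineSeparator());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Counts and times are cumulative since JVM start; what a bean measures
            // (pauses vs. whole concurrent cycles) depends on the collector.
            sb.append(gc.getName())
              .append(" collections=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this once at startup in both arms gives a cheap audit trail that the only difference between them is the GC mode.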

7) Common failure patterns (and what they usually mean)

Pattern A: p99.9 spikes survive after switching to low-pause GC

Likely causes:

Non-GC pauses (I/O stalls, safepoint storms, lock contention)
Allocation burst still exceeds concurrent reclaim capacity

Action:

Correlate latency-spike timestamps with GC logs, safepoint logs (-Xlog:safepoint), thread dumps, and a kernel scheduler view
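
The correlation step can be sketched as an interval check: any spike that does not land near a logged pause needs a non-GC explanation. The names below (SpikeCorrelation, Pause, slackMs) are hypothetical, and the sketch assumes you have already extracted spike timestamps and pause intervals onto the same clock.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: separate latency spikes that coincide with a logged pause from those that do not. */
final class SpikeCorrelation {

    /** A pause taken from GC or safepoint logs: start timestamp and duration, in milliseconds. */
    record Pause(long startMs, long durationMs) {}

    /** Returns spikes that are not within slackMs of any pause. */
    static List<Long> unexplainedSpikes(List<Long> spikeTimesMs, List<Pause> pauses, long slackMs) {
        List<Long> unexplained = new ArrayList<>();
        for (long spike : spikeTimesMs) {
            boolean nearPause = false;
            for (Pause p : pauses) {
                long windowStart = p.startMs() - slackMs;
                long windowEnd = p.startMs() + p.durationMs() + slackMs;
                if (spike >= windowStart && spike <= windowEnd) {
                    nearPause = true;
                    break;
                }
            }
            if (!nearPause) {
                unexplained.add(spike);
            }
        }
        return unexplained;
    }
}
```

If most spikes end up in the unexplained list, switching collectors again is unlikely to help, and the thread-dump and scheduler views deserve the attention instead.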

Pattern B: GC pauses improved, but throughput fell

Likely causes:

Concurrent GC CPU overhead stealing cycles from request handling

Action:

Increase CPU headroom and/or reduce allocation churn
Re-evaluate whether SLA needs p99.9 strictness for this service tier
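
To find where the churn comes from before trying to reduce it, per-thread allocation counters can be read around a code path. This is a sketch only: it relies on the HotSpot-specific com.sun.management.ThreadMXBean extension, which may be unavailable on other JVMs (the method then returns -1), and the class name AllocationProbe is hypothetical.

```java
import java.lang.management.ManagementFactory;

/** Sketch: measure bytes allocated by the current thread while running a task. */
final class AllocationProbe {

    /** Returns the allocated-bytes delta for the task, or -1 if the JVM does not support it. */
    static long allocatedBytes(Runnable task) {
        java.lang.management.ThreadMXBean standard = ManagementFactory.getThreadMXBean();
        if (!(standard instanceof com.sun.management.ThreadMXBean)) {
            return -1; // not a HotSpot-style JVM
        }
        com.sun.management.ThreadMXBean bean = (com.sun.management.ThreadMXBean) standard;
        if (!bean.isThreadAllocatedMemorySupported() || !bean.isThreadAllocatedMemoryEnabled()) {
            return -1;
        }
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        task.run();
        return bean.getThreadAllocatedBytes(threadId) - before;
    }

    public static void main(String[] args) {
        byte[][] sink = new byte[1][]; // keeps the allocation reachable
        long bytes = allocatedBytes(() -> sink[0] = new byte[1 << 20]);
        System.out.println("allocated bytes: " + bytes);
    }
}
```

The counter only covers the calling thread, so work handed off to pools or async callbacks needs to be measured on the threads that actually run it.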

Pattern C: Memory usage increased after migration

Likely causes:

More headroom needed for stable concurrent collection behavior

Action:

Treat memory headroom as latency insurance cost; budget explicitly

Pattern D: "No win" regardless of collector

Likely causes:

Object churn architecture issue (serialization storms, temporary objects)

Action:

Fix allocation topology first (pooling where safe, object reuse, data-path redesign)
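
As one small example of object reuse on a hot path, the sketch below keeps a per-thread StringBuilder instead of creating a new one per message. The class, method, and message format are hypothetical, and the JIT may already remove some temporary allocations on its own, so measure before and after rather than assuming a win.

```java
/** Sketch: reuse a per-thread buffer instead of allocating a new builder per call. */
final class ReusableBuffer {

    // One builder per thread, so no synchronization is needed.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    static String formatOrder(String symbol, long quantity, long priceTicks) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reset contents while keeping the backing array
        sb.append(symbol).append('|').append(quantity).append('|').append(priceTicks);
        return sb.toString(); // the result String is still a fresh allocation
    }

    public static void main(String[] args) {
        System.out.println(formatOrder("ABC", 10, 12345)); // prints ABC|10|12345
    }
}
```

This kind of reuse is only safe when the buffer never escapes the call; anything that stores or shares the builder would reintroduce the bugs pooling is known for.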

8) Practical SLO budgeting for GC

Use a latency budget sheet like:

End-to-end p99.9 target: 40 ms
Network + kernel + queueing: 15 ms
App logic: 18 ms
GC + runtime jitter budget: 7 ms

If observed GC and runtime jitter repeatedly exceeds 7 ms at p99.9, collector choice is a legitimate lever. If it is already within budget, optimize elsewhere first.
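
The budget sheet is simple subtraction, sketched below with the example numbers above. Percentiles of separate components do not strictly add, so treat this as a planning approximation; the class and method names are hypothetical.

```java
/** Sketch: derive the GC + runtime jitter budget and test an observation against it. */
final class GcBudget {

    /** Budget left for GC and runtime jitter once the other components are subtracted. */
    static double gcBudgetMs(double targetP999Ms, double networkKernelQueueMs, double appLogicMs) {
        return targetP999Ms - networkKernelQueueMs - appLogicMs;
    }

    /** True when observed GC/runtime jitter at p99.9 exceeds the budget. */
    static boolean collectorIsALever(double observedJitterP999Ms, double budgetMs) {
        return observedJitterP999Ms > budgetMs;
    }

    public static void main(String[] args) {
        double budget = gcBudgetMs(40.0, 15.0, 18.0); // 40 - 15 - 18 = 7 ms
        System.out.println("GC budget ms: " + budget);
        System.out.println("collector is a lever at 9.5 ms: " + collectorIsALever(9.5, budget));
        System.out.println("collector is a lever at 4.0 ms: " + collectorIsALever(4.0, budget));
    }
}
```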

9) Recommendation template (copy/paste)

For each service, write this before rollout:

Workload profile: (allocation rate, heap live set, burstiness)
Current pain: (p99.9 spikes? CPU? memory?)
Candidate collector: (G1 / ZGC / Shenandoah)
Success gate: (exact p99.9 + error + CPU/memory thresholds)
Rollback trigger: (what metric crosses what line)
Owner + review date: (avoid zombie tuning)

This keeps GC decisions operational, not ideological.

10) Bottom line

G1 remains the safest broad default.
ZGC (Generational) is often the first serious option when tail-latency incidents dominate.
Shenandoah is strong where platform support and org familiarity are already in place.
The winning strategy is always: measure → controlled A/B → tail-first acceptance → guardrailed rollout.

References

OpenJDK JEP 439: Generational ZGC — https://openjdk.org/jeps/439
OpenJDK ZGC Wiki — https://wiki.openjdk.org/spaces/zgc/pages/34668579/Main
OpenJDK Shenandoah Wiki — https://wiki.openjdk.org/display/shenandoah/Main
Oracle G1 GC Tuning Guide (legacy but useful conceptual baseline) — https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html