JVM GC Selection for Low-Latency Services: G1 vs ZGC vs Shenandoah (Practical Playbook)
Date: 2026-03-14
Category: knowledge
Audience: Java backend / execution infra engineers running latency-sensitive services
1) Why this matters
In low-latency systems, GC is usually not the average-latency problem; it is the tail-latency amplifier.
If your p99.9 SLA is tight, one surprise pause can consume most of your latency budget, trigger retries, and start a feedback loop (queue growth → more allocations → more GC pressure).
The practical question is not "Which collector is best in general?" but:
Which collector gives the best tail-latency reliability per CPU and memory dollar for this workload?
2) Mental model: each collector optimizes a different frontier
G1 (default in most modern JDKs)
- Goal: balanced throughput + reasonable pauses
- Works well for many general server workloads
- Pause goals are targets, not hard real-time guarantees
- Usually best first baseline when latency constraints are moderate
ZGC (especially Generational ZGC in JDK 21+)
- Goal: extremely low pauses with concurrent GC work
- Strong fit when pause outliers dominate incidents
- Trades additional concurrent CPU/memory overhead for pause stability
- Particularly attractive for large heaps and strict tail SLAs
Shenandoah
- Goal: low-pause operation via concurrent compaction
- Similar strategic niche to ZGC (low-latency first)
- Viability depends on your JDK/vendor/runtime support matrix
3) Quick decision table
| Situation | Start with | Why |
|---|---|---|
| p99 SLA is moderate, throughput matters most | G1 | Lowest migration risk; strong general-purpose behavior |
| p99.9/p99.99 is business-critical and pause spikes hurt | ZGC (Gen ZGC on JDK 21+) | Better tail behavior under allocation churn |
| Need low pauses and your platform already standardizes Shenandoah | Shenandoah | Good low-pause option if support/tooling are mature in your org |
| Heap > several GB and traffic burstiness causes long pause outliers | ZGC or Shenandoah trial | Concurrent collectors handle burst pressure more gracefully |
Rule of thumb:
- If your postmortems say "rare pause spike killed us" → evaluate ZGC/Shenandoah first.
- If your postmortems say "CPU saturation and throughput collapse" → optimize allocation rate and G1 tuning first.
4) What to measure before touching flags
Do not start with random JVM options. Start with a measurement pack:
Latency SLO view
- p50 / p95 / p99 / p99.9 / p99.99
- Error-rate during GC-heavy windows
Allocation pressure
- MB/s allocated by endpoint or code path
- Object lifetime shape (short-lived burst vs long-lived cache)
GC log signals (-Xlog:gc*)
- Pause distribution (not just average)
- Concurrent cycle frequency
- Allocation stalls / to-space exhaustion indicators
System coupling signals
- CPU headroom during peak
- Queue depth / backpressure behavior
- Retry burst amplification
Without this, collector changes become superstition.
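To make "pause distribution, not just average" concrete, here is a minimal sketch that pulls pause durations out of unified-logging lines and reports nearest-rank percentiles. The class name `GcPauseStats` and the regular expression are illustrative assumptions: the pattern expects pause lines to contain the word "Pause" and end in `<millis>ms`, which matches typical `-Xlog:gc*` output but should be checked against your own logs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract pause durations from -Xlog:gc* lines and compute tail percentiles. */
final class GcPauseStats {
    // Assumption: a completed pause line contains "Pause" and ends with "<millis>ms",
    // e.g. "... GC(7) Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 2.345ms".
    private static final Pattern PAUSE_LINE =
            Pattern.compile("\\bPause\\b.*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Returns pause durations in milliseconds, in log order; other lines are skipped. */
    static List<Double> extractPausesMs(List<String> logLines) {
        List<Double> pauses = new ArrayList<Double>();
        for (String line : logLines) {
            Matcher m = PAUSE_LINE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        return pauses;
    }

    /** Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it. */
    static double percentile(List<Double> samples, double pct) {
        if (samples.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = new double[samples.size()];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = samples.get(i);
        }
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Feeding it the lines of `gc.log` and asking for `percentile(pauses, 99.9)` alongside the mean shows quickly whether the tail, rather than the average, is the problem.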
5) Safe rollout sequence (works in production)
Phase A: Baseline hardening
- Pin JDK version first (no moving target)
- Fix obvious allocation leaks/churn hotspots
- Lock a reproducible load profile (same traffic replay window)
Phase B: Controlled A/B
- Keep app code identical
- Change only GC mode + minimal required flags
- Compare for at least one full traffic cycle (including peak)
Phase C: Tail-first acceptance gate
Adopt candidate only if:
- p99.9 improves materially
- Error bursts during peak are reduced
- CPU/memory overhead remains within explicit budget
Phase D: Guardrailed rollout
- Canary by shard/tenant/region
- Auto-rollback on tail/error regression
- Keep old profile as one-command rollback preset
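The Phase C gate can be written down as an explicit check so that "improves materially" and "within budget" are numbers rather than opinions. The sketch below is one possible shape, not a prescribed implementation: `TailFirstGate`, `RunStats`, and every threshold parameter are hypothetical names, and treating "errors reduced" as "no worse than baseline" is an assumption made so that a baseline with zero errors can still pass.

```java
/** Sketch of a tail-first acceptance gate; names and thresholds are illustrative. */
final class TailFirstGate {
    /** One measurement window for a collector configuration (hypothetical shape). */
    static final class RunStats {
        final double p999Ms;        // p99.9 latency in milliseconds
        final double peakErrorRate; // errors / requests during the peak window
        final double cpuUtil;       // mean CPU utilisation at peak, 0.0 to 1.0
        final double memoryGb;      // process memory footprint in GB

        RunStats(double p999Ms, double peakErrorRate, double cpuUtil, double memoryGb) {
            this.p999Ms = p999Ms;
            this.peakErrorRate = peakErrorRate;
            this.cpuUtil = cpuUtil;
            this.memoryGb = memoryGb;
        }
    }

    /**
     * Accept the candidate only if p99.9 improves by at least minP999Gain (e.g. 0.10 = 10%),
     * peak errors are no worse, and CPU and memory stay inside the explicit budgets.
     */
    static boolean accept(RunStats baseline, RunStats candidate,
                          double minP999Gain, double cpuBudget, double memoryBudgetGb) {
        boolean tailImproved = candidate.p999Ms <= baseline.p999Ms * (1.0 - minP999Gain);
        boolean errorsNoWorse = candidate.peakErrorRate <= baseline.peakErrorRate;
        boolean withinBudget = candidate.cpuUtil <= cpuBudget
                && candidate.memoryGb <= memoryBudgetGb;
        return tailImproved && errorsNoWorse && withinBudget;
    }
}
```

The same predicate, evaluated continuously against canary metrics, can double as the auto-rollback trigger in Phase D.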
6) Starter configs (minimal, not over-tuned)
These are starting points, not final truth.
G1 baseline
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100 # choose from your SLA budget, not folklore
-Xlog:gc*:file=gc.log:time,level,tags
ZGC baseline (JDK 21+)
-XX:+UseZGC
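-XX:+ZGenerational # needed on JDK 21/22 only; generational mode is the default from JDK 23 and the flag is obsolete from JDK 24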
-Xlog:gc*:file=gc.log:time,level,tags
Shenandoah baseline
-XX:+UseShenandoahGC
-Xlog:gc*:file=gc.log:time,level,tags
What to avoid initially:
- Copy-pasting huge flag bundles from blogs
- Mixing many tuning knobs before you establish a baseline delta
- Comparing collectors across different JDK builds or different heap sizes
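Before comparing runs, it can help to confirm that the intended collector is actually active, since flags set in a launcher script or environment variable do not always reach the process. The sketch below reads the GC MXBean names through the standard `java.lang.management` API and maps them to a collector family. The name prefixes ("G1", "ZGC", "Shenandoah") reflect what HotSpot builds commonly report and are an assumption to verify on your JDK; `ActiveCollector` is a hypothetical helper name.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Sketch: report which collector family the running JVM is using. */
final class ActiveCollector {
    /** Maps GC MXBean names to a collector family by prefix (assumed naming convention). */
    static String classify(List<String> beanNames) {
        for (String name : beanNames) {
            if (name.startsWith("G1")) return "G1";
            if (name.startsWith("ZGC")) return "ZGC";
            if (name.startsWith("Shenandoah")) return "Shenandoah";
        }
        return "other";
    }

    /** Names of the GC MXBeans registered in this JVM. */
    static List<String> currentBeanNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = currentBeanNames();
        System.out.println("GC beans: " + names + " -> " + classify(names));
    }
}
```

Logging this once at startup in both arms of the A/B makes it harder to compare two runs that silently used the same collector.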
7) Common failure patterns (and what they usually mean)
Pattern A: p99.9 spikes survive after switching to low-pause GC
Likely causes:
- Non-GC pauses (I/O stalls, safepoint storms, lock contention)
- Allocation burst still exceeds concurrent reclaim capacity
Action:
- Correlate latency spikes with GC + thread dump + kernel scheduler view
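A first-pass version of that correlation is to ask what fraction of latency spikes coincide with a GC pause at all. The sketch below assumes you already have spike timestamps and pause windows as epoch milliseconds on the same clock; `SpikeCorrelation`, its parameters, and the slack value are hypothetical, and it covers only the GC side (thread dumps and scheduler data need their own tooling).

```java
/** Sketch: how many latency spikes overlap a GC pause window? */
final class SpikeCorrelation {
    /**
     * Returns the fraction of spikes whose timestamp falls inside any pause window,
     * widened by slackMs on both sides to absorb clock and logging skew.
     * Each pause window is {startMs, endMs}.
     */
    static double fractionExplainedByGc(long[] spikeTimesMs, long[][] pauseWindowsMs, long slackMs) {
        if (spikeTimesMs.length == 0) {
            return 0.0;
        }
        int explained = 0;
        for (long spike : spikeTimesMs) {
            for (long[] window : pauseWindowsMs) {
                if (spike >= window[0] - slackMs && spike <= window[1] + slackMs) {
                    explained++;
                    break; // count each spike at most once
                }
            }
        }
        return (double) explained / spikeTimesMs.length;
    }
}
```

A low fraction suggests the remaining spikes come from non-GC sources, which is exactly the case where switching collectors will not help.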
Pattern B: GC pauses improved, but throughput fell
Likely causes:
- Concurrent GC CPU overhead stealing cycles from request handling
Action:
- Increase CPU headroom and/or reduce allocation churn
- Re-evaluate whether the SLA really needs p99.9 strictness for this service tier
Pattern C: Memory usage increased after migration
Likely causes:
- More headroom needed for stable concurrent collection behavior
Action:
- Treat memory headroom as latency insurance cost; budget explicitly
Pattern D: "No win" regardless of collector
Likely causes:
- Architectural object churn (serialization storms, excess temporary objects)
Action:
- Fix allocation topology first (pooling where safe, object reuse, data-path redesign)
8) Practical SLO budgeting for GC
Use a latency budget sheet like:
- End-to-end p99.9 target: 40 ms
- Network + kernel + queueing: 15 ms
- App logic: 18 ms
- GC + runtime jitter budget: 7 ms
If observed GC and runtime pauses repeatedly exceed 7 ms at p99.9, collector choice is a legitimate lever. If they are already within budget, optimize elsewhere first.
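The budget sheet is simple arithmetic, and encoding it keeps the numbers honest when one component changes. This is a minimal sketch using the figures above; `LatencyBudget` and its method names are hypothetical.

```java
/** Sketch: derive the GC/runtime jitter budget and test whether GC exceeds it. */
final class LatencyBudget {
    /** What remains of the end-to-end target after network/kernel/queueing and app logic. */
    static double gcBudgetMs(double targetMs, double networkMs, double appMs) {
        return targetMs - networkMs - appMs;
    }

    /** Collector choice is worth investigating only when observed GC time exceeds the budget. */
    static boolean collectorIsLever(double observedGcP999Ms, double gcBudgetMs) {
        return observedGcP999Ms > gcBudgetMs;
    }
}
```

With the sheet above, `gcBudgetMs(40, 15, 18)` yields 7 ms, so an observed 9.5 ms at p99.9 would flag the collector as a lever while 3 ms would not.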
9) Recommendation template (copy/paste)
For each service, write this before rollout:
- Workload profile: (allocation rate, heap live set, burstiness)
- Current pain: (p99.9 spikes? CPU? memory?)
- Candidate collector: (G1 / ZGC / Shenandoah)
- Success gate: (exact p99.9 + error + CPU/memory thresholds)
- Rollback trigger: (what metric crosses what line)
- Owner + review date: (avoid zombie tuning)
This keeps GC decisions operational, not ideological.
10) Bottom line
- G1 remains the safest broad default.
- ZGC (Generational) is often the first serious option when tail-latency incidents dominate.
- Shenandoah is strong where platform support and org familiarity are already in place.
- The winning strategy is always: measure → controlled A/B → tail-first acceptance → guardrailed rollout.
References
- OpenJDK JEP 439, Generational ZGC: https://openjdk.org/jeps/439
- OpenJDK ZGC Wiki: https://wiki.openjdk.org/spaces/zgc/pages/34668579/Main
- OpenJDK Shenandoah Wiki: https://wiki.openjdk.org/display/shenandoah/Main
- Oracle G1 GC Tuning Guide (legacy but useful conceptual baseline): https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html