JVM GC Selection for Low-Latency Services: G1 vs ZGC vs Shenandoah (Practical Playbook)

2026-03-14 · software

Date: 2026-03-14
Category: knowledge
Audience: Java backend / execution infra engineers running latency-sensitive services


1) Why this matters

In low-latency systems, GC is rarely a problem for average latency; it is the tail-latency amplifier.

If your p99.9 SLA is tight, one surprise pause can consume most of your latency budget, trigger retries, and start a feedback loop (queue growth → more allocations → more GC pressure).

The practical question is not "Which collector is best in general?" but:

Which collector gives the best tail-latency reliability per CPU and memory dollar for this workload?


2) Mental model: each collector optimizes a different frontier

G1 (default in most modern JDKs)

ZGC (especially Generational ZGC in JDK 21+)

Shenandoah


3) Quick decision table

Situation | Start with | Why
p99 SLA is moderate, throughput matters most | G1 | Lowest migration risk; strong general-purpose behavior
p99.9/p99.99 is business-critical and pause spikes hurt | ZGC (Gen ZGC on JDK 21+) | Better tail behavior under allocation churn
Need low pauses and your platform already standardizes on Shenandoah | Shenandoah | Good low-pause option if support/tooling are mature in your org
Heap > several GB and traffic burstiness causes long pause outliers | ZGC or Shenandoah trial | Concurrent collectors handle burst pressure more gracefully

Rule of thumb:


4) What to measure before touching flags

Do not start with random JVM options. Start with a measurement pack:

  1. Latency SLO view

    • p50 / p95 / p99 / p99.9 / p99.99
    • Error-rate during GC-heavy windows
  2. Allocation pressure

    • MB/s allocated by endpoint or code path
    • Object lifetime shape (short-lived burst vs long-lived cache)
  3. GC log signals (-Xlog:gc*)

    • Pause distribution (not just average)
    • Concurrent cycle frequency
    • Allocation stalls / to-space exhaustion indicators
  4. System coupling signals

    • CPU headroom during peak
    • Queue depth / backpressure behavior
    • Retry burst amplification

Without this, collector changes become superstition.
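
As one way to get the "allocated by endpoint or code path" number from the list above, the sketch below measures how many bytes the calling thread allocates while running a task. It assumes a HotSpot-based JDK, where the platform thread bean can be cast to com.sun.management.ThreadMXBean; the class name AllocationProbe and the sample workload are hypothetical.

```java
import java.lang.management.ManagementFactory;

final class AllocationProbe {
    // Assumption: HotSpot-based JDK. On other JVMs this cast may fail,
    // because com.sun.management.ThreadMXBean is a platform extension.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Runs the task on the calling thread and returns the bytes that thread allocated meanwhile. */
    static long allocatedBytes(Runnable task) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        task.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        // Hypothetical workload: 1024 short-lived 1 KiB buffers.
        long bytes = allocatedBytes(() -> {
            byte[][] chunks = new byte[1024][];
            for (int i = 0; i < chunks.length; i++) {
                chunks[i] = new byte[1024];
            }
        });
        System.out.println("allocated ~" + (bytes / 1024) + " KiB");
    }
}
```

Dividing the result by elapsed wall time (for example from System.nanoTime()) gives an approximate MB/s figure. The probe only sees allocations made on the calling thread, so work handed off to other threads or pools would need its own measurement.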


5) Safe rollout sequence (works in production)

Phase A: Baseline hardening

Phase B: Controlled A/B

Phase C: Tail-first acceptance gate

Adopt candidate only if:

Phase D: Guardrailed rollout


6) Starter configs (minimal, not over-tuned)

These are starting points, not final truth.

G1 baseline

-XX:+UseG1GC
-XX:MaxGCPauseMillis=100   # choose from your SLA budget, not folklore
-Xlog:gc*:file=gc.log:time,level,tags

ZGC baseline (JDK 21+)

-XX:+UseZGC
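-XX:+ZGenerational   # needed on JDK 21-22; generational mode is the default from JDK 23 and the flag is obsolete from JDK 24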
-Xlog:gc*:file=gc.log:time,level,tags

Shenandoah baseline

-XX:+UseShenandoahGC
-Xlog:gc*:file=gc.log:time,level,tags
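
Before comparing results, it can be worth confirming that the flags above actually selected the collector you intended. The sketch below lists the GC beans the running JVM exposes through the standard java.lang.management API and maps each name to a collector family by prefix. The prefix mapping is an assumption based on bean names commonly seen on HotSpot builds, and the class name ActiveCollector is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

final class ActiveCollector {
    /** Names of the collector beans exposed by the running JVM. */
    static List<String> beanNames() {
        return ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
    }

    /**
     * Maps a bean name to a collector family by prefix.
     * Assumption: names start with "G1", "ZGC", or "Shenandoah" on HotSpot builds.
     */
    static String family(String beanName) {
        if (beanName.startsWith("G1")) return "G1";
        if (beanName.startsWith("ZGC")) return "ZGC";
        if (beanName.startsWith("Shenandoah")) return "Shenandoah";
        return "other";
    }

    public static void main(String[] args) {
        for (String name : beanNames()) {
            System.out.println(family(name) + " <- " + name);
        }
    }
}
```

Logging this once at startup makes it harder for an A/B comparison to silently run both arms on the same collector, for example when a container entrypoint overrides the JVM options.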

What to avoid initially:


7) Common failure patterns (and what they usually mean)

Pattern A: p99.9 spikes survive after switching to low-pause GC

Likely causes:

Action:

Pattern B: GC pauses improved, but throughput fell

Likely causes:

Action:

Pattern C: Memory usage increased after migration

Likely causes:

Action:

Pattern D: "No win" regardless of collector

Likely causes:

Action:


8) Practical SLO budgeting for GC

Use a latency budget sheet like:

If observed GC/runtime pauses repeatedly consume more than 7 ms at p99.9, collector choice is a legitimate lever. If GC is already below budget, optimize elsewhere first.
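
The 7 ms check can be approximated directly from a GC log written with the -Xlog:gc* settings used in this playbook. The sketch below extracts pause durations from log lines, takes a nearest-rank p99.9, and compares it with a budget. The class name GcBudgetCheck and the sample log lines are illustrative, not real output, and the regex assumes pause lines contain the word "Pause" and end in a millisecond duration. It measures the pause distribution itself, which is only a proxy for what requests experience.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcBudgetCheck {
    // Assumption: pause lines look like "... Pause Young (...) 123M->45M(512M) 7.891ms".
    private static final Pattern PAUSE =
            Pattern.compile("\\bPause\\b.*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations in milliseconds, skipping lines that do not match. */
    static double[] pauseMillis(List<String> logLines) {
        List<Double> out = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                out.add(Double.parseDouble(m.group(1)));
            }
        }
        return out.stream().mapToDouble(Double::doubleValue).toArray();
    }

    /** Nearest-rank percentile for q in (0, 100]. */
    static double percentile(double[] values, double q) {
        if (values.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the p99.9 pause exceeds the budget. */
    static boolean overBudget(double[] pauses, double budgetMs) {
        return percentile(pauses, 99.9) > budgetMs;
    }

    public static void main(String[] args) {
        // Illustrative lines only; real logs carry whatever decorations you configured.
        List<String> lines = Arrays.asList(
                "[info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 123M->45M(512M) 3.250ms",
                "[info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 200M->50M(512M) 9.500ms");
        double[] pauses = pauseMillis(lines);
        System.out.println("over 7 ms budget: " + overBudget(pauses, 7.0));
    }
}
```

With only a handful of samples, nearest-rank p99.9 collapses to the maximum, so a check like this is only meaningful over a window containing many pauses.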


9) Recommendation template (copy/paste)

For each service, write this before rollout:

This keeps GC decisions operational, not ideological.


10) Bottom line

