Kubernetes Autoscaling Stack Design Playbook (HPA, VPA, KEDA, Node Autoscaler)
Date: 2026-03-24
Category: knowledge
Scope: Practical system design and operations guide for combining workload autoscaling (HPA/VPA/KEDA) with node autoscaling in production.
1) Why teams struggle with autoscaling
Most production incidents around autoscaling are not caused by a single broken controller. They come from control-loop interaction:
- HPA reacts to demand by increasing pod replicas.
- VPA changes per-pod CPU/memory requests.
- KEDA may wake workloads from 0 replicas based on external triggers.
- Node autoscaler provisions capacity when pods become unschedulable.
If these loops are tuned independently, you get oscillation, delayed recovery, or unnecessary cost.
2) Mental model: four loops, different time constants
Think of autoscaling as layered loops:
- Fast loop (seconds): HPA reconcile interval (default ~15s in controller-manager docs) adjusts replica count.
- Event loop (seconds to minutes): KEDA polls trigger sources (default 30s) and controls scale-to-zero / wake-up behavior.
- Right-sizing loop (minutes to hours): VPA recommender/update cycle adjusts requests/limits.
- Capacity loop (minutes): Node autoscaler provisions/consolidates nodes based on schedulability and requested resources.
Design rule: faster loops should absorb burst; slower loops should optimize efficiency.
3) What each autoscaler is best at
3.1 HPA (Horizontal Pod Autoscaler)
Best for:
- request-driven, latency-sensitive services,
- burst handling via replica expansion,
- multi-metric policies (resource/custom/external).
Key behavior notes from Kubernetes docs:
- Scaling formula is ratio-based around current vs desired metric.
- Default tolerance is 0.1 around the target to avoid noisy scaling.
- Downscale stabilization exists to smooth rapid metric swings.
- Missing/unready pod metrics are handled conservatively.
Operational implication: HPA is your primary burst absorber.
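The ratio-based formula and tolerance band can be sketched numerically. A minimal sketch, assuming the documented semantics (the function name and sample values are illustrative, not a real API):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Approximate the HPA desired-replica calculation."""
    ratio = current_metric / target_metric
    # Within the default 0.1 tolerance band, HPA makes no change.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 60))  # ratio 1.5 -> scale to 6
print(desired_replicas(4, 63, 60))  # ratio 1.05, inside tolerance -> stay at 4
```

The tolerance band is what keeps HPA from churning replicas on small metric noise; the ceil keeps scaling conservative upward.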
3.2 VPA (Vertical Pod Autoscaler)
Best for:
- right-sizing under/over-requested workloads,
- reducing waste and improving bin-packing over time,
- giving a recommendation baseline even in Off mode.
Important constraints:
- VPA is a CRD/add-on, not core like HPA.
- Update modes (Off, Initial, Recreate, InPlaceOrRecreate) materially affect the disruption profile.
- VPA project docs state it should not be used with HPA on the same resource metric (CPU/memory).
Operational implication: start with Off mode for learning, then selectively enable enforcement.
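A starting-point manifest for the learning phase might look like the following. The target name and bounds are illustrative assumptions, not recommendations; adjust to your workload.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api            # illustrative target
  updatePolicy:
    updateMode: "Off"        # recommend only, no enforcement
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:          # bounds guard against pathological recommendations
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 8Gi
```

With updateMode Off, recommendations appear in the VPA status for comparison against actual usage, and nothing is evicted or resized.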
3.3 KEDA
Best for:
- external/event sources (queue lag, stream depth, cloud metrics),
- scale-to-zero workers,
- trigger-specific semantics via ScaledObject.
Key defaults from KEDA docs:
- pollingInterval: 30s
- cooldownPeriod: 300s
- optional fallback replicas when the trigger backend errors repeatedly.
Operational implication: KEDA is the best bridge from external backlog → HPA-compatible scaling.
3.4 Node autoscaler (Cluster Autoscaler / Karpenter class)
Best for:
- making pending pods schedulable,
- consolidating underutilized nodes to reduce cost.
Kubernetes node autoscaling concepts now frame this as:
- Provisioning (formerly scale-up)
- Consolidation (formerly scale-down)
Operational implication: node autoscaling only sees requests/scheduling constraints, not true runtime usage.
4) Decision matrix (quick selection)
- HTTP API, strict p95/p99 SLO, bursty traffic
→ HPA on CPU plus a request-rate/latency proxy metric, VPA in Off mode, node autoscaler enabled.
- Queue workers with idle periods
→ KEDA ScaledObject + HPA behavior tuning + node autoscaler; optionally VPA for memory right-sizing.
- Steady-state services with chronic over-requesting
→ VPA recommendations first; adopt Initial or Recreate where the disruption budget allows.
- Batch/cron jobs with wide size variance
→ prefer good requests and a node-autoscaling capacity strategy; use VPA carefully (mostly recommender value, less frequent enforcement).
5) Golden compatibility rules
- Do not run HPA and VPA on the same CPU/memory metric target.
If needed, use separation (for example HPA on an external/custom metric, VPA on memory).
- Replica control belongs to one owner at a time.
If KEDA is wrapping HPA behavior, treat KEDA plus its generated HPA as the replica authority.
- Right-size before over-optimizing node policy.
Bad pod requests poison both HPA signal quality and node consolidation quality.
- Stabilization beats reactivity for cost-sensitive systems.
Avoid scaling on every transient spike.
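The first two rules above lend themselves to a preflight check in CI. A hedged sketch — the helper and its inputs are hypothetical, not a real Kubernetes API:

```python
def conflicting_metrics(hpa_resource_metrics, vpa_controlled_resources):
    """Return resources targeted by both HPA and VPA, which the
    golden rule above forbids (e.g. both acting on 'cpu')."""
    overlap = set(hpa_resource_metrics) & set(vpa_controlled_resources)
    return sorted(overlap)

# HPA scales on CPU while VPA controls cpu+memory: conflict on 'cpu'.
print(conflicting_metrics({"cpu"}, {"cpu", "memory"}))            # ['cpu']
# HPA scales on an external queue metric, VPA right-sizes memory: safe.
print(conflicting_metrics({"external:queue_depth"}, {"memory"}))  # []
```

In practice the inputs would be extracted from the HPA's metrics list and the VPA's controlledResources field before a deploy is admitted.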
6) Baseline tuning templates
6.1 HPA behavior (safe default)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api              # illustrative name
spec:
  scaleTargetRef:            # illustrative target, required for a valid spec
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 60
  metrics:                   # illustrative metric; substitute your own
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 6
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
      selectPolicy: Min
Pattern: fast up, slow down.
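To see how the scaleUp policies combine, a small sketch assuming the documented semantics: each policy bounds growth per period, and selectPolicy Max takes the more permissive allowance (function name and values are illustrative):

```python
import math

def scale_up_limit(current, percent, pods):
    """Max replicas reachable in one period under a Percent and a
    Pods policy combined with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))  # Percent policy
    by_pods = current + pods                               # Pods policy
    return max(by_percent, by_pods)                        # Max = more permissive

# At low replica counts the Pods policy dominates (3 -> 9 in one period);
# at higher counts the Percent policy dominates (40 -> 80).
print(scale_up_limit(3, 100, 6))
print(scale_up_limit(40, 100, 6))
```

This is why pairing a Pods policy with a Percent policy matters: Percent alone is slow to escape a small replica count, and Pods alone caps growth at large counts.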
6.2 VPA adoption ladder
- Stage 1: updateMode: Off for 1–2 weeks (collect recommendations vs actual).
- Stage 2: Initial for newly created pods only.
- Stage 3: Recreate or InPlaceOrRecreate for selected workloads with a tolerant disruption budget.
Always set minAllowed / maxAllowed bounds to avoid pathological recommendations.
6.3 KEDA queue-worker baseline
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker          # illustrative name
spec:
  scaleTargetRef:
    name: queue-worker        # the Deployment to scale
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 6
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
  triggers:
    - type: rabbitmq          # illustrative trigger; substitute your event source
      metadata:
        queueName: work
        mode: QueueLength
        value: "20"
        hostFromEnv: RABBITMQ_HOST
Pattern: predictable wake-up, conservative drain-down, explicit degraded-mode fallback.
7) Failure modes you should expect
7.1 HPA says scale up, but pods stay Pending
Cause: node capacity loop lagging, or requests too large to fit node shapes.
Fix:
- verify unschedulable reasons (cpu/memory/affinity/taints/volumes),
- correct requests,
- ensure node autoscaler limits and instance shapes match workload envelope.
7.2 VPA recommendations cause unschedulable pods
Cause: recommended requests exceed largest allocatable node profile.
Fix:
- enforce maxAllowed caps,
- combine with a node autoscaler strategy that includes larger shapes,
- review multi-container total request inflation.
7.3 Queue scaler oscillates between 0 and N
Cause: short polling/cooldown against bursty trigger.
Fix:
- increase cooldown,
- set a non-zero minReplicaCount for hot paths,
- smooth the trigger metric at the source when possible.
7.4 Cost blowout from “always scaling out first”
Cause: aggressive scale-up + weak scale-down stabilization + over-requested pods.
Fix:
- improve pod request hygiene,
- use HPA downscale stabilization and stricter downscale policies,
- tune consolidation window on node autoscaler.
8) Production rollout sequence (recommended)
- Instrument first: queue depth, service saturation, pending pods, node churn, eviction counts.
- Enable/tune HPA on one tier with clear SLO and rollback threshold.
- Add node autoscaling guardrails (min/max, allowed instance classes, disruption policy).
- Introduce KEDA for event-driven workers where scale-to-zero gives clear value.
- Run VPA in Off mode and compare recommendations to SLO/cost outcomes.
- Promote VPA enforcement selectively with PDB-aware disruption windows.
One layer at a time. Never flip all controllers to “auto” in one change set.
9) What to monitor (minimum dashboard)
- HPA desired vs current replicas, scaling events, and stabilization effects.
- KEDA trigger value, scaler errors, fallback activation count.
- VPA recommendation deltas, evictions/restarts, in-place resize success/failure.
- Pending pods duration and unschedulable reasons.
- Node provisioning latency, consolidation events, node churn.
- Service SLOs (latency/error) correlated with scaling actions.
If you cannot explain a scaling decision from telemetry in under 5 minutes, observability is insufficient.
10) Bottom line
Treat autoscaling as a coordinated control system, not four separate features.
- HPA/KEDA handle demand response.
- VPA fixes request quality and long-horizon efficiency.
- Node autoscaler converts requests into actual capacity.
Most teams don’t need “more aggressive autoscaling.” They need clean ownership of each loop, bounded policies, and better request hygiene.
References
- Kubernetes HPA concepts: https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
- Kubernetes VPA concepts: https://kubernetes.io/docs/concepts/workloads/autoscaling/vertical-pod-autoscale/
- Kubernetes autoscaling overview: https://kubernetes.io/docs/concepts/workloads/autoscaling/
- Kubernetes node autoscaling (provisioning/consolidation): https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling/
- KEDA ScaledObject spec: https://keda.sh/docs/2.19/reference/scaledobject-spec/
- Cluster Autoscaler FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
- VPA known limitations (HPA compatibility note): https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/known-limitations.md