Kubernetes Autoscaling Stack Design Playbook (HPA, VPA, KEDA, Node Autoscaler)
Date: 2026-03-24
Category: knowledge
Scope: Practical system design and operations guide for combining workload autoscaling (HPA/VPA/KEDA) with node autoscaling in production.
1) Why teams struggle with autoscaling
Most production incidents around autoscaling are not caused by a single broken controller. They come from control-loop interaction:
- HPA reacts to demand by increasing pod replicas.
- VPA changes per-pod CPU/memory requests.
- KEDA may wake workloads from 0 replicas based on external triggers.
- Node autoscaler provisions capacity when pods become unschedulable.
If these loops are tuned independently, you get oscillation, delayed recovery, or unnecessary cost.
2) Mental model: four loops, different time constants
Think of autoscaling as layered loops:
- Fast loop (seconds): HPA reconcile interval (default ~15s in controller-manager docs) adjusts replica count.
- Event loop (seconds to minutes): KEDA polls trigger sources (default 30s) and controls scale-to-zero / wake-up behavior.
- Right-sizing loop (minutes to hours): VPA recommender/update cycle adjusts requests/limits.
- Capacity loop (minutes): Node autoscaler provisions/consolidates nodes based on schedulability and requested resources.
Design rule: faster loops should absorb burst; slower loops should optimize efficiency.
3) What each autoscaler is best at
3.1 HPA (Horizontal Pod Autoscaler)
Best for:
- request-driven, latency-sensitive services,
- burst handling via replica expansion,
- multi-metric policies (resource/custom/external).
Key behavior notes from Kubernetes docs:
- Scaling formula is ratio-based around current vs desired metric.
- Default tolerance is 0.1 around the target to avoid noisy scaling.
- Downscale stabilization exists to smooth rapid metric swings.
- Missing/unready pod metrics are handled conservatively.
Operational implication: HPA is your primary burst absorber.
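The ratio-based formula and tolerance band can be sketched numerically. A minimal sketch, assuming the documented semantics (the function name and sample values are illustrative, not a real API):

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Approximate the HPA desired-replica calculation."""
    ratio = current_metric / target_metric
    # Within the default 0.1 tolerance band, HPA makes no change.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(4, 90, 60))  # ratio 1.5 -> scale to 6
print(desired_replicas(4, 63, 60))  # ratio 1.05, inside tolerance -> stay at 4
```

The tolerance band is what keeps HPA from churning replicas on small metric noise; the ceil keeps scaling conservative upward.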
3.2 VPA (Vertical Pod Autoscaler)
Best for:
- right-sizing under/over-requested workloads,
- reducing waste and improving bin-packing over time,
- giving a recommendation baseline even in Off mode.
Important constraints:
- VPA is a CRD/add-on, not core like HPA.
- Update modes (Off, Initial, Recreate, InPlaceOrRecreate) materially affect the disruption profile.
- VPA project docs state it should not be used with HPA on the same resource metric (CPU/memory).
Operational implication: start with Off mode for learning, then selectively enable enforcement.
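A starting-point manifest for the learning phase might look like the following. The target name and bounds are illustrative assumptions, not recommendations; adjust to your workload.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa          # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api            # illustrative target
  updatePolicy:
    updateMode: "Off"        # recommend only, no enforcement
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:          # bounds guard against pathological recommendations
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "4"
          memory: 8Gi
```

With updateMode Off, recommendations appear in the VPA status for comparison against actual usage, and nothing is evicted or resized.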
3.3 KEDA
Best for:
- external/event sources (queue lag, stream depth, cloud metrics),
- scale-to-zero workers,
- trigger-specific semantics via ScaledObject.
Key defaults from KEDA docs:
- pollingInterval: 30s
- cooldownPeriod: 300s
- optional fallback replicas when the trigger backend errors repeatedly.
Operational implication: KEDA is the best bridge from external backlog → HPA-compatible scaling.
3.4 Node autoscaler (Cluster Autoscaler / Karpenter class)
Best for:
- making pending pods schedulable,
- consolidating underutilized nodes to reduce cost.
Kubernetes node autoscaling concepts now frame this as:
- Provisioning (formerly scale-up)
- Consolidation (formerly scale-down)
Operational implication: node autoscaling only sees requests/scheduling constraints, not true runtime usage.
4) Decision matrix (quick selection)
- HTTP API, strict p95/p99 SLO, bursty traffic
→ HPA on CPU plus a request-rate/latency proxy metric, VPA in Off mode, node autoscaler enabled.
- Queue workers with idle periods
→ KEDA ScaledObject + HPA behavior tuning + node autoscaler; optionally VPA for memory right-sizing.
- Steady-state services with chronic over-requesting
→ VPA recommendations first; adopt Initial or Recreate where the disruption budget allows.
- Batch/cron jobs with wide size variance
→ prefer good requests and a node-autoscaling capacity strategy; use VPA carefully (mostly recommender value, less frequent enforcement).
5) Golden compatibility rules
- Do not run HPA and VPA on the same CPU/memory metric target.
If needed, use separation (for example HPA on an external/custom metric, VPA on memory).
- Replica control belongs to one owner at a time.
If KEDA is wrapping HPA behavior, treat KEDA plus its generated HPA as the replica authority.
- Right-size before over-optimizing node policy.
Bad pod requests poison both HPA signal quality and node consolidation quality.
- Stabilization beats reactivity for cost-sensitive systems.
Avoid scaling on every transient spike.
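The first two rules above lend themselves to a preflight check in CI. A hedged sketch — the helper and its inputs are hypothetical, not a real Kubernetes API:

```python
def conflicting_metrics(hpa_resource_metrics, vpa_controlled_resources):
    """Return resources targeted by both HPA and VPA, which the
    golden rule above forbids (e.g. both acting on 'cpu')."""
    overlap = set(hpa_resource_metrics) & set(vpa_controlled_resources)
    return sorted(overlap)

# HPA scales on CPU while VPA controls cpu+memory: conflict on 'cpu'.
print(conflicting_metrics({"cpu"}, {"cpu", "memory"}))            # ['cpu']
# HPA scales on an external queue metric, VPA right-sizes memory: safe.
print(conflicting_metrics({"external:queue_depth"}, {"memory"}))  # []
```

In practice the inputs would be extracted from the HPA's metrics list and the VPA's controlledResources field before a deploy is admitted.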
6) Baseline tuning templates
6.1 HPA behavior (safe default)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api              # illustrative name
spec:
  scaleTargetRef:            # illustrative target, required for a valid spec
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 60
  metrics:                   # illustrative metric; substitute your own
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 6
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
      selectPolicy: Min
Pattern: fast up, slow down.
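To see how the scaleUp policies combine, a small sketch assuming the documented semantics: each policy bounds growth per period, and selectPolicy Max takes the more permissive allowance (function name and values are illustrative):

```python
import math

def scale_up_limit(current, percent, pods):
    """Max replicas reachable in one period under a Percent and a
    Pods policy combined with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))  # Percent policy
    by_pods = current + pods                               # Pods policy
    return max(by_percent, by_pods)                        # Max = more permissive

# At low replica counts the Pods policy dominates (3 -> 9 in one period);
# at higher counts the Percent policy dominates (40 -> 80).
print(scale_up_limit(3, 100, 6))
print(scale_up_limit(40, 100, 6))
```

This is why pairing a Pods policy with a Percent policy matters: Percent alone is slow to escape a small replica count, and Pods alone caps growth at large counts.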
6.2 VPA adoption ladder
- Stage 1: updateMode: Off for 1–2 weeks (collect recommendations vs actual).
- Stage 2: Initial for newly created pods only.
- Stage 3: Recreate or InPlaceOrRecreate for selected workloads with a tolerant disruption budget.
Always set minAllowed / maxAllowed bounds to avoid pathological recommendations.
6.3 KEDA queue-worker baseline
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker          # illustrative name
spec:
  scaleTargetRef:
    name: queue-worker        # the Deployment to scale
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 6
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
  triggers:
    - type: rabbitmq          # illustrative trigger; substitute your event source
      metadata:
        queueName: work
        mode: QueueLength
        value: "20"
        hostFromEnv: RABBITMQ_HOST
Pattern: predictable wake-up, conservative drain-down, explicit degraded-mode fallback.
7) Failure modes you should expect
7.1 HPA says scale up, but pods stay Pending
Cause: node capacity loop lagging, or requests too large to fit node shapes.
Fix:
- verify unschedulable reasons (cpu/memory/affinity/taints/volumes),
- correct requests,
- ensure node autoscaler limits and instance shapes match workload envelope.
7.2 VPA recommendations cause unschedulable pods
Cause: recommended requests exceed largest allocatable node profile.
Fix:
- enforce maxAllowed caps,
- combine with a node autoscaler strategy that includes larger shapes,
- review multi-container total request inflation.
7.3 Queue scaler oscillates between 0 and N
Cause: short polling/cooldown against bursty trigger.
Fix:
- increase cooldown,
- set a non-zero minReplicaCount for hot paths,
- smooth the trigger metric at the source when possible.
7.4 Cost blowout from “always scaling out first”
Cause: aggressive scale-up + weak scale-down stabilization + over-requested pods.
Fix:
- improve pod request hygiene,
- use HPA downscale stabilization and stricter downscale policies,
- tune consolidation window on node autoscaler.
8) Production rollout sequence (recommended)
- Instrument first: queue depth, service saturation, pending pods, node churn, eviction counts.
- Enable/tune HPA on one tier with clear SLO and rollback threshold.
- Add node autoscaling guardrails (min/max, allowed instance classes, disruption policy).
- Introduce KEDA for event-driven workers where scale-to-zero gives clear value.
- Run VPA in Off mode and compare recommendations to SLO/cost outcomes.
- Promote VPA enforcement selectively with PDB-aware disruption windows.
One layer at a time. Never flip all controllers to “auto” in one change set.
9) What to monitor (minimum dashboard)
- HPA desired vs current replicas, scaling events, and stabilization effects.
- KEDA trigger value, scaler errors, fallback activation count.
- VPA recommendation deltas, evictions/restarts, in-place resize success/failure.
- Pending pods duration and unschedulable reasons.
- Node provisioning latency, consolidation events, node churn.
- Service SLOs (latency/error) correlated with scaling actions.
If you cannot explain a scaling decision from telemetry in under 5 minutes, observability is insufficient.
10) Bottom line
Treat autoscaling as a coordinated control system, not four separate features.
- HPA/KEDA handle demand response.
- VPA fixes request quality and long-horizon efficiency.
- Node autoscaler converts requests into actual capacity.
Most teams don’t need “more aggressive autoscaling.” They need clean ownership of each loop, bounded policies, and better request hygiene.
References
- Kubernetes HPA concepts: https://kubernetes.io/docs/concepts/workloads/autoscaling/horizontal-pod-autoscale/
- Kubernetes VPA concepts: https://kubernetes.io/docs/concepts/workloads/autoscaling/vertical-pod-autoscale/
- Kubernetes autoscaling overview: https://kubernetes.io/docs/concepts/workloads/autoscaling/
- Kubernetes node autoscaling (provisioning/consolidation): https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling/
- KEDA ScaledObject spec: https://keda.sh/docs/2.19/reference/scaledobject-spec/
- Cluster Autoscaler FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md
- VPA known limitations (HPA compatibility note): https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/known-limitations.md