Kubernetes Autoscaling Stack Design Playbook (HPA, VPA, KEDA, Node Autoscaler)

2026-03-24 · software

Category: knowledge
Scope: Practical system design and operations guide for combining workload autoscaling (HPA/VPA/KEDA) with node autoscaling in production.


1) Why teams struggle with autoscaling

Most production incidents around autoscaling are not caused by a single broken controller. They come from control-loop interaction: HPA, VPA, KEDA, and the node autoscaler each run their own reconcile loop, and each loop's output (replica count, requests/limits, node capacity) changes the inputs the other loops react to.

If these loops are tuned independently, you get oscillation, delayed recovery, or unnecessary cost.


2) Mental model: four loops, different time constants

Think of autoscaling as layered loops:

  1. Fast loop (seconds): HPA reconcile interval (default ~15s in controller-manager docs) adjusts replica count.
  2. Event loop (seconds to minutes): KEDA polls trigger sources (default 30s) and controls scale-to-zero / wake-up behavior.
  3. Right-sizing loop (minutes to hours): VPA recommender/update cycle adjusts requests/limits.
  4. Capacity loop (minutes): Node autoscaler provisions/consolidates nodes based on schedulability and requested resources.

Design rule: faster loops should absorb burst; slower loops should optimize efficiency.


3) What each autoscaler is best at

3.1 HPA (Horizontal Pod Autoscaler)

Best for:

  - Stateless, request-driven services where load tracks CPU, concurrency, or a custom metric.
  - Absorbing short traffic bursts by adding replicas quickly.

Key behavior notes from Kubernetes docs:

  - Desired replicas = ceil(currentReplicas × currentMetric / targetMetric).
  - A tolerance band (0.1 by default) suppresses scaling on small metric deviations.
  - Scale-down uses a default stabilization window of 300s; scale-up has none by default.

Operational implication: HPA is your primary burst absorber.
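The scaling formula HPA uses (desired = ceil(current × metric/target), with a default 10% tolerance band) can be sketched in a few lines. Python is used for illustration only; the real controller adds stabilization windows and per-pod readiness handling on top of this.

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Sketch of the documented HPA formula:
    desired = ceil(current * currentMetric / targetMetric),
    with no change while the ratio stays inside the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no scaling action
    return math.ceil(current_replicas * ratio)

# 4 replicas at 90% CPU against a 60% target scale to ceil(4 * 1.5) = 6;
# 4 replicas at 63% against 60% stay put (ratio 1.05 is inside the band).
```

This is why noisy metrics near the target do not thrash the replica count: the tolerance band absorbs small deviations before any scaling happens.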

3.2 VPA (Vertical Pod Autoscaler)

Best for:

  - Workloads whose per-pod resource needs are unknown, drifting, or badly guessed.
  - Producing data-driven requests/limits instead of copy-pasted folklore values.

Important constraints:

  - Applying a new recommendation has historically required recreating the pod (in-place resize is a newer, gated capability).
  - Do not let VPA and HPA act on the same CPU/memory metrics for the same workload.
  - Modes: Off (recommend only), Initial (set at admission), Recreate/Auto (evict and reapply).

Operational implication: start with Off mode for learning, then selectively enable enforcement.

3.3 KEDA

Best for:

  - Event-driven workers scaling on external signals (queue depth, stream lag, cron schedules).
  - Scale-to-zero for workers that are idle most of the time.

Key defaults from KEDA docs:

  - pollingInterval: 30 seconds.
  - cooldownPeriod: 300 seconds before scaling back to zero.
  - KEDA generates and manages an HPA for 1→N scaling; KEDA itself owns the 0↔1 transition.

Operational implication: KEDA is the best bridge from external backlog → HPA-compatible scaling.

3.4 Node autoscaler (Cluster Autoscaler / Karpenter class)

Best for:

  - Adding nodes when pods are Pending for lack of schedulable capacity.
  - Removing or consolidating underutilized nodes to reduce cost.

Kubernetes node autoscaling concepts now frame this as:

  - Provisioning: create nodes so pending pods can schedule.
  - Consolidation: drain and remove nodes whose pods fit elsewhere.

Operational implication: node autoscaling only sees requests/scheduling constraints, not true runtime usage.
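This blind spot can be made concrete with a toy schedulability check (Python for illustration; the node shapes and pod sizes below are invented, not real instance profiles):

```python
def fits(pod_requests: dict, node_allocatable: dict) -> bool:
    """A pod is schedulable on a node shape only if every requested
    resource fits inside the node's allocatable — actual usage is
    never consulted by the capacity loop."""
    return all(pod_requests[r] <= node_allocatable.get(r, 0)
               for r in pod_requests)

# Illustrative shapes: millicores of CPU, MiB of memory.
node_shapes = [
    {"cpu": 3900, "memory": 15000},   # ~4 vCPU node, minus system reserve
    {"cpu": 7800, "memory": 31000},   # ~8 vCPU node
]

pod = {"cpu": 8000, "memory": 16000}  # over-requested worker

# No allowed shape fits the request, so the node autoscaler cannot help:
# the pod stays Pending even if its real usage is a fraction of this.
schedulable = any(fits(pod, shape) for shape in node_shapes)
```

The fix is request hygiene, not a bigger autoscaler: shrink the request or allow a node class that can actually hold it.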


4) Decision matrix (quick selection)

  - Signal is CPU, concurrency, or latency on a stateless service → HPA.
  - Signal is external backlog (queue, stream, cron), or scale-to-zero matters → KEDA.
  - Per-pod requests are wrong, unknown, or drifting → VPA.
  - Pods are Pending, or nodes sit idle → node autoscaler.

5) Golden compatibility rules

  1. Do not run HPA and VPA on the same CPU/memory metric target.
    Use separation (for example HPA on external/custom metric, VPA on memory) if needed.

  2. Replica control belongs to one owner at a time.
    If KEDA is wrapping HPA behavior, treat KEDA+generated HPA as the replica authority.

  3. Right-size before over-optimizing node policy.
    Bad pod requests poison both HPA signal quality and node consolidation quality.

  4. Stabilization beats reactivity for cost-sensitive systems.
    Avoid scaling on every transient spike.
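Rule 1 in practice: one way to separate ownership is to let HPA drive replicas from a CPU, custom, or external metric while VPA controls only memory. A hedged sketch using the VPA `controlledResources` field (the workload name is illustrative):

```yaml
# VPA restricted to memory so it cannot fight an HPA that scales
# the same workload on a CPU, custom, or external metric.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker              # illustrative target workload
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # leave CPU to the HPA signal
```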


6) Baseline tuning templates

6.1 HPA behavior (safe default)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa               # example name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # example target workload
  minReplicas: 3
  maxReplicas: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 6
          periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
      selectPolicy: Min

Pattern: fast up, slow down.

6.2 VPA adoption ladder

Always set minAllowed / maxAllowed bounds to avoid pathological recommendations.
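A hedged sketch of the first rung of the ladder — recommendation-only mode with explicit bounds (the names and bound values are illustrative, not recommendations):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa               # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                 # illustrative target workload
  updatePolicy:
    updateMode: "Off"         # recommend only; no pod disruption
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:           # cap below your largest node's allocatable
          cpu: "2"
          memory: 4Gi
```

Compare the recommendations against observed SLO and cost outcomes before promoting any workload to an enforcing mode.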

6.3 KEDA queue-worker baseline

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler         # example name
spec:
  scaleTargetRef:
    name: worker              # example target Deployment
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 6
  triggers:
    - type: rabbitmq          # example trigger; swap for your queue source
      metadata:
        queueName: work-queue # host typically supplied via TriggerAuthentication
        mode: QueueLength
        value: "20"           # target messages per replica
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300

Pattern: predictable wake-up, conservative drain-down, explicit degraded-mode fallback.


7) Failure modes you should expect

7.1 HPA says scale up, but pods stay Pending

Cause: node capacity loop lagging, or requests too large to fit node shapes.
Fix:

  - Check whether pod requests fit any allowed node shape; shrink requests or allow larger instances.
  - Verify node autoscaler max limits, cloud quotas, and instance availability.
  - Pre-provision headroom (priority-based placeholder pods) for latency-sensitive tiers.

7.2 VPA recommendations cause unschedulable pods

Cause: recommended requests exceed largest allocatable node profile.
Fix:

  - Cap recommendations with maxAllowed below the largest node's allocatable.
  - Allow a larger node class if the workload genuinely needs the resources.
  - Keep that workload in Off mode until bounds are validated against real nodes.

7.3 Queue scaler oscillates between 0 and N

Cause: short polling/cooldown against bursty trigger.
Fix:

  - Raise cooldownPeriod so brief idle gaps do not trigger scale-to-zero.
  - Use an activation threshold so a trickle of messages does not wake the fleet.
  - If wake-up latency is unacceptable, keep minReplicaCount at 1 instead of 0.
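The effect of a cooldown on a bursty trigger can be shown with a toy simulation (Python for illustration; one tick per polling interval, scaling capped at a single replica to keep the model small):

```python
def scale_events(queue_depths, cooldown_ticks):
    """Toy scale-to-zero model: scale up to 1 replica when the queue is
    non-empty; scale to 0 only after the queue has been empty for
    `cooldown_ticks` consecutive ticks. Returns the total number of
    scale transitions (up or down) — a proxy for flapping."""
    replicas, idle, transitions = 0, 0, 0
    for depth in queue_depths:
        if depth > 0:
            idle = 0
            if replicas == 0:
                replicas, transitions = 1, transitions + 1
        else:
            idle += 1
            if replicas == 1 and idle >= cooldown_ticks:
                replicas, transitions = 0, transitions + 1
    return transitions

bursty = [5, 0, 4, 0, 3, 0, 0, 0, 0, 0]  # short gaps, then real idle
# A 1-tick cooldown flaps on every gap (6 transitions);
# a 3-tick cooldown rides through them (2 transitions).
```

The same reasoning applies to real cooldownPeriod values: size the cooldown to the longest idle gap you consider "still busy".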

7.4 Cost blowout from “always scaling out first”

Cause: aggressive scale-up + weak scale-down stabilization + over-requested pods.
Fix:

  - Right-size requests first (start from VPA Off-mode recommendations).
  - Lengthen scale-down stabilization and cap scale-up policy rates.
  - Enable node consolidation so freed capacity is actually released.


8) Production rollout sequence (recommended)

  1. Instrument first: queue depth, service saturation, pending pods, node churn, eviction counts.
  2. Enable/tune HPA on one tier with clear SLO and rollback threshold.
  3. Add node autoscaling guardrails (min/max, allowed instance classes, disruption policy).
  4. Introduce KEDA for event-driven workers where scale-to-zero gives clear value.
  5. Run VPA in Off mode and compare recommendations to SLO/cost outcomes.
  6. Promote VPA enforcement selectively with PDB-aware disruption windows.

One layer at a time. Never flip all controllers to “auto” in one change set.


9) What to monitor (minimum dashboard)

  - HPA: desired vs actual replicas, current metric vs target.
  - KEDA: trigger value (queue depth / lag), activation events, fallback activations.
  - VPA: recommendation vs current requests, eviction counts.
  - Nodes: pending pods, node churn (provision/consolidation events), requested vs allocatable.

If you cannot explain a scaling decision from telemetry in under 5 minutes, observability is insufficient.


10) Bottom line

Treat autoscaling as a coordinated control system, not four separate features.

Most teams don’t need “more aggressive autoscaling.” They need clean ownership of each loop, bounded policies, and better request hygiene.

