Kubernetes PodDisruptionBudget + Eviction API Playbook

2026-03-29 · software


How to make node maintenance, autoscaling, and upgrades boring instead of outage-prone.

Why this matters

Most Kubernetes incidents during maintenance are not caused by a broken rollout script; they come from a mismatch between workload availability assumptions and disruption controls.

Typical failure pattern:

  1. Team sets an overly strict PDB (maxUnavailable: 0 or minAvailable: 100%)
  2. Node drain starts
  3. Evictions are rejected forever (429), maintenance pipelines hang
  4. Operators bypass with force-delete, causing avoidable downtime

This playbook gives a practical operating model to avoid that trap.


1) Mental model: three different “unavailability budgets”

You need to reason about three separate controls:

  1. Workload rollout budget (e.g., Deployment maxUnavailable, maxSurge)
    • Controls app updates from controller strategy.
  2. Disruption budget (PDB minAvailable or maxUnavailable)
    • Controls voluntary evictions via Eviction API (drain, autoscaler, managed upgrades).
  3. Involuntary failure budget (node crash, kernel panic, AZ loss)
    • Not preventable by PDB; but involuntary loss still counts against current availability.

Key nuance: a PDB does not govern all pod removals. Only the Eviction API consults it; a direct delete (kubectl delete pod) bypasses the budget entirely.


2) Ground rules (non-negotiable)

Rule A — Never use “zero voluntary eviction” by accident

Settings such as maxUnavailable: 0, or a minAvailable equal to (or above) the current replica count, reject every eviction by design: drains against those pods can never complete.
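As a minimal sketch (hypothetical name and labels), this is what the accidental zero-eviction trap looks like in manifest form:

```yaml
# Anti-pattern: with maxUnavailable: 0, the Eviction API can never
# approve a voluntary eviction, so node drains against these pods
# hang until someone intervenes.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb   # hypothetical name
spec:
  maxUnavailable: 0    # zero voluntary disruption allowed
  selector:
    matchLabels:
      app: checkout    # hypothetical label
```

If a workload genuinely cannot tolerate any voluntary disruption, that decision should be explicit and paired with a maintenance procedure, not an accidental default.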

Rule B — Prefer maxUnavailable for elastic workloads

For many replicated services, maxUnavailable: 1 is operationally safer than an absolute minAvailable: it stays valid as replica counts change, while an absolute floor must be re-tuned on every scale event.
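A hedged illustration of why absolute counts are brittle (hypothetical name and labels):

```yaml
# Brittle: with 5 replicas this budget allows one eviction at a time,
# but if the workload is later scaled down to 4 replicas, the same
# PDB allows zero evictions and drains start hanging.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb        # hypothetical name
spec:
  minAvailable: 4      # absolute floor; silently wrong after scaling
  selector:
    matchLabels:
      app: web         # hypothetical label
```

Replacing minAvailable: 4 with maxUnavailable: 1 keeps the same one-at-a-time guarantee through scale events without edits.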

Rule C — Set unhealthy pod eviction behavior intentionally

Use:

unhealthyPodEvictionPolicy: AlwaysAllow

Why: the default policy (IfHealthyBudget) can block drains when pods are running but not Ready (CrashLoopBackOff, broken probes, etc.), because the budget already counts them as disrupted.

Rule D — Use Eviction API path for maintenance

kubectl drain (without --disable-eviction) goes through the Eviction API, so it respects PDBs and graceful termination. A direct DELETE bypasses both and is a last-resort break-glass path.


3) PDB design recipes by workload type

A) Stateless API / web tier

  • Use maxUnavailable: 1 (or a small percentage) plus unhealthyPodEvictionPolicy: AlwaysAllow.
  • Pair with a surge-based rollout so updates add capacity before removing any.

B) Quorum systems (etcd, ZooKeeper-like)

  • Allow at most one voluntary disruption at a time (maxUnavailable: 1) so a majority always survives.
  • Never permit more concurrent evictions than replicas minus quorum size.
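A sketch for a five-member quorum system, assuming a hypothetical app: etcd label:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: etcd-pdb       # hypothetical name
spec:
  maxUnavailable: 1    # evict at most one member at a time;
                       # 4 of 5 stay up, preserving quorum (3 of 5)
  selector:
    matchLabels:
      app: etcd        # hypothetical label
```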

C) Singleton stateful workload

  • A PDB of minAvailable: 1 over a single replica makes drains impossible by construction.
  • Either accept brief downtime in a maintenance window, or add a standby replica; do not paper over a singleton with a strict PDB.

D) Batch / replaceable jobs

  • Usually no PDB: evicted pods are retried or recreated by their controller.
  • Add one only when interrupting a long-running job is genuinely expensive.


4) Interlock: Deployment rolling update vs PDB

A common confusion:

  • Deployment maxUnavailable limits pods removed by the rollout itself; PDB maxUnavailable limits pods removed by voluntary eviction. They are evaluated separately, and a rollout that drives pods unready can consume the disruption budget and stall concurrent drains.

These are independent levers. If both are too strict, you can deadlock operations.

Practical heuristic:

  • Keep the PDB at least as permissive as the rollout budget, and set maxSurge to 1 or more so updates add capacity before removing any.
  • During planned maintenance, avoid running rollouts and drains at the same time; both draw down the same real-world availability.


5) Node drain runbook (production-safe)

Preflight

  1. Confirm target node pod mix:
    kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
    
  2. Confirm PDB headroom:
    kubectl get pdb -A
    
  3. Watch allowed disruptions and readiness for critical namespaces:
    kubectl get pdb -A --watch

Drain

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=30m

Notes:

  • --ignore-daemonsets: DaemonSet pods cannot be drained away; the flag skips them instead of failing the drain.
  • --delete-emptydir-data: acknowledges that emptyDir volume contents are lost on eviction.
  • --timeout=30m: bounds how long drain retries rejected evictions, so pipelines fail loudly instead of hanging forever.

During drain

  • Repeated "Cannot evict pod" messages with 429s mean a PDB is at its limit, not that drain is broken; drain retries automatically.
  • Verify replacement pods schedule and become Ready on other nodes before moving to the next node.

Restore

kubectl uncordon <node>

6) Failure modes and fast fixes

Symptom: drain hangs forever

Likely causes:

  • A PDB currently allows zero disruptions (check the ALLOWED DISRUPTIONS column in kubectl get pdb).
  • Pods are running but not Ready, so the budget counts them as already disrupted and the default unhealthy-pod policy refuses further evictions.
  • No schedulable capacity elsewhere, so replacement pods stay Pending and the budget never regains headroom.

Fix order:

  1. Add temporary capacity
  2. Repair unhealthy rollout/probes
  3. Adjust PDB to realistic threshold
  4. Use direct delete only with explicit incident decision

Symptom: Eviction API returns 429 repeatedly

Interpretation:

  • 429 is the Eviction API's expected back-pressure: the matching PDB currently allows zero disruptions. Persistent 429s mean the budget never frees up, not that the API is failing.

Actions:

  • Inspect the matching PDB (kubectl describe pdb <name>) and its allowed disruptions.
  • Restore headroom first (fix readiness, scale replicas, add capacity); treat force-delete as an incident-level decision, not a workaround.

Symptom: Eviction API returns 500

Interpretation:

  • 500 usually signals misconfiguration, most commonly multiple PDBs selecting the same pod, which leaves the API server unable to decide which budget applies.

Actions:

  • Ensure each pod is matched by at most one PDB; compare selectors across PDBs in the namespace.
  • Check kubectl describe pdb output for error conditions and events before retrying.


7) Baseline manifests

PDB for stateless service

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: api

Deployment alignment example

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

This pairing usually yields predictable updates + maintainable drains.


8) Operational SLO checks (recommended)

Track these continuously:

  • Eviction 429 rate and end-to-end drain duration per node pool.
  • Number of PDBs whose allowed disruptions sit at 0 (latent drain blockers).
  • Frequency of break-glass deletions (force-delete or --disable-eviction).

If break-glass deletions become normal, your disruption model is wrong.


9) Compact checklist

Before enabling automated node maintenance:

  • Every replicated workload has exactly one PDB, and it allows at least one disruption at steady state.
  • unhealthyPodEvictionPolicy is set deliberately for each tier.
  • Drains run with a timeout and alert on failure instead of hanging silently.
  • The break-glass force-delete path is documented and requires an explicit incident decision.

