Kubernetes NodeLocal DNSCache + CoreDNS Scaling Playbook
How to stop DNS from becoming your hidden latency tax and intermittent outage source.
Why this matters
In many clusters, DNS is treated as “just plumbing” until one of these happens:
- p95 app latency spikes without obvious CPU or DB saturation
- random timeout storms appear during traffic peaks
- node drains/upgrades trigger a wave of UnknownHost/SERVFAIL errors
- CoreDNS CPU and restarts oscillate with cluster size
In practice, DNS failures are often fan-out multipliers: one small resolver issue can hit every service path.
This playbook gives an operator-focused pattern for:
- scaling CoreDNS safely,
- reducing cross-node DNS hops via NodeLocal DNSCache,
- tuning cache behavior to reduce backend load without serving stale answers forever.
1) Mental model: where DNS latency actually comes from
Without NodeLocal DNSCache, a pod’s DNS request usually goes:
Pod -> kube-dns Service IP -> kube-proxy translation -> CoreDNS pod (often remote node) -> upstream
With NodeLocal DNSCache:
Pod -> node-local-dns on same node -> (cache hit: done) OR (miss: CoreDNS/upstream)
Operational implications:
- fewer iptables/IPVS and conntrack side effects on hot paths
- fewer cross-node round trips for repeated names
- more stable latency under bursty name resolution patterns
- node-level DNS metrics become visible (not just cluster aggregate)
2) First principles for capacity planning
A) CoreDNS replicas (autoscaler)
Kubernetes DNS horizontal autoscaling commonly uses cluster-proportional-autoscaler (CPA).
Default linear model idea:
replicas = max( ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica) )
So:
- large-core clusters are dominated by coresPerReplica
- many-small-node clusters are dominated by nodesPerReplica
Start conservative, then tune from observed saturation and SLOs.
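The linear model above can be sketched in a few lines. The parameter values used here (coresPerReplica=256, nodesPerReplica=16, min=2) are illustrative starting points in the spirit of common kube-dns-autoscaler configs, not recommendations for your cluster:

```python
import math

def cpa_linear_replicas(cores: int, nodes: int,
                        cores_per_replica: float = 256,
                        nodes_per_replica: float = 16,
                        min_replicas: int = 2,
                        prevent_single_point_failure: bool = True) -> int:
    """Approximate cluster-proportional-autoscaler 'linear' mode.

    Replicas scale with whichever dimension (total cores or node count)
    is currently the larger driver; a floor avoids a singleton CoreDNS.
    """
    replicas = max(math.ceil(cores / cores_per_replica),
                   math.ceil(nodes / nodes_per_replica))
    if prevent_single_point_failure:
        replicas = max(replicas, 2)
    return max(replicas, min_replicas)

# 100 nodes x 8 cores each: node count dominates -> ceil(100/16) = 7
print(cpa_linear_replicas(cores=800, nodes=100))  # -> 7
```

Running the numbers like this before changing the CPA ConfigMap makes it obvious which term (cores or nodes) your cluster is actually scaling on.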
B) CoreDNS memory sizing
A practical baseline from CoreDNS deployment guidance:
- default config estimate: MB ~= (Pods + Services)/1000 + 54
- with autopath: MB ~= (Pods + Services)/250 + 56
Treat these as starting priors, then calibrate on your workload/query mix.
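The two estimates above are trivial to encode, which makes them easy to drop into a capacity-planning script (the formulas come from the CoreDNS deployment scaling notes referenced below; calibrate against observed usage):

```python
def coredns_mb_estimate(pods: int, services: int, autopath: bool = False) -> float:
    """Starting-point memory estimate (MB) per CoreDNS replica.

    These are priors for a default Kubernetes CoreDNS config; the autopath
    plugin trades memory for reduced search-path query amplification.
    """
    if autopath:
        return (pods + services) / 250 + 56
    return (pods + services) / 1000 + 54

print(coredns_mb_estimate(5000, 1000))                 # -> 60.0
print(coredns_mb_estimate(5000, 1000, autopath=True))  # -> 80.0
```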
C) NodeLocal DNSCache memory
NodeLocal runs per node (DaemonSet), so “small per-pod overhead” becomes cluster-wide overhead.
- default CoreDNS cache size (10k entries) is often ~30MB when full (per server block)
- query concurrency and cache policy can push usage higher
If NodeLocal pods OOMKill, you get brief DNS blackouts on affected nodes. Set realistic memory requests/limits from measured peaks.
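Because NodeLocal is a DaemonSet, its cache footprint multiplies by node count. A rough planning sketch, assuming ~3 KB per cached entry (an assumption consistent with the "~30MB at 10k entries" figure above):

```python
def nodelocal_overhead_mb(nodes: int,
                          entries_per_node: int = 10_000,
                          bytes_per_entry: int = 3_000,
                          server_blocks: int = 1) -> tuple[float, float]:
    """Estimate NodeLocal DNSCache memory at full cache.

    bytes_per_entry is an assumed average, not a measured constant;
    each server block in the Corefile gets its own cache.
    """
    per_node_mb = entries_per_node * bytes_per_entry * server_blocks / 1e6
    return per_node_mb, per_node_mb * nodes

per_node, cluster_wide = nodelocal_overhead_mb(nodes=200)
print(per_node, cluster_wide)  # -> 30.0 6000.0
```

The point of the exercise: a "small" 30MB per-node cache is 6GB of cluster-wide memory at 200 nodes, so size limits from measured peaks, not defaults.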
3) Rollout strategy (safe sequence)
Phase 0 — Observe before changing
Collect at least:
- CoreDNS CPU/memory/restarts
- request rate and cache hit ratio
- DNS error mix (NXDOMAIN, SERVFAIL, timeout)
- app-side lookup latency and timeout counts
Phase 1 — Stabilize CoreDNS first
Before NodeLocal rollout, make CoreDNS stable:
- set CPA min replicas (avoid singleton)
- set requests/limits based on current QPS
- verify disruption settings for kube-system workloads
Phase 2 — Introduce NodeLocal DNSCache
- deploy manifest with a non-colliding local IP (link-local range is common)
- in IPVS mode, update the kubelet --cluster-dns flag as required
- canary on a subset of nodes first
Phase 3 — Tune cache behavior
- raise hit ratio without over-serving stale data
- tune negative cache behavior based on your NXDOMAIN profile
- validate tail latency and timeout reduction, not just average latency
4) CoreDNS cache tuning that usually works
The CoreDNS cache plugin gives fine-grained controls:
- success/denial cache capacities and TTL caps
- prefetch for popular records before expiry
- serve_stale for resilience during upstream blips
- servfail cache duration (keep it short)
Example pattern (illustrative):
cache 300 {
    success 20000 300 5
    denial 10000 60 5
    prefetch 20 1m 20%
    serve_stale 30s immediate
    servfail 5s
}
Guidance:
- keep stale windows short unless you explicitly prioritize availability over freshness
- be careful with aggressive denial caching if service discovery state changes quickly
- avoid keepttl for recursive/caching use cases (it can propagate stale behavior downstream)
5) Query amplification trap: ndots and search domains
Typical pod resolv.conf includes search suffixes and options ndots:5.
Operationally, short external hostnames may trigger multiple suffix attempts before absolute resolution, amplifying QPS and NXDOMAIN volume.
Mitigations:
- use fully qualified names for high-QPS external dependencies
- where appropriate, consider trailing-dot absolute names for resolver-critical paths
- monitor NXDOMAIN volume before/after app DNS config changes
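The amplification is easy to see with a small simulation of glibc-style search-list expansion (simplified; real resolvers vary, and A/AAAA pairs double the query count again):

```python
def queries_attempted(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Simulate resolv.conf search-list expansion (simplified sketch).

    A trailing-dot name is absolute: one query. Otherwise, a name with
    fewer than `ndots` dots tries every search suffix before the literal
    name, so short external hostnames pay the full suffix tax.
    """
    if name.endswith("."):
        return [name]
    if name.count(".") < ndots:
        return [f"{name}.{s}." for s in search] + [name + "."]
    return [name + "."] + [f"{name}.{s}." for s in search]

search = ["prod.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(len(queries_attempted("api.example.com", search)))   # -> 4
print(len(queries_attempted("api.example.com.", search)))  # -> 1
```

Three of those four attempts for "api.example.com" are guaranteed NXDOMAINs against cluster suffixes, which is exactly the NXDOMAIN volume the monitoring bullet above is watching for.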
6) SLOs and alerts (minimum set)
Track these as first-class reliability signals:
- DNS lookup success rate (cluster and namespace critical paths)
- p95/p99 DNS latency (app-side + DNS-side)
- CoreDNS saturation (CPU throttling, memory pressure, restarts)
- NodeLocal pod OOM/restart rate
- cache hit ratio and stale-served rate
- SERVFAIL and timeout rate
If you only track request count and average latency, you will miss most real incidents.
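Deriving the headline ratios from raw counters is straightforward; the counter shapes here are illustrative of typical CoreDNS Prometheus exports (responses by rcode, cache hits/misses), not exact metric names:

```python
def dns_signals(responses_by_rcode: dict[str, int],
                cache_hits: int, cache_misses: int) -> dict[str, float]:
    """Turn raw counters into the ratios listed above.

    NXDOMAIN is counted as a served answer (it is a valid response);
    track its trend separately rather than folding it into failures.
    """
    total = sum(responses_by_rcode.values())
    servfail = responses_by_rcode.get("SERVFAIL", 0)
    return {
        "success_rate": (total - servfail) / total,
        "servfail_rate": servfail / total,
        "cache_hit_ratio": cache_hits / (cache_hits + cache_misses),
    }

sig = dns_signals({"NOERROR": 9600, "NXDOMAIN": 300, "SERVFAIL": 100},
                  cache_hits=8000, cache_misses=2000)
print(sig)
```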
7) Incident playbook snippets
Symptom: app timeouts spike, CoreDNS CPU pinned
Likely causes:
- sudden query amplification (retry storm, ndots/search expansion)
- insufficient CoreDNS replicas
- cache miss ratio jump due to low TTL/high cardinality names
Actions:
- scale CoreDNS replicas up immediately (temporary safety)
- verify CPA config and target deployment
- inspect top QNAMEs / NXDOMAIN contributors
- patch hot clients to reduce resolver churn
Symptom: NodeLocal pods restarting/OOMKilled
Likely causes:
- too-low memory limits for cache/query concurrency
- bursty per-node traffic patterns
Actions:
- raise request/limit and redeploy
- temporarily reduce cache capacity if required
- validate no prolonged stale-serve side effects
Symptom: intermittent SERVFAIL during upstream issues
Actions:
- ensure short servfail caching is enabled
- use bounded serve_stale to absorb transient upstream flaps
- verify upstream resolver health and the packet-loss path
8) Anti-patterns to avoid
- Running CoreDNS as an effective singleton in production
- Enabling NodeLocal everywhere without memory headroom testing
- Tuning cache TTLs aggressively without stale/freshness policy
- Ignoring NXDOMAIN trends (often an early warning for app mis-resolution)
- Treating DNS as “best effort” while holding strict app latency SLOs
9) Practical baseline checklist
- CoreDNS autoscaling enabled (CPA or equivalent)
- min replicas >= 2 for production clusters
- NodeLocal DNSCache canary tested and rolled out
- CoreDNS + NodeLocal resource limits calibrated from observed peak
- cache hit ratio, stale rate, SERVFAIL, timeout, NXDOMAIN on dashboards
- DNS-related alerts routed to oncall with runbook links
- app teams have documented DNS naming practices (FQDNs, retry behavior)
10) References
- Kubernetes: Using NodeLocal DNSCache in Kubernetes Clusters
  https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
- Kubernetes: Autoscale the DNS Service in a Cluster
  https://kubernetes.io/docs/tasks/administer-cluster/dns-horizontal-autoscaling/
- Kubernetes: DNS for Services and Pods
  https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
- CoreDNS cache plugin docs
  https://coredns.io/plugins/cache/
- CoreDNS deployment sizing notes
  https://github.com/coredns/deployment/blob/master/kubernetes/Scaling_CoreDNS.md
If you do only one thing: stabilize CoreDNS autoscaling and add NodeLocal DNSCache with measured memory limits. That single move usually removes a surprising amount of tail-latency and timeout noise.