Continuous Profiling in Production: eBPF + Agent Hybrid Playbook
Date: 2026-03-20
Category: knowledge
Goal: Turn continuous profiling from a "nice graph" into an operator-grade loop for latency/cost/reliability.
1) Why this matters
Ad-hoc profiling catches local bugs. It misses production-only behavior:
- real traffic mix,
- true contention patterns,
- periodic regressions after deploys,
- tail-latency cost concentration.
Continuous profiling gives time-indexed code-level evidence so you can answer:
- What got slower?
- When did it start?
- Which line/function owns the new cost?
Grafana’s docs explicitly frame this as low-overhead production sampling (not one-off debugging), with typical sampling-profiler overhead guidance of roughly 2–5%, depending on settings and environment.
2) Profiling model (mental model)
Treat profiling as a control loop, not a dashboard:
- Collect (sampling, low overhead)
- Attribute (service/version/zone/pod/build labels)
- Compare (before vs after; baseline vs canary; healthy vs degraded window)
- Act (optimize code/config/runtime)
- Guard (SLO + rollback if regressions exceed budget)
If you only collect and never wire up Compare, Act, and Guard, you get observability theater.
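The loop above can be sketched as a minimal skeleton. All function names here are hypothetical placeholders for team-specific plumbing, not a real profiler API:

```python
# Minimal sketch of the profiling control loop described above.
# collect/attribute/compare/act/guard are hypothetical callables you
# wire to your actual profiler, label pipeline, and deploy tooling.

def profiling_loop(collect, attribute, compare, act, guard, baseline):
    """One iteration: Collect -> Attribute -> Compare -> Act -> Guard."""
    raw = collect()                            # sampled stacks, low overhead
    profile = attribute(raw)                   # tag with service/version/zone/pod
    regressions = compare(profile, baseline)   # diff vs baseline window
    if regressions:
        act(regressions)                       # optimize code/config/runtime
    return guard(profile)                      # enforce SLO / trigger rollback
```

The point of writing it this way: if `compare`, `act`, or `guard` is a no-op in your setup, you are only doing step 1 and collecting graphs.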
3) Data-plane architecture choices
A. Language-agent profiling
Best for: rich runtime types (CPU + heap + alloc + mutex/block, etc.)
- Pros: deeper runtime-specific profile types.
- Cons: per-language rollout, sometimes code/runtime config changes required.
B. eBPF host-level profiling
Best for: broad fleet coverage with minimal app changes.
- Pros: host-wide visibility, no app redeploy for base CPU profiling.
- Cons: Linux-only, root/privileged constraints, profile-type limitations (often CPU-first).
Grafana Alloy docs note the eBPF tradeoffs explicitly: Linux-only scope, root/privileged requirements, and uneven coverage of non-CPU profile types (e.g., memory/lock).
C. Hybrid (recommended default)
- eBPF for broad baseline CPU visibility across fleet.
- Language agents for high-value services requiring deeper memory/lock/allocation analysis.
This keeps onboarding friction low while preserving depth where needed.
4) Practical sampling policy
Sampling rate is a budget decision.
Parca’s agent design documents its well-known default of 19 Hz per logical CPU, a prime rate chosen to reduce accidental aliasing with periodic application loops. That’s a useful baseline if you have no prior tuning data.
Suggested initial profile policy
- CPU sampling: start at conservative default (e.g., Parca-style 19 Hz equivalent).
- Profiling window granularity: 10s–60s collection chunks for comparison.
- Retention tiers:
- hi-res: 24–72h
- medium: 14d
- coarse rollups: 30–90d (or longer if cost allows)
Hard budgets (set before rollout)
- Profiler overhead budget (CPU): e.g., <= 3% steady-state (team-defined)
- Storage growth budget: e.g., <= X GB/day/service
- Query latency SLO for profile UI/API
No budget = infinite data gravity.
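To make the storage budget concrete, a back-of-envelope estimator. The bytes-per-sample figure is an assumption for illustration, not a measured constant; real volume depends heavily on stack depth, deduplication, and storage format:

```python
def daily_profile_volume_gb(cores: int, hz: float = 19.0,
                            bytes_per_sample: int = 256) -> float:
    """Rough pre-dedup storage estimate for CPU sampling.

    hz=19 follows Parca's prime-rate default; bytes_per_sample is an
    assumed average for one stored stack sample (illustrative only).
    """
    samples_per_day = cores * hz * 86_400   # seconds per day
    return samples_per_day * bytes_per_sample / 1e9
```

For example, a 64-core node at 19 Hz yields about 105M samples/day, i.e. roughly 27 GB/day at the assumed 256 B/sample before dedup and compression, which is why retention tiers and rollups matter.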
5) Rollout ladder (safe)
- Lab: one service, one node pool, synthetic load.
- Canary: 1–5% production pods/nodes.
- Segment: low-criticality services first.
- Core fleet: roll by environment + region.
- Deep-profile opt-in: language-specific memory/lock types for high-value services.
Promotion gates per stage
- p95/p99 latency delta within budget
- CPU overhead within budget
- no crash/restart increase
- no profiling pipeline backpressure
If any gate fails, freeze rollout and downsample before proceeding.
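The promotion gates can be encoded as a single pass/fail check per stage. The threshold defaults below are illustrative placeholders for team-defined budgets, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    p99_latency_delta_pct: float   # vs pre-rollout baseline
    cpu_overhead_pct: float        # attributable to the profiler
    restart_increase: int          # extra crashes/restarts vs baseline
    pipeline_lag_s: float          # profile ingestion backpressure

def promotion_gate(m: StageMetrics,
                   max_latency_delta_pct: float = 2.0,   # assumed budget
                   max_cpu_overhead_pct: float = 3.0,    # assumed budget
                   max_pipeline_lag_s: float = 60.0) -> bool:
    """True only if every gate passes; any failure = freeze + downsample."""
    return (m.p99_latency_delta_pct <= max_latency_delta_pct
            and m.cpu_overhead_pct <= max_cpu_overhead_pct
            and m.restart_increase <= 0
            and m.pipeline_lag_s <= max_pipeline_lag_s)
```

Making the gate a function (rather than a judgment call) is what lets you run it automatically at every rung of the ladder.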
6) Labeling contract (non-negotiable)
Minimum labels for useful diffing:
- service
- version (git SHA/build ID)
- env (prod/stage)
- region/zone
- node/pod
- runtime + language
Without strong labels, you can’t isolate regressions from traffic-mix noise.
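A minimal ingest-time check of the label contract above. The label keys are illustrative; map them to whatever your pipeline actually emits:

```python
# Assumed label keys for the minimum contract; adjust to your pipeline.
REQUIRED_LABELS = {"service", "version", "env", "region", "node", "runtime"}

def validate_labels(labels: dict) -> list:
    """Return missing or empty required labels; an empty list means valid."""
    return sorted(k for k in REQUIRED_LABELS if not labels.get(k))
```

Profiles failing the contract should be rejected or quarantined at ingest, since unlabeled data cannot be diffed against anything later.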
7) Alerting: profile-informed, not profile-noisy
Avoid alerting directly on flame graphs. Alert on derived guardrails:
- top function CPU share drift (e.g., +N sigma vs baseline)
- new hot path emergence (% total CPU > threshold for M minutes)
- symbolization failure ratio spikes
- ingestion lag / dropped profiles
Then use profiles for diagnosis after an alert fires.
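The "+N sigma vs baseline" guardrail can be sketched as a plain z-score check over a function's historical CPU-share samples. Window sizes and the sigma threshold are assumptions to tune per service:

```python
import statistics

def cpu_share_drift_alert(baseline_shares: list, current_share: float,
                          n_sigma: float = 3.0) -> bool:
    """Alert when a function's CPU share exceeds baseline mean + n_sigma * stdev.

    baseline_shares: per-window CPU share (0..1) for one function over a
    healthy baseline period; needs enough points for a stable stdev.
    n_sigma=3 is an illustrative default, not a recommendation.
    """
    mean = statistics.fmean(baseline_shares)
    stdev = statistics.stdev(baseline_shares)
    return current_share > mean + n_sigma * stdev
```

This alerts on the derived number (share drift), not on the flame graph itself, which keeps the alert cheap and the profile available for diagnosis afterward.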
8) Common failure modes
"Always-on" but no compare workflow
- Fix: enforce before/after profile diff in performance PR template.
Over-sampling everything
- Fix: service-tiered sampling + retention tiers.
eBPF-only dogma
- Fix: hybrid approach for memory/lock depth.
Poor symbol hygiene
- Fix: build IDs + debug info pipeline + symbol upload verification.
Security blind spot (privileged agents)
- Fix: least privilege, dedicated node pools, explicit approval for root-required collectors.
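The before/after diff fix can be mechanized for the PR template: compare per-function CPU shares between baseline and candidate profiles and surface the top regressions. The flat function-to-share input shape is a simplification; real profile diffing (as in Pyroscope/Parca diff views) operates on full stack traces:

```python
def top_regressions(before: dict, after: dict, top_n: int = 5) -> list:
    """Largest CPU-share increases between two profiles.

    before/after: function name -> CPU share (0..1). A flat map is a
    simplification of a real flame-graph diff, but is enough to gate
    a performance PR on "what got more expensive".
    """
    deltas = {fn: after.get(fn, 0.0) - before.get(fn, 0.0)
              for fn in set(before) | set(after)}
    worst = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return [(fn, d) for fn, d in worst[:top_n] if d > 0]
```

Wiring this into release review (fail or flag when the top delta exceeds a budget) is what turns "always-on" collection into a compare workflow.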
9) 30-day implementation blueprint
Week 1: Baseline
- Select profiler stack (Pyroscope/Parca/etc.).
- Define overhead/storage/query SLO budgets.
- Establish labels and retention policy.
Week 2: Canary
- Enable CPU profiling for 1–2 candidate services.
- Validate overhead + ingestion + symbolization quality.
Week 3: Diff workflow
- Add regression diff checks to release review.
- Create top-5 hotpath weekly report.
Week 4: Expand + deepen
- Expand coverage by service tier.
- Enable richer profile types (heap/alloc/mutex) where supported and valuable.
Deliverable: one page per service with Top Hot Paths, Costed Fix Candidates, Expected Savings, Owner.
10) KPI pack (operator view)
- Profiler Overhead % (CPU, per service/tier)
- Profile Coverage % (services/pods with valid profiles)
- Symbolization Success %
- Top-Hotpath Stability Index (week-over-week drift)
- Optimization Yield (infra cost or latency saved per accepted profile-driven change)
- MTTR improvement for performance incidents
If optimization yield is flat for multiple cycles, your profiling is collecting data but not changing decisions.
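Optimization Yield is the KPI teams most often leave fuzzy. One concrete definition (an assumption, not a standard metric) is mean savings per accepted profile-driven change per cycle:

```python
def optimization_yield(savings_usd_per_change: list) -> float:
    """Mean savings per accepted profile-driven change this cycle.

    savings_usd_per_change: one entry per accepted change (infra cost
    reduced, or latency converted to cost via a team-defined rate).
    Zero accepted changes -> yield 0: profiling is collecting data
    but not changing decisions.
    """
    if not savings_usd_per_change:
        return 0.0
    return sum(savings_usd_per_change) / len(savings_usd_per_change)
```

Tracking this per cycle gives the flat-yield signal in the sentence above a number you can actually trend.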
11) References
- Grafana Pyroscope docs — What is continuous profiling?
  https://grafana.com/docs/pyroscope/latest/introduction/continuous-profiling/
- Grafana docs — Set up profiling with eBPF with Grafana Alloy
  https://grafana.com/docs/pyroscope/latest/configure-client/grafana-alloy/ebpf/
- Parca Agent repository README
  https://github.com/parca-dev/parca-agent
- Parca docs — Parca Agent Design (19 Hz rationale, BPF CO-RE, data flow)
  https://www.parca.dev/docs/parca-agent-design/
- Google Cloud Profiler overview (collection cadence + amortized overhead examples)
  https://docs.cloud.google.com/profiler/docs/about-profiler
- Pixie blog — Building a Continuous Profiler Part 2 (eBPF sampling mechanics and symbolization overhead discussion)
  https://blog.px.dev/cpu-profiling-2/
TL;DR
Use hybrid continuous profiling:
- eBPF for wide CPU coverage,
- language agents for deep runtime insight,
- strict overhead/storage budgets,
- rollout gates,
- and a mandatory before/after diff workflow tied to release decisions.
That’s when profiling becomes a production control system, not just another graph.