P4 + INT/IOAM in Production: A Practical Adoption Playbook

2026-03-28 · systems

Why this note

I wanted a compact, implementation-facing map for when and how to deploy in-band telemetry (INT/IOAM) without blowing up MTU, switch budgets, or collector complexity.

This is not a protocol spec rewrite. It is an operator-oriented synthesis.


TL;DR


1) Standards/Spec map (what is what)

P4 language + control plane

Telemetry data plane

Export/reporting


2) Operational meaning of INT modes
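
As a rough orientation for the three modes in this section, the sketch below compares where the cost lands: in the data packet or in reports sent toward the collector. It follows the mode definitions in the INT dataplane spec (MD embeds instructions and metadata, MX embeds only instructions and exports metadata per hop, XD leaves the packet untouched and exports per hop). The function name, the byte counts, and the assumption that every hop reports on every monitored packet are illustrative, not taken from the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeCost:
    added_packet_bytes: int  # bytes added to the data packet by the end of the path
    reports_exported: int    # reports sent toward the collector for one packet


def mode_cost(mode: str, hops: int, instruction_bytes: int, per_hop_bytes: int) -> ModeCost:
    """Toy cost model for one monitored packet crossing `hops` INT nodes.

    Assumption: in MX and XD every hop exports one report per monitored packet;
    in MD only the sink exports a single report carrying the whole stack.
    """
    if mode == "MD":
        # Instructions plus a metadata stack that grows at every hop.
        return ModeCost(instruction_bytes + hops * per_hop_bytes, 1)
    if mode == "MX":
        # Only the instruction header rides in the packet; metadata is exported.
        return ModeCost(instruction_bytes, hops)
    if mode == "XD":
        # Packet is not modified; nodes export based on local configuration.
        return ModeCost(0, hops)
    raise ValueError(f"unknown INT mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("MD", "MX", "XD"):
        print(mode, mode_cost(mode, hops=5, instruction_bytes=16, per_hop_bytes=16))
```

The point of the comparison is the trade: MD spends packet bytes to save collector load, while MX and XD spend report volume to keep the packet size bounded.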

INT-MD (Embedded Data)

Mechanics

Pros

Cons

Use when


INT-MX (Embedded instructions, direct export)

Mechanics

Pros

Cons

Use when


INT-XD / Postcard style

Mechanics

Pros

Cons

Use when


3) MTU/overhead budgeting rule (must do before rollout)

At design time, compute and enforce:

payload_headroom >= telemetry_overhead_worst_case

For MD-like stacking:

telemetry_overhead_worst_case
  = fixed_int_headers + (max_hops_in_domain * bytes_per_hop_metadata)

Then enforce one (or more):

INT documentation and industry writeups repeatedly highlight linear overhead growth with path depth/metadata richness; treat this as a hard design constraint, not an optimization detail.
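
The budgeting rule above can be turned into a small pre-rollout check. This is a minimal sketch: it assumes payload_headroom means path MTU minus the largest original packet you expect to carry, and the byte counts in the example are placeholders rather than values from the INT spec. Substitute the header and per-hop sizes for the instruction set you actually enable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntBudget:
    """Inputs to the worst-case overhead formula (all sizes in bytes)."""
    fixed_int_headers: int       # added once per packet
    bytes_per_hop_metadata: int  # appended by each hop in MD-like stacking
    max_hops_in_domain: int      # longest path inside the INT domain

    def worst_case_overhead(self) -> int:
        return self.fixed_int_headers + self.max_hops_in_domain * self.bytes_per_hop_metadata


def payload_headroom(path_mtu: int, max_original_packet: int) -> int:
    # Assumption: headroom is whatever the path MTU leaves above the largest
    # packet before any telemetry is added.
    return path_mtu - max_original_packet


def fits(budget: IntBudget, path_mtu: int, max_original_packet: int) -> bool:
    """True when payload_headroom >= telemetry_overhead_worst_case."""
    return payload_headroom(path_mtu, max_original_packet) >= budget.worst_case_overhead()


def max_hops_that_fit(budget: IntBudget, path_mtu: int, max_original_packet: int) -> int:
    """Largest hop count whose metadata stack still fits in the headroom."""
    spare = payload_headroom(path_mtu, max_original_packet) - budget.fixed_int_headers
    return max(0, spare // budget.bytes_per_hop_metadata)


if __name__ == "__main__":
    # Placeholder numbers: 16 B fixed, 16 B per hop, 5 hops -> 96 B worst case.
    budget = IntBudget(fixed_int_headers=16, bytes_per_hop_metadata=16, max_hops_in_domain=5)
    print(budget.worst_case_overhead())          # 96
    print(fits(budget, 9000, 1500))              # True: jumbo fabric, ample headroom
    print(fits(budget, 1500, 1460))              # False: only 40 B of headroom
    print(max_hops_that_fit(budget, 1500, 1460)) # 1
```

Running a check like this per domain makes the linear growth concrete: with the placeholder sizes, a 1500-byte path carrying 1460-byte packets has room for one hop of metadata, not five.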


4) Collector-first architecture (often ignored, then painful)

Before enabling data plane telemetry at scale, define collector semantics:

If these are undefined, telemetry quality degrades faster than packet forwarding quality.


5) P4Runtime control-plane safety pattern

From the P4Runtime model:

Practical pattern:

  1. Two HA controllers + explicit election IDs.
  2. One writes (the primary, i.e. the controller holding the highest election ID); the other is a hot standby that only reads and validates.
  3. Pipeline reconfiguration gated by change windows and pre-flight tests.
  4. Rollback artifact always available (previous P4Info + pipeline config blob).
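
To make the writer/standby split in the pattern above concrete, here is a toy model of primary selection for a single (device_id, role) pair. It does not talk to a real P4Runtime server and uses no P4Runtime library; the class and method names are hypothetical. The one rule it encodes is that the controller with the highest election ID is the primary and only the primary may write. Failover after a disconnect is deliberately not modeled; check the P4Runtime spec for the exact behavior before relying on it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True, order=True)
class ElectionId:
    """Election ID as two integers, compared as (high, low)."""
    high: int
    low: int


class ArbitrationModel:
    """Toy primary selection for one (device_id, role) pair (illustrative only)."""

    def __init__(self) -> None:
        self._controllers: Dict[str, ElectionId] = {}

    def connect(self, name: str, election_id: ElectionId) -> None:
        # The toy model rejects duplicate IDs so the primary is never ambiguous.
        if election_id in self._controllers.values():
            raise ValueError("election_id already in use for this device/role")
        self._controllers[name] = election_id

    def primary(self) -> Optional[str]:
        if not self._controllers:
            return None
        # Highest election ID among connected controllers wins.
        return max(self._controllers, key=self._controllers.__getitem__)

    def may_write(self, name: str) -> bool:
        return self.primary() == name


if __name__ == "__main__":
    model = ArbitrationModel()
    model.connect("ctrl-a", ElectionId(high=0, low=10))
    model.connect("ctrl-b", ElectionId(high=0, low=5))
    print(model.primary())            # ctrl-a
    print(model.may_write("ctrl-b"))  # False: standby can read, not write
```

Assigning the election IDs explicitly, rather than letting each controller pick its own, keeps the intended writer predictable during planned maintenance.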

6) A staged rollout recipe (what I’d actually run)

Phase 0: Lab baseline

Phase 1: Canary domain (real traffic, narrow scope)

Phase 2: Controlled expansion

Phase 3: Steady state


7) When to avoid “full INT everywhere”

Choose postcard/probabilistic alternatives first if:

The Postcard-based telemetry draft and PINT work both point to the same theme: you often don’t need every hop’s full metadata on every packet to drive useful operations.


8) Personal decision heuristic

If I must choose quickly:

  1. Need exact per-packet path story for a small critical flow set? → MD pilot.
  2. Need broad observability with bounded packet impact? → MX.
  3. Need safest production blast radius first? → XD/Postcard pattern.
  4. Need even lower overhead for aggregate control loops? → probabilistic telemetry ideas (PINT-like).
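
The same heuristic written as a lookup, in case it is useful in a runbook or a design-review template. The enum names and the recommend helper are made up for this note; the mapping is just the four rules above, one need per call.

```python
from enum import Enum


class Need(Enum):
    EXACT_PER_PACKET_PATH = "exact per-packet path story for a small critical flow set"
    BROAD_BOUNDED_IMPACT = "broad observability with bounded packet impact"
    SAFEST_BLAST_RADIUS = "safest production blast radius first"
    LOWEST_OVERHEAD_AGGREGATE = "even lower overhead for aggregate control loops"


# Direct transcription of the four rules; nothing is inferred beyond them.
RECOMMENDATION = {
    Need.EXACT_PER_PACKET_PATH: "INT-MD pilot",
    Need.BROAD_BOUNDED_IMPACT: "INT-MX",
    Need.SAFEST_BLAST_RADIUS: "INT-XD / Postcard pattern",
    Need.LOWEST_OVERHEAD_AGGREGATE: "probabilistic telemetry (PINT-like)",
}


def recommend(need: Need) -> str:
    """Return the starting point suggested for a single dominant need."""
    return RECOMMENDATION[need]


if __name__ == "__main__":
    for need in Need:
        print(f"{need.value} -> {recommend(need)}")
```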

References

  1. P4 Specifications page (P4-16, P4Runtime, INT, PSA/PNA): https://p4.org/specifications/
  2. P4-16 Language Specification v1.2.5: https://p4.org/wp-content/uploads/sites/53/2024/10/P4-16-spec-v1.2.5.html
  3. P4Runtime spec (main): https://p4lang.github.io/p4runtime/spec/main/P4Runtime-Spec.html
  4. P4Runtime spec source (adoc): https://raw.githubusercontent.com/p4lang/p4runtime/main/docs/v1/P4Runtime-Spec.adoc
  5. RFC 9197 (IOAM Data Fields): https://datatracker.ietf.org/doc/html/rfc9197
  6. INT Dataplane spec source (v2.1 text source): https://raw.githubusercontent.com/p4lang/p4-applications/master/telemetry/specs/INT.mdk
  7. Telemetry Report Format spec source: https://raw.githubusercontent.com/p4lang/p4-applications/master/telemetry/specs/telemetry_report.mdk
  8. P4 tutorial MRI exercise (queue/path instrumentation example): https://raw.githubusercontent.com/p4lang/tutorials/master/exercises/mri/README.md
  9. Postcard-based telemetry draft (historical/informative): https://datatracker.ietf.org/doc/draft-song-ippm-postcard-based-telemetry/02/
  10. PINT overview (APNIC summary + SIGCOMM link): https://blog.apnic.net/2020/11/17/pint-probabilistic-in-band-network-telemetry/
  11. HPCC-PINT repo README (one-byte-overhead simulation context): https://raw.githubusercontent.com/ProbabilisticINT/HPCC-PINT/master/README.md