TCP RACK-TLP Loss Recovery – Production Adoption Playbook
Date: 2026-03-27
Category: knowledge
Audience: platform / SRE / network engineers running latency-sensitive TCP services
1) Why this matters
Classic TCP loss detection (DupAck-threshold style) works well for steady large flights, but often underperforms in real production patterns:
- short RPC-style exchanges,
- application-limited bursts,
- tail-packet loss near end of response,
- moderate packet reordering.
In practice that failure shows up as: a few packets lost -> timeout path triggered -> latency tail explodes.
RACK-TLP (RFC 8985) is designed to reduce that failure mode:
- RACK: time-based loss inference using ACK/SACK timing,
- TLP: sends a probe to trigger ACK feedback and avoid waiting for full RTO.
In plain English: recover at RTT timescale more often, and fall back to RTO less often.
2) Practical effect you should expect
When rollout is healthy, you typically see:
- lower RTO-driven recoveries,
- more fast-recovery events instead of timeout recovery,
- tighter p95/p99 latency for small/medium responses,
- fewer long-tail retries at app layer.
Do not expect magic throughput gains everywhere. The biggest win is usually tail-behavior stability.
3) Mental model (operator version)
3.1 RACK
RACK treats loss as a time-based inference, not just "did we receive 3 duplicate ACKs?".
If newer data is acknowledged and an older segment remains unacked past a reordering allowance window, that old segment is inferred lost.
Why this helps:
- short flights that cannot generate enough DupAcks still get timely loss detection,
- lost retransmissions are easier to detect,
- moderate reordering is tolerated without over-triggering spurious recovery.
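The inference above can be sketched in a few lines (my simplification for intuition, not RFC 8985's exact state machine; all names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    send_time: float   # when this segment was (re)transmitted
    acked: bool        # whether it has been (S)ACKed

def rack_infers_lost(seg: Segment, newest_acked_send_time: float,
                     reo_wnd: float, now: float) -> bool:
    """Simplified RACK rule: an unacked segment is inferred lost once a
    segment sent LATER has been ACKed and the older segment has been
    outstanding longer than the reordering allowance window."""
    if seg.acked:
        return False
    sent_before_newest_ack = seg.send_time < newest_acked_send_time
    past_reo_window = now - seg.send_time > reo_wnd
    return sent_before_newest_ack and past_reo_window
```

Note the key property: no duplicate-ACK count appears anywhere, so a two-packet flight can still detect loss at RTT timescale.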
3.2 TLP
When ACKs are sparse near tail loss, TLP sends a probe segment to elicit ACK feedback quickly, converting many would-be timeout recoveries into fast recovery paths.
This directly attacks one of the most expensive latency branches: "last packet lost -> wait for RTO".
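A sketch of the probe timeout (PTO) computation, per my reading of RFC 8985 (the 200 ms worst-case delayed-ACK value and the exact max/min structure should be verified against the RFC before relying on this):

```python
def tlp_pto(srtt: float, rto: float, flight_size: int,
            wc_delack: float = 0.2) -> float:
    """Simplified TLP probe timeout, in seconds.
    wc_delack is a worst-case delayed-ACK allowance (assumed 200 ms)."""
    pto = 2 * srtt
    if flight_size == 1:
        # A lone in-flight segment may sit behind the receiver's
        # delayed-ACK timer, so give the probe extra headroom.
        pto = max(pto, 1.5 * srtt + wc_delack)
    return min(pto, rto)  # never probe later than the RTO would fire
```

With a 50 ms SRTT and a healthy flight, the probe fires at ~100 ms instead of waiting for a full RTO, which is the whole point of TLP.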
4) Linux knobs to know (and verify per kernel)
Kernel behavior evolves. Always check your exact kernel docs/version before automation.
From the Linux ip-sysctl docs:
- net.ipv4.tcp_recovery (bitmap):
  - 0x1: enable RACK loss detection (documented default)
  - 0x2: use a static reordering window of min_rtt/4
  - 0x4: disable the RACK DupAck-threshold heuristic
- net.ipv4.tcp_early_retrans:
  - 0: disable TLP
  - 3 or 4: enable TLP (documented default: 3)
  - note: the docs explicitly state TLP requires RACK
Useful runtime checks:
sysctl net.ipv4.tcp_recovery
sysctl net.ipv4.tcp_early_retrans
sysctl net.ipv4.tcp_reordering
sysctl net.ipv4.tcp_max_reordering
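To make bitmap drift obvious in tooling, a tiny decoder helps (the flag descriptions paraphrase the ip-sysctl docs; the helper name is my own):

```python
# Human-readable meanings of the net.ipv4.tcp_recovery bitmap bits.
TCP_RECOVERY_BITS = {
    0x1: "RACK loss detection enabled",
    0x2: "static reordering window (min_rtt/4)",
    0x4: "RACK DupAck-threshold heuristic disabled",
}

def decode_tcp_recovery(value: int) -> list[str]:
    """Return the flags set in a tcp_recovery value, in bit order."""
    return [desc for bit, desc in TCP_RECOVERY_BITS.items() if value & bit]
```

For example, decode_tcp_recovery(1) yields only the RACK flag, matching the documented default.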
Recommended posture:
- keep baseline defaults first,
- change one recovery-related knob at a time,
- canary by service tier and path class before global rollout.
5) Observability: what to dashboard before rollout
At minimum, capture these by service + region + path class:
- RTO rate (timeouts per connection or per 1k transactions),
- fast-recovery rate,
- retransmission rate (and spurious retrans indicators if available),
- tail latency (p95/p99/p99.9),
- app-level retry rate / timeout rate,
- reordering indicators (if your telemetry exports them).
Two useful derived metrics:
- Timeout Share = RTO recoveries / total recoveries
- Tail-Repair Gain = (p99_before - p99_after) / p99_before
If Timeout Share drops while p99 improves and error budget stays stable, rollout is likely on the right track.
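Both derived metrics are trivial to compute from the counters above (a sketch; the function names are mine, not standard telemetry fields):

```python
def timeout_share(rto_recoveries: int, total_recoveries: int) -> float:
    """Fraction of recoveries that took the expensive RTO path."""
    if total_recoveries == 0:
        return 0.0
    return rto_recoveries / total_recoveries

def tail_repair_gain(p99_before: float, p99_after: float) -> float:
    """Relative p99 improvement; positive means the tail got better."""
    return (p99_before - p99_after) / p99_before
```

E.g. a p99 that drops from 200 ms to 150 ms is a Tail-Repair Gain of 0.25.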
6) Rollout plan (low-regret)
Phase A – Baseline first (3-7 days)
- Freeze current recovery settings.
- Record weekday/weekend + peak/off-peak baselines.
- Segment by traffic shape (short RPC vs bulk stream).
Phase B – Narrow canary (5-10%)
- Start with latency-sensitive but blast-radius-limited services.
- Keep strict control group.
- Compare not only medians but tail and timeout counters.
Phase C – Expand by topology
Expand only if all hold:
- RTO rate improves or flat,
- p99 improved or flat,
- app retries/timeouts not worsening,
- no new instability in reorder-heavy paths.
Phase D – Full rollout + guardrail automation
Set rollback triggers as policy (not ad-hoc judgment):
- RTO rate > X% above baseline for Y minutes,
- p99 > X% above baseline with confidence gates,
- app timeout budget breach.
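The rollback triggers above can be encoded as data rather than ad-hoc judgment; a minimal sketch (thresholds and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str               # e.g. "rto_rate" or "p99_ms"
    baseline: float           # healthy-window value for this metric
    max_increase_pct: float   # X: allowed % increase over baseline
    sustain_minutes: int      # Y: minutes the breach must persist

    def breached(self, current: float, minutes_over: int) -> bool:
        """True once the metric exceeds baseline*(1+X%) for >= Y minutes."""
        limit = self.baseline * (1 + self.max_increase_pct / 100)
        return current > limit and minutes_over >= self.sustain_minutes
```

Usage: a guardrail of baseline 2.0 with max_increase_pct=20 trips only when the metric stays above 2.4 for the sustain window, which filters out transient spikes.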
7) Common failure modes
- Assuming all kernels behave identically: recovery internals vary by version and vendor backport.
- Skipping path segmentation: ECMP / wireless / cross-region paths can have different reordering behavior.
- Calling it a success from median latency only: RACK-TLP's value is mostly in tails and timeout avoidance.
- Changing congestion control and loss recovery together: hard to attribute wins and regressions; split the experiments.
- No app-layer correlation: TCP-level improvements should show up in app retry/timeout rates and SLOs; if not, look for app bottlenecks.
8) Quick incident triage checklist
When p99 suddenly worsens and network loss is suspected:
- Check tcp_recovery / tcp_early_retrans for drift vs. the expected config.
- Compare RTO share vs. the previous healthy window.
- Slice by AZ/region/ISP/path to isolate topology-driven reordering/loss domains.
- Inspect app timeout and retry bursts (transport issue should echo at app layer).
- If needed, rollback canary scope first, not whole fleet immediately.
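The first triage step (config drift) is easy to script; a sketch that reads the standard /proc/sys paths, with the comparison logic factored out so it is testable off-box (expected values here assume the documented defaults):

```python
from pathlib import Path

# Expected values for this fleet (assumed: documented defaults).
EXPECTED = {
    "net/ipv4/tcp_recovery": "1",
    "net/ipv4/tcp_early_retrans": "3",
}

def find_drift(actual: dict[str, str],
               expected: dict[str, str]) -> dict[str, tuple]:
    """Return {knob: (expected, actual)} for every mismatched knob."""
    return {k: (v, actual.get(k))
            for k, v in expected.items() if actual.get(k) != v}

def read_sysctls(expected: dict[str, str]) -> dict[str, str]:
    """Read current values from /proc/sys (Linux only)."""
    return {k: (Path("/proc/sys") / k).read_text().strip()
            for k in expected}
```

On a host, find_drift(read_sysctls(EXPECTED), EXPECTED) returning a non-empty dict is your "config drifted" signal.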
9) Bottom line
RACK-TLP is one of the highest-leverage TCP tail-latency stabilizers for modern RPC traffic: it wins mainly by converting expensive timeout recovery into faster ACK-driven recovery.
References
- RFC 8985 – The RACK-TLP Loss Detection Algorithm for TCP
  https://www.rfc-editor.org/rfc/rfc8985
- Linux kernel docs – IP Sysctl (tcp_recovery, tcp_early_retrans)
  https://docs.kernel.org/networking/ip-sysctl.html
- RFC 6675 – A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP
  https://www.rfc-editor.org/rfc/rfc6675
- RFC 6298 – Computing TCP's Retransmission Timer
  https://www.rfc-editor.org/rfc/rfc6298
- RFC 5681 – TCP Congestion Control
  https://www.rfc-editor.org/rfc/rfc5681