TCP RACK-TLP Loss Recovery – Production Adoption Playbook
Date: 2026-03-27
Category: knowledge
Audience: platform / SRE / network engineers running latency-sensitive TCP services
1) Why this matters
Classic TCP loss detection (DupAck-threshold style) works well for steady large flights, but often underperforms in real production patterns:
- short RPC-style exchanges,
- application-limited bursts,
- tail-packet loss near end of response,
- moderate packet reordering.
In practice that failure shows up as: a few packets lost -> timeout path triggered -> latency tail explodes.
RACK-TLP (RFC 8985) is designed to reduce that failure mode:
- RACK: time-based loss inference using ACK/SACK timing,
- TLP: sends a probe to trigger ACK feedback and avoid waiting for full RTO.
In plain English: recover at RTT timescale more often, and fall back to RTO less often.
2) Practical effect you should expect
When rollout is healthy, you typically see:
- lower RTO-driven recoveries,
- more fast-recovery events instead of timeout recovery,
- tighter p95/p99 latency for small/medium responses,
- fewer long-tail retries at app layer.
Do not expect magic throughput gains everywhere. The biggest win is usually tail-behavior stability.
3) Mental model (operator version)
3.1 RACK
RACK treats loss as a time-based inference, not just "did we receive 3 duplicate ACKs?".
If newer data is acknowledged and an older segment remains unacked past a reordering allowance window, that old segment is inferred lost.
Why this helps:
- short flights that cannot generate enough DupAcks still get timely loss detection,
- lost retransmissions are easier to detect,
- moderate reordering is tolerated without over-triggering spurious recovery.
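The inference above can be sketched in a few lines (my simplification for intuition, not RFC 8985's exact state machine; all names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Segment:
    send_time: float   # when this segment was (re)transmitted
    acked: bool        # whether it has been (S)ACKed

def rack_infers_lost(seg: Segment, newest_acked_send_time: float,
                     reo_wnd: float, now: float) -> bool:
    """Simplified RACK rule: an unacked segment is inferred lost once a
    segment sent LATER has been ACKed and the older segment has been
    outstanding longer than the reordering allowance window."""
    if seg.acked:
        return False
    sent_before_newest_ack = seg.send_time < newest_acked_send_time
    past_reo_window = now - seg.send_time > reo_wnd
    return sent_before_newest_ack and past_reo_window
```

Note the key property: no duplicate-ACK count appears anywhere, so a two-packet flight can still detect loss at RTT timescale.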
3.2 TLP
When ACKs are sparse near tail loss, TLP sends a probe segment to elicit ACK feedback quickly, converting many would-be timeout recoveries into fast recovery paths.
This directly attacks one of the most expensive latency branches: "last packet lost -> wait for RTO".
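A sketch of the probe timeout (PTO) computation, per my reading of RFC 8985 (the 200 ms worst-case delayed-ACK value and the exact max/min structure should be verified against the RFC before relying on this):

```python
def tlp_pto(srtt: float, rto: float, flight_size: int,
            wc_delack: float = 0.2) -> float:
    """Simplified TLP probe timeout, in seconds.
    wc_delack is a worst-case delayed-ACK allowance (assumed 200 ms)."""
    pto = 2 * srtt
    if flight_size == 1:
        # A lone in-flight segment may sit behind the receiver's
        # delayed-ACK timer, so give the probe extra headroom.
        pto = max(pto, 1.5 * srtt + wc_delack)
    return min(pto, rto)  # never probe later than the RTO would fire
```

With a 50 ms SRTT and a healthy flight, the probe fires at ~100 ms instead of waiting for a full RTO, which is the whole point of TLP.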
4) Linux knobs to know (and verify per kernel)
Kernel behavior evolves. Always check your exact kernel docs/version before automation.
From the Linux ip-sysctl docs:
- net.ipv4.tcp_recovery (bitmap):
  - 0x1: enable RACK loss detection (documented default)
  - 0x2: use a static reordering window of min_rtt/4
  - 0x4: disable the RACK DupAck-threshold heuristic
- net.ipv4.tcp_early_retrans:
  - 0: disable TLP
  - 3 or 4: enable TLP (documented default: 3)
  - note: the docs explicitly state TLP requires RACK
Useful runtime checks:
sysctl net.ipv4.tcp_recovery
sysctl net.ipv4.tcp_early_retrans
sysctl net.ipv4.tcp_reordering
sysctl net.ipv4.tcp_max_reordering
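To make bitmap drift obvious in tooling, a tiny decoder helps (the flag descriptions paraphrase the ip-sysctl docs; the helper name is my own):

```python
# Human-readable meanings of the net.ipv4.tcp_recovery bitmap bits.
TCP_RECOVERY_BITS = {
    0x1: "RACK loss detection enabled",
    0x2: "static reordering window (min_rtt/4)",
    0x4: "RACK DupAck-threshold heuristic disabled",
}

def decode_tcp_recovery(value: int) -> list[str]:
    """Return the flags set in a tcp_recovery value, in bit order."""
    return [desc for bit, desc in TCP_RECOVERY_BITS.items() if value & bit]
```

For example, decode_tcp_recovery(1) yields only the RACK flag, matching the documented default.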
Recommended posture:
- keep baseline defaults first,
- change one recovery-related knob at a time,
- canary by service tier and path class before global rollout.
5) Observability: what to dashboard before rollout
At minimum, capture these by service + region + path class:
- RTO rate (timeouts per connection or per 1k transactions),
- fast-recovery rate,
- retransmission rate (and spurious retrans indicators if available),
- tail latency (p95/p99/p99.9),
- app-level retry rate / timeout rate,
- reordering indicators (if your telemetry exports them).
Two useful derived metrics:
- Timeout Share = RTO recoveries / total recoveries
- Tail-Repair Gain = (p99_before - p99_after) / p99_before
If Timeout Share drops while p99 improves and error budget stays stable, rollout is likely on the right track.
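Both derived metrics are trivial to compute from the counters above (a sketch; the function names are mine, not standard telemetry fields):

```python
def timeout_share(rto_recoveries: int, total_recoveries: int) -> float:
    """Fraction of recoveries that took the expensive RTO path."""
    if total_recoveries == 0:
        return 0.0
    return rto_recoveries / total_recoveries

def tail_repair_gain(p99_before: float, p99_after: float) -> float:
    """Relative p99 improvement; positive means the tail got better."""
    return (p99_before - p99_after) / p99_before
```

E.g. a p99 that drops from 200 ms to 150 ms is a Tail-Repair Gain of 0.25.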
6) Rollout plan (low-regret)
Phase A – Baseline first (3-7 days)
- Freeze current recovery settings.
- Record weekday/weekend + peak/off-peak baselines.
- Segment by traffic shape (short RPC vs bulk stream).
Phase B – Narrow canary (5-10%)
- Start with latency-sensitive but blast-radius-limited services.
- Keep strict control group.
- Compare not only medians but tail and timeout counters.
Phase C – Expand by topology
Expand only if all hold:
- RTO rate improves or flat,
- p99 improved or flat,
- app retries/timeouts not worsening,
- no new instability in reorder-heavy paths.
Phase D – Full rollout + guardrail automation
Set rollback triggers as policy (not ad-hoc judgment):
- RTO rate > X% above baseline for Y minutes,
- p99 > X% above baseline with confidence gates,
- app timeout budget breach.
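The rollback triggers above can be encoded as data rather than ad-hoc judgment; a minimal sketch (thresholds and field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str               # e.g. "rto_rate" or "p99_ms"
    baseline: float           # healthy-window value for this metric
    max_increase_pct: float   # X: allowed % increase over baseline
    sustain_minutes: int      # Y: minutes the breach must persist

    def breached(self, current: float, minutes_over: int) -> bool:
        """True once the metric exceeds baseline*(1+X%) for >= Y minutes."""
        limit = self.baseline * (1 + self.max_increase_pct / 100)
        return current > limit and minutes_over >= self.sustain_minutes
```

Usage: a guardrail of baseline 2.0 with max_increase_pct=20 trips only when the metric stays above 2.4 for the sustain window, which filters out transient spikes.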
7) Common failure modes
- Assuming all kernels behave identically: recovery internals vary by version and vendor backport.
- Skipping path segmentation: ECMP / wireless / cross-region paths can have different reordering behavior.
- Calling it a success from median latency only: RACK-TLP's value is mostly in tails and timeout avoidance.
- Changing congestion control and loss recovery together: hard to attribute wins and regressions; split the experiments.
- No app-layer correlation: TCP-level improvements should show up in app retry/timeout rates and SLOs; if not, look for app bottlenecks.
8) Quick incident triage checklist
When p99 suddenly worsens and network loss is suspected:
- Check tcp_recovery / tcp_early_retrans for drift vs. the expected config.
- Compare RTO share vs. the previous healthy window.
- Slice by AZ/region/ISP/path to isolate topology-driven reordering/loss domains.
- Inspect app timeout and retry bursts (transport issue should echo at app layer).
- If needed, rollback canary scope first, not whole fleet immediately.
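The first triage step (config drift) is easy to script; a sketch that reads the standard /proc/sys paths, with the comparison logic factored out so it is testable off-box (expected values here assume the documented defaults):

```python
from pathlib import Path

# Expected values for this fleet (assumed: documented defaults).
EXPECTED = {
    "net/ipv4/tcp_recovery": "1",
    "net/ipv4/tcp_early_retrans": "3",
}

def find_drift(actual: dict[str, str],
               expected: dict[str, str]) -> dict[str, tuple]:
    """Return {knob: (expected, actual)} for every mismatched knob."""
    return {k: (v, actual.get(k))
            for k, v in expected.items() if actual.get(k) != v}

def read_sysctls(expected: dict[str, str]) -> dict[str, str]:
    """Read current values from /proc/sys (Linux only)."""
    return {k: (Path("/proc/sys") / k).read_text().strip()
            for k in expected}
```

On a host, find_drift(read_sysctls(EXPECTED), EXPECTED) returning a non-empty dict is your "config drifted" signal.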
9) Bottom line
RACK-TLP is one of the highest-leverage TCP tail-latency stabilizers for modern RPC traffic: it wins mainly by converting expensive timeout recovery into faster ACK-driven recovery.
References
- RFC 8985 – The RACK-TLP Loss Detection Algorithm for TCP
  https://www.rfc-editor.org/rfc/rfc8985
- Linux kernel docs – IP Sysctl (tcp_recovery, tcp_early_retrans)
  https://docs.kernel.org/networking/ip-sysctl.html
- RFC 6675 – A Conservative Loss Recovery Algorithm Based on Selective Acknowledgment (SACK) for TCP
  https://www.rfc-editor.org/rfc/rfc6675
- RFC 6298 – Computing TCP's Retransmission Timer
  https://www.rfc-editor.org/rfc/rfc6298
- RFC 5681 – TCP Congestion Control
  https://www.rfc-editor.org/rfc/rfc5681