CRIU Checkpoint/Restore in Production — Operator Playbook

2026-03-30 · software

Date: 2026-03-30
Category: knowledge
Audience: platform / SRE / container runtime operators

1) Why this matters now

Checkpoint/Restore In Userspace (CRIU) is no longer just a niche migration trick. In practical operations, it is becoming a useful primitive for:

The critical shift: checkpointing is now exposed in mainstream orchestrator/runtime surfaces, so operators can build repeatable procedures instead of one-off experiments.


2) Ecosystem status (practical view)

Kubernetes surface

Kubernetes documents a kubelet checkpoint API (POST /checkpoint/{namespace}/{pod}/{container}). According to the docs, the feature has been beta and enabled by default since v1.30.
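A minimal sketch of calling that endpoint from Python, assuming direct access to the kubelet on its default secure port 10250 and client-certificate authentication. The function names, the certificate file parameters, and the example node name are hypothetical; only the URL path comes from the Kubernetes documentation.

```python
import json
import ssl
import urllib.request
from urllib.parse import quote


def checkpoint_url(node: str, namespace: str, pod: str, container: str,
                   port: int = 10250) -> str:
    """Build the kubelet checkpoint URL for one container.

    Port 10250 is the kubelet default; adjust if your nodes differ.
    """
    path = "/".join(quote(p, safe="") for p in (namespace, pod, container))
    return f"https://{node}:{port}/checkpoint/{path}"


def request_checkpoint(url: str, ca_file: str, cert_file: str, key_file: str,
                       timeout: float = 60.0) -> dict:
    """POST to the kubelet and return the decoded JSON response body.

    Assumption: the caller holds a client certificate that the kubelet
    authorizes for checkpoint requests.
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping URL construction separate from the request makes the path easy to unit-test and to log in an audit trail before any call is made.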

Operational implication:

Runtime surface

Podman surface

Podman’s checkpoint tooling is operationally mature enough for lab-to-prod pilots:

That makes Podman a strong “proving ground” before wiring equivalent flows into cluster automation.
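As a starting point for such a pilot, the checkpoint and restore commands can be assembled programmatically so that every run uses the same flags. This is a sketch under assumptions: the helper names are hypothetical, and the flags used (--export, --import, --name, --tcp-established, --leave-running) should be confirmed against the podman-container-checkpoint and podman-container-restore man pages for your Podman version.

```python
from typing import List, Optional


def podman_checkpoint_cmd(container: str, archive: str,
                          tcp_established: bool = False,
                          leave_running: bool = False) -> List[str]:
    """Return argv for exporting a container checkpoint to an archive."""
    cmd = ["podman", "container", "checkpoint", f"--export={archive}"]
    if tcp_established:
        cmd.append("--tcp-established")  # also checkpoint established TCP connections
    if leave_running:
        cmd.append("--leave-running")    # keep the source container running
    cmd.append(container)
    return cmd


def podman_restore_cmd(archive: str, name: Optional[str] = None,
                       tcp_established: bool = False) -> List[str]:
    """Return argv for restoring a container from an exported archive."""
    cmd = ["podman", "container", "restore", f"--import={archive}"]
    if name is not None:
        cmd.append(f"--name={name}")     # restore under a different container name
    if tcp_established:
        cmd.append("--tcp-established")
    return cmd
```

The lists can be passed to subprocess.run unchanged. Building them as data first also lets a pilot record the exact invocation next to each artifact.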


3) The real production constraints (where projects fail)

3.1 Kernel/runtime capability mismatch

Most failed pilots are not architecture failures; they are compatibility failures:

Rule: treat checkpoint/restore like ABI-sensitive migration, not like generic image portability.
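One way to apply this rule is to fingerprint each node and refuse a restore when source and destination differ. The sketch below is illustrative only: the chosen fields (kernel release, CPU architecture, CRIU version, runtime version) and the strict-equality policy are assumptions, not a list taken from CRIU documentation, and real deployments may allow some version skew.

```python
import platform
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class NodeFingerprint:
    kernel_release: str
    machine: str
    criu_version: str      # supplied by the caller, e.g. from node inventory
    runtime_version: str   # supplied by the caller, e.g. from node inventory


def local_fingerprint(criu_version: str, runtime_version: str) -> NodeFingerprint:
    """Fingerprint the node this code runs on."""
    u = platform.uname()
    return NodeFingerprint(u.release, u.machine, criu_version, runtime_version)


def mismatches(src: NodeFingerprint, dst: NodeFingerprint) -> List[str]:
    """Return the names of fields that differ between source and destination."""
    return [f.name for f in fields(src)
            if getattr(src, f.name) != getattr(dst, f.name)]
```

An empty result means the pair passes this conservative check. Any non-empty result is a reason to test the pair explicitly before allowing restores between them.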

3.2 Network continuity assumptions

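Restoring established TCP connections with CRIU depends on the destination network matching what the checkpointed sockets expect: the same local addresses must be available and the routes to peers must still be valid.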
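One precondition that can be checked before attempting a restore is whether the destination can bind the local address the workload used. The sketch below assumes that address is known from your own inventory. The helper name is hypothetical, and a successful bind is necessary but not sufficient: it says nothing about routes or peer reachability.

```python
import socket


def address_is_bindable(ip: str) -> bool:
    """Return True if a TCP socket can bind to `ip` on this host.

    Port 0 asks the kernel for any free port, so the check does not
    collide with services already listening.
    """
    family = socket.AF_INET6 if ":" in ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, 0))
        except OSError:
            return False
    return True
```

Run it on the destination node (or inside the destination network namespace) with the source workload's address before scheduling the restore.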

3.3 Filesystem and volume truth

A checkpoint can restore process memory, but correctness still depends on filesystem reality:

3.4 Security blast radius

Checkpoint archives are memory snapshots on disk. This means secrets and private material can be present in extractable form.

Treat checkpoint artifacts as high-sensitivity secrets, not ordinary logs.


4) Security baseline you should require

Before any broad rollout, enforce all of the following:

  1. Root-only artifact access on node checkpoint paths.
  2. Encryption at rest for checkpoint storage backends.
  3. TTL and deletion policy (short by default for forensic/ops snapshots).
  4. Transfer controls (signed, encrypted movement between nodes/registries).
  5. Strict RBAC for checkpoint API invocations.
  6. Audit trails: who triggered checkpoint, which workload, where artifact moved.
  7. Sandbox-only restore for forensic workflows unless explicitly approved.

If any one of these is missing, you are effectively creating a memory exfiltration pipeline.
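Items 1 and 3 of the baseline can be verified mechanically on each node. The following is a minimal audit sketch, assuming artifacts are regular files and that "root-only" means owned by UID 0 with no group or other permission bits. The function name and the finding labels are hypothetical.

```python
import os
import stat
import time
from typing import List, Optional


def audit_artifact(path: str, max_age_seconds: float, required_uid: int = 0,
                   now: Optional[float] = None) -> List[str]:
    """Return a list of policy findings for one checkpoint artifact.

    Possible findings: "owner" (wrong owner), "mode" (group/other bits set),
    "ttl" (older than the retention limit). An empty list means compliant.
    """
    st = os.stat(path)
    now = time.time() if now is None else now
    findings = []
    if st.st_uid != required_uid:
        findings.append("owner")
    if stat.S_IMODE(st.st_mode) & 0o077:
        findings.append("mode")
    if now - st.st_mtime > max_age_seconds:
        findings.append("ttl")
    return findings
```

The sketch only reports violations. Whether a "ttl" finding leads to automatic deletion is a separate policy decision, because forensic artifacts may need a documented hold.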


5) Rollout strategy that actually works

Phase 0 — Capability inventory

Build a compatibility matrix per node pool:

Run criu check --all (or an equivalent validation) in CI for your node image pipeline.
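A CI step for this can be a thin wrapper around the command. The sketch below assumes a non-zero exit status means the check failed and treats a missing criu binary as a failure. The wrapper names are hypothetical, and the runner is injectable so the logic can be tested without CRIU installed.

```python
import subprocess
from typing import Callable, List, Tuple

Runner = Callable[[List[str]], Tuple[int, str]]


def run_command(argv: List[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError:
        return 127, f"{argv[0]}: command not found"
    return proc.returncode, proc.stdout + proc.stderr


def criu_check(runner: Runner = run_command) -> Tuple[bool, str]:
    """Run 'criu check --all' and return (passed, output).

    Assumption: exit code 0 means every check passed.
    """
    code, output = runner(["criu", "check", "--all"])
    return code == 0, output
```

Because --all also covers optional and experimental features, a failure may need triage rather than an automatic block. Storing the output alongside the node image version gives the compatibility matrix its evidence.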

Phase 1 — Single workload class pilot

Pick one low-risk but stateful service and define explicit success criteria:

Phase 2 — Controlled migration patterns

Start with planned maintenance migrations where rollback is easy.

Add iterative/pre-copy approaches when memory footprint is high and freeze budget is tight.
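To judge whether pre-copy is worth the complexity for a workload, a back-of-the-envelope model helps. The sketch below is a deliberate simplification with hypothetical parameter names: it assumes a constant dirty-page rate and constant transfer bandwidth, and it ignores compression and page deduplication. Each round transfers what is currently outstanding while the workload dirties more memory. The final freeze only has to move what remains.

```python
from typing import Optional


def precopy_rounds(memory_mb: float, dirty_rate_mb_s: float,
                   bandwidth_mb_s: float, freeze_budget_s: float,
                   max_rounds: int = 10) -> Optional[int]:
    """Return how many pre-copy rounds are needed before the final
    stop-and-copy fits the freeze budget, or None if it never does
    within max_rounds."""
    remaining = memory_mb
    for rounds in range(max_rounds + 1):
        transfer_time = remaining / bandwidth_mb_s
        if transfer_time <= freeze_budget_s:
            return rounds
        # Memory dirtied while this round was in flight must be sent again.
        remaining = min(memory_mb, dirty_rate_mb_s * transfer_time)
    return None
```

For example, 8000 MB of memory, a 100 MB/s dirty rate, 1000 MB/s of bandwidth, and a 0.5 s freeze budget converge after two rounds in this model. When the dirty rate meets or exceeds the bandwidth, the function returns None, which is the signal that pre-copy alone will not meet the budget.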

Phase 3 — Forensics track

Split operational and forensic tracks:

Do not merge artifact retention or access policies between these tracks.


6) Metrics to monitor (minimum set)

Track these per runtime version and workload class:

Without these metrics, you cannot distinguish “works in demo” from “safe in production.”
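A sketch of the per-group aggregation this implies is shown below. The grouping keys (runtime version and workload class) come from the text above. The specific measurements (success rate and worst-case freeze time) and all names in the code are illustrative assumptions, not a prescribed metric set.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RestoreAttempt:
    runtime_version: str
    workload_class: str
    succeeded: bool
    freeze_seconds: float


def summarize(attempts: Iterable[RestoreAttempt]) -> Dict[Tuple[str, str], dict]:
    """Aggregate attempts per (runtime_version, workload_class)."""
    groups: Dict[Tuple[str, str], List[RestoreAttempt]] = defaultdict(list)
    for attempt in attempts:
        groups[(attempt.runtime_version, attempt.workload_class)].append(attempt)
    summary = {}
    for key, items in groups.items():
        successes = sum(1 for a in items if a.succeeded)
        summary[key] = {
            "attempts": len(items),
            "success_rate": successes / len(items),
            "max_freeze_seconds": max(a.freeze_seconds for a in items),
        }
    return summary
```

Grouping by the runtime and workload pair keeps a regression in one runtime version from being averaged away by healthy results elsewhere.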


7) Failure modes and immediate responses

  1. Restore fails on destination
    Action: fall back to the cold-start path, preserve logs/artifacts, mark the node/runtime tuple incompatible.

  2. Network state restore instability
    Action: disable established-TCP restore mode for that workload class; use reconnect-aware application strategy.

  3. Checkpoint artifact growth runaway
    Action: enforce quotas + TTL pruning + compression policy review.

  4. Security concern around memory artifacts
    Action: halt non-essential checkpoint creation, rotate secrets that may have been exposed, audit artifact access logs.
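The response to failure mode 1 can be sketched as a small wrapper around the restore call. All names here are hypothetical, and the restore and cold-start callables stand in for whatever your runtime integration provides. The wrapper deliberately never deletes the artifact, so it stays available for diagnosis.

```python
from typing import Callable, Set, Tuple

NodeRuntime = Tuple[str, str]  # (node pool or node name, runtime version)


def restore_with_fallback(target: NodeRuntime,
                          restore: Callable[[], None],
                          cold_start: Callable[[], None],
                          incompatible: Set[NodeRuntime]) -> str:
    """Try a restore on `target`; fall back to a cold start on failure.

    Returns "restored" or "cold_start". A failed restore adds the target
    to `incompatible`, so later calls skip the restore attempt entirely.
    """
    if target in incompatible:
        cold_start()
        return "cold_start"
    try:
        restore()
    except Exception:
        # Record the tuple so it is not retried until someone clears it.
        incompatible.add(target)
        cold_start()
        return "cold_start"
    return "restored"
```

Persisting the incompatible set (for example, in the compatibility matrix from Phase 0) turns one failed restore into a durable scheduling constraint instead of a repeated incident.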


8) Practical decision framework

Use checkpoint/restore when all are true:

Avoid it when:


9) Bottom line

CRIU-based checkpoint/restore is becoming a practical operator tool, but only for teams that treat it as a controlled systems capability, not as a magic mobility switch.

The winning posture is simple:

If you do those four, checkpoint/restore can move from “cool demo” to reliable production primitive.


References

  1. Kubernetes Docs — Kubelet Checkpoint API
    https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
  2. Kubernetes Blog — Forensic container checkpointing in Kubernetes (historical context, alpha-era)
    https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
  3. CRIU Wiki — Kubernetes integration notes
    https://criu.org/Kubernetes
  4. CRIU Wiki — Live migration
    https://criu.org/Live_migration
  5. CRIU Wiki — Iterative migration
    https://criu.org/Iterative_migration
  6. CRIU Wiki — Lazy migration
    https://criu.org/Lazy_migration
  7. Podman Docs — Checkpoint overview
    https://podman.io/docs/checkpoint
  8. Podman Manpage — podman container checkpoint
    https://docs.podman.io/en/latest/markdown/podman-container-checkpoint.1.html
  9. CRIU Man Page (example distro mirror) — operational options and caveats
    https://man.archlinux.org/man/criu.8.en