CRIU Checkpoint/Restore in Production — Operator Playbook
Date: 2026-03-30
Category: knowledge
Audience: platform / SRE / container runtime operators
1) Why this matters now
Checkpoint/Restore In Userspace (CRIU) is no longer just a niche migration trick. In practical operations, it is becoming a useful primitive for:
- forensic snapshots of suspicious workloads,
- stateful container relocation with lower warmup cost,
- maintenance windows where restart-from-zero is too expensive,
- incident replay in isolated sandboxes.
The critical shift: checkpointing is now exposed in mainstream orchestrator/runtime surfaces, so operators can build repeatable procedures instead of one-off experiments.
2) Ecosystem status (practical view)
Kubernetes surface
Kubernetes documents a kubelet checkpoint API (POST /checkpoint/{namespace}/{pod}/{container}), with docs indicating v1.30 beta (enabled by default).
Operational implication:
- your first checkpoint interface is node-local kubelet, not a high-level "kubectl checkpoint" workflow,
- checkpoint archive handling (storage, encryption, access controls) becomes part of node security posture.
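Because the first interface is the node-local kubelet, a thin helper keeps invocations repeatable and auditable. A minimal sketch, assuming the default kubelet port (10250) and hypothetical client-cert paths; the endpoint shape follows the documented API:

```python
# Sketch of driving the node-local kubelet checkpoint endpoint.
# The URL shape matches the documented API; the port, cert paths,
# and curl flags below are illustrative assumptions only.

def checkpoint_url(node: str, namespace: str, pod: str,
                   container: str, port: int = 10250) -> str:
    """Build the kubelet checkpoint endpoint URL for one container."""
    return f"https://{node}:{port}/checkpoint/{namespace}/{pod}/{container}"

def checkpoint_curl(node: str, namespace: str, pod: str, container: str,
                    cert: str = "/etc/kubernetes/pki/admin.crt",
                    key: str = "/etc/kubernetes/pki/admin.key") -> list[str]:
    """Equivalent curl invocation as an argv list (hypothetical cert paths).

    --insecure skips kubelet server-cert verification; acceptable in a
    lab only, never in production.
    """
    return ["curl", "-X", "POST", "--cert", cert, "--key", key,
            "--insecure", checkpoint_url(node, namespace, pod, container)]
```

Wrapping the call like this also gives one choke point for the RBAC and audit controls required later in this playbook.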
Runtime surface
- CRI-O / containerd / runc / crun are the key path pieces around CRIU in container stacks.
- Capabilities and UX differ across versions and distro packaging.
- Support for “checkpoint image” restore flows is advancing, but do not assume every cluster/runtime combination is symmetric for export/import/restore.
Podman surface
Podman’s checkpoint tooling is operationally mature enough for lab-to-prod pilots:
- export/import checkpoint archives,
- checkpoint OCI image creation,
- pre-checkpoint + with-previous flows,
- stats output (freeze/memdump/memwrite timing).
That makes Podman a strong “proving ground” before wiring equivalent flows into cluster automation.
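For automation, the Podman flows above can be wrapped so every pilot run uses the same command shape. A sketch that builds the checkpoint argv (the flags shown are documented podman options; the wrapper itself is an assumption):

```python
def podman_checkpoint_cmd(container: str, export_path: str,
                          with_previous: bool = False,
                          print_stats: bool = True) -> list[str]:
    """Build a 'podman container checkpoint' argv for an export flow.

    --export writes the checkpoint archive, --with-previous chains a
    prior pre-checkpoint, --print-stats emits freeze/memdump timing.
    """
    cmd = ["podman", "container", "checkpoint", "--export", export_path]
    if with_previous:
        cmd.append("--with-previous")
    if print_stats:
        cmd.append("--print-stats")
    cmd.append(container)
    return cmd
```

Capturing the --print-stats output from each run feeds directly into the metrics set in section 6.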
3) The real production constraints (where projects fail)
3.1 Kernel/runtime capability mismatch
Most failed pilots are not architecture failures; they are compatibility failures:
- missing kernel features required by CRIU,
- seccomp/cgroup/namespace behavior differences,
- runtime flags not enabled,
- source/destination kernel drift.
Rule: treat checkpoint/restore like ABI-sensitive migration, not like generic image portability.
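One way to enforce that rule is a hard parity gate between source and destination before any restore is attempted. A minimal sketch, assuming you already inventory these fields per node (the field names are illustrative):

```python
def compatible(src: dict, dst: dict,
               keys=("kernel", "criu", "runtime", "cgroup_mode")) -> bool:
    """ABI-sensitive gate: require exact parity on every tracked field.

    Exact match is the strictest safe default; relax a field only after
    explicit testing (e.g. a known-compatible kernel version range).
    """
    return all(src.get(k) == dst.get(k) for k in keys)
```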
3.2 Network continuity assumptions
CRIU restore of TCP state depends on destination network conditions matching expected bindings/routes.
- If destination cannot satisfy original socket context, restore fails or degrades.
- External connections, in-flight state, and namespace boundary behavior need explicit handling policy.
3.3 Filesystem and volume truth
A checkpoint can restore process memory, but correctness still depends on filesystem reality:
- shared/distributed storage consistency,
- mounted volume parity,
- rootfs diff inclusion policy when exporting.
3.4 Security blast radius
Checkpoint archives are memory snapshots on disk. This means secrets and private material can be present in extractable form.
Treat checkpoint artifacts as high-sensitivity secrets, not ordinary logs.
4) Security baseline you should require
Before any broad rollout, enforce all of the following:
- Root-only artifact access on node checkpoint paths.
- Encryption at rest for checkpoint storage backends.
- TTL and deletion policy (short by default for forensic/ops snapshots).
- Transfer controls (signed, encrypted movement between nodes/registries).
- Strict RBAC for checkpoint API invocations.
- Audit trails: who triggered checkpoint, which workload, where artifact moved.
- Sandbox-only restore for forensic workflows unless explicitly approved.
If any one is missing, you are effectively creating a memory exfiltration pipeline.
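The TTL and deletion policy can start as a simple node-local pruner. A sketch assuming archives land as *.tar.gz files in one directory (the layout and extension are assumptions):

```python
import time
from pathlib import Path

def prune_checkpoints(root: Path, ttl_seconds: int) -> list[Path]:
    """Delete checkpoint archives older than the TTL; return removals.

    Default to short TTLs; forensic artifacts belong in a separate,
    more tightly controlled store (see the forensics track below).
    """
    now = time.time()
    removed = []
    for archive in root.glob("*.tar.gz"):
        if now - archive.stat().st_mtime > ttl_seconds:
            archive.unlink()
            removed.append(archive)
    return removed
```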
5) Rollout strategy that actually works
Phase 0 — Capability inventory
Build a compatibility matrix per node pool:
- kernel version/features,
- runtime (runc/crun/containerd/CRI-O) versions,
- CRIU versions,
- cgroup mode,
- security profiles (seccomp/AppArmor/SELinux).
Run “criu check --all”-style validation in CI as part of your node image pipeline.
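A CI gate for that validation can be as small as checking the exit status of criu check. The wrapper below is a sketch (criu check --all is a real CRIU subcommand; the gate policy around it is an assumption):

```python
import subprocess

def criu_precheck() -> bool:
    """CI gate: True only if 'criu check --all' reports full support.

    A missing criu binary counts as a failed gate, so node images
    shipped without CRIU never pass silently.
    """
    try:
        result = subprocess.run(["criu", "check", "--all"],
                                capture_output=True)
        return result.returncode == 0
    except FileNotFoundError:
        return False
```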
Phase 1 — Single workload class pilot
Pick one low-risk but stateful service and define explicit success criteria:
- checkpoint creation p95 latency,
- restore success rate,
- warmup time saved vs cold restart,
- post-restore correctness checks.
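Those criteria only bite if they are evaluated mechanically at the end of the pilot. A sketch of a pass/fail gate (the threshold values are illustrative, not recommendations):

```python
def pilot_passes(restores_ok: int, restores_total: int,
                 cold_start_s: float, restore_s: float,
                 min_success: float = 0.99,
                 min_saving: float = 0.5) -> bool:
    """Gate the pilot on restore success rate and warmup time saved.

    warmup_saved is the fraction of cold-start time avoided by
    restoring from a checkpoint instead.
    """
    success_rate = restores_ok / restores_total
    warmup_saved = 1.0 - restore_s / cold_start_s
    return success_rate >= min_success and warmup_saved >= min_saving
```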
Phase 2 — Controlled migration patterns
Start with planned maintenance migrations where rollback is easy.
Add iterative/pre-copy approaches when memory footprint is high and freeze budget is tight.
Phase 3 — Forensics track
Split operational and forensic tracks:
- operational checkpoint (availability goal),
- forensic checkpoint (investigation goal).
Do not merge artifact retention or access policies between these tracks.
6) Metrics to monitor (minimum set)
Track these per runtime version and workload class:
- checkpoint success/failure counts,
- freeze time distribution (p50/p95/p99),
- memdump/memwrite timing,
- archive size distribution,
- restore success + time-to-readiness,
- post-restore error rate / latency regression,
- rollback frequency,
- artifact age and deletion SLA compliance.
Without these metrics, you cannot distinguish “works in demo” from “safe in production.”
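For the timing distributions (freeze, memdump, memwrite), a nearest-rank quantile summary is enough to start. A minimal sketch:

```python
def quantiles(samples: list[float]) -> dict[str, float]:
    """Summarize a timing distribution at p50/p95/p99 (nearest rank)."""
    s = sorted(samples)

    def q(p: float) -> float:
        # Clamp the rank so small sample sets stay in bounds.
        return s[min(len(s) - 1, int(p * len(s)))]

    return {"p50": q(0.50), "p95": q(0.95), "p99": q(0.99)}
```

Bucket the results per runtime version and workload class, as the list above requires, so regressions are attributable to a specific node/runtime tuple.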
7) Failure modes and immediate responses
Restore fails on destination
Action: fall back to the cold-start path, preserve logs/artifacts, mark the node/runtime tuple incompatible.
Network state restore instability
Action: disable established-TCP restore mode for that workload class; use a reconnect-aware application strategy.
Checkpoint artifact growth runaway
Action: enforce quotas + TTL pruning + compression policy review.
Security concern around memory artifacts
Action: halt non-essential checkpoint creation, rotate secrets that may have been exposed, audit artifact access logs.
8) Practical decision framework
Use checkpoint/restore when all are true:
- restart cost is materially high,
- workload state is expensive to reconstruct,
- compatibility matrix is green,
- security controls for artifacts are in place,
- rollback to cold start is tested.
Avoid it when:
- workloads are already stateless/fast to restart,
- runtime/kernel heterogeneity is high and unmanaged,
- team has no artifact security process,
- success metrics are undefined.
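The framework reduces to an all-gates-must-hold check; encoding it keeps the go/no-go decision explicit and reviewable. A trivial sketch:

```python
def should_use_checkpoint(restart_cost_high: bool,
                          state_expensive: bool,
                          matrix_green: bool,
                          artifact_controls_in_place: bool,
                          rollback_tested: bool) -> bool:
    """All five conditions must hold; any single false means cold start."""
    return all([restart_cost_high, state_expensive, matrix_green,
                artifact_controls_in_place, rollback_tested])
```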
9) Bottom line
CRIU-based checkpoint/restore is becoming a practical operator tool, but only for teams that treat it as a controlled systems capability, not as a magic mobility switch.
The winning posture is simple:
- strict compatibility gates,
- explicit security ownership of checkpoint artifacts,
- phased rollout with hard metrics,
- always-tested cold-start rollback.
If you do those four, checkpoint/restore can move from “cool demo” to reliable production primitive.
References
- Kubernetes Docs — Kubelet Checkpoint API
  https://kubernetes.io/docs/reference/node/kubelet-checkpoint-api/
- Kubernetes Blog — Forensic container checkpointing in Kubernetes (historical context, alpha-era)
  https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/
- CRIU Wiki — Kubernetes integration notes
  https://criu.org/Kubernetes
- CRIU Wiki — Live migration
  https://criu.org/Live_migration
- CRIU Wiki — Iterative migration
  https://criu.org/Iterative_migration
- CRIU Wiki — Lazy migration
  https://criu.org/Lazy_migration
- Podman Docs — Checkpoint overview
  https://podman.io/docs/checkpoint
- Podman Manpage — podman container checkpoint
  https://docs.podman.io/en/latest/markdown/podman-container-checkpoint.1.html
- CRIU Man Page (example distro mirror) — operational options and caveats
  https://man.archlinux.org/man/criu.8.en