Linux DAMON Proactive Reclamation and Access-Aware Memory Operations Playbook

Date: 2026-04-10
Category: knowledge
Domain: software / linux / memory management

Why this matters

Linux memory tuning often swings between two extremes:

too reactive: you only notice trouble once reclaim, swap, or OOM behavior is already hurting tail latency;
too blunt: you apply global knobs that know almost nothing about which memory is actually hot or cold.

DAMON matters because it gives Linux a middle path:

low-overhead data access monitoring,
region-based estimates of hotness and age,
and policy hooks for access-aware actions.

The useful mental model is:

DAMON is Linux’s “watch memory access patterns first, act second” subsystem.

That makes it interesting for:

memory-dense hosts with bursty pressure,
caches whose cold pages linger too long,
systems where page reclaim starts too late and gets too violent,
NUMA or tiered-memory experiments,
and operators who want something more informed than generic background reclaim.

It is not a magic replacement for the VM subsystem. It is a way to inject better recency/frequency information into memory decisions.

1) Quick mental model

DAMON is short for Data Access MONitor; the kernel documentation describes the subsystem as “Data Access MONitoring and Access-aware System Operations.”

At a high level, it does four things:

Observes a target address space.
Approximates access behavior by tracking regions instead of every page all the time.
Maintains age + access-frequency signals for those regions.
Feeds that information into actions such as page reclamation, LRU reprioritization, or other DAMOS schemes.

The key design idea is that full page-by-page observation is too expensive at scale. So DAMON uses:

sampling intervals,
aggregation intervals,
bounded region counts,
and adaptive split/merge of regions

so overhead stays controlled while the picture remains useful.

That trade-off is the whole point.

2) What DAMON is good at

Strong fits

A) Proactive reclaim before the cliff

If a machine alternates between calm periods and sudden pressure spikes, classic reclaim can wait too long and then become noisy.

DAMON is good when you want to identify memory that has been cold for a meaningful time window and reclaim it gently before the cliff.

B) Making LRU behavior more reality-based

LRU is a useful approximation, but real workloads are messy. Pages that look inactive from one signal may still matter; pages that linger on active lists may be dead weight.

DAMON can improve this by identifying hot/cold regions and using that information to promote or demote pages on LRU lists more intelligently.

C) Observation-led memory tuning

Before changing reclaim strategy, swap behavior, cache sizes, or tiering policy, DAMON can help answer:

what is actually hot?
what stays cold for minutes rather than milliseconds?
how large is the real working set?
how stable is that working set over time?

D) Research and platform engineering

DAMON is attractive for platform teams because it is both:

an operator tool, and
a kernel subsystem for building more specialized access-aware policies.

Weak fits

A) Instant, per-page, zero-error truth

DAMON is approximate by design. If you need exact accounting of every access, this is the wrong tool.

B) Tiny systems with simple pressure profiles

If your box is small and its behavior is obvious, the complexity may not buy much. Traditional VM tuning may be enough.

C) Severe pressure emergencies

DAMON-based reclaim is explicitly not meant to replace regular page-granularity reclaim under real pressure. It is for selective, lightweight, proactive work.

D) One-knob miracle tuning

DAMON gives you better information and better policy hooks. It does not spare you from threshold tuning, validation, or workload-specific judgment.

3) The architecture that actually matters to operators

The official docs describe three layers:

Operations set layer
Core logic
Modules

The operator version is simpler:

A) Operations sets: what address space are you watching?

DAMON supports multiple monitoring targets, including:

vaddr: virtual address spaces of specific processes
fvaddr: fixed virtual address ranges
paddr: system physical memory

If you care about whole-host proactive reclaim behavior, paddr is usually the most relevant mental model. If you care about understanding one workload’s memory pattern, vaddr is often the starting point.

B) kdamond: who does the work?

Each DAMON context is executed by a kernel thread called a kdamond (the threads show up as kdamond.N in process listings). Think of it as the monitoring/control worker for one configured monitoring context.

C) DAMOS: what do you do with the signal?

DAMON-based Operation Schemes (DAMOS) let you define:

target access pattern,
quotas,
watermarks,
filters,
and an action.

This is where DAMON turns from “interesting observability” into “memory policy.”

4) How DAMON keeps overhead under control

This is the most important design point.

DAMON does not monitor every page equally forever. It uses region-based sampling.

Core knobs

sampling interval: how often access is sampled
aggregation interval: how long samples are accumulated into one snapshot
update interval: how often DAMON re-checks the target for changes such as new memory mappings, so the monitored ranges stay current
minimum / maximum number of regions: the bounds on monitoring granularity and overhead

Why regions matter

Adjacent pages are grouped into regions that are assumed to have similar behavior. DAMON then checks representative pages and adaptively splits or merges regions based on observed access differences.

That means:

more regions -> potentially better fidelity, more overhead
fewer regions -> cheaper monitoring, more approximation

The clever bit is that DAMON tries to ride that trade-off automatically within the bounds you set.
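
The merge half of that adaptation can be sketched as a toy. This is only an illustration of the idea, not the kernel's implementation: regions are modelled as plain (start, end, nr_accesses) tuples, and the `threshold` argument is a hypothetical stand-in for whatever counts as "similar enough."

```python
from typing import List, Tuple

Region = Tuple[int, int, int]  # (start, end, nr_accesses); end is exclusive


def merge_similar(regions: List[Region], threshold: int) -> List[Region]:
    """Merge contiguous regions whose access counts differ by at most threshold."""
    merged: List[Region] = []
    for start, end, nr in sorted(regions):
        if merged:
            p_start, p_end, p_nr = merged[-1]
            if p_end == start and abs(p_nr - nr) <= threshold:
                p_sz, sz = p_end - p_start, end - start
                # A size-weighted average keeps the merged count representative.
                avg = (p_nr * p_sz + nr * sz) // (p_sz + sz)
                merged[-1] = (p_start, end, avg)
                continue
        merged.append((start, end, nr))
    return merged


regions = [(0, 4096, 0), (4096, 8192, 1), (8192, 12288, 18), (12288, 16384, 19)]
print(merge_similar(regions, threshold=2))
```

Four regions collapse into two here: one cold, one hot. A tighter threshold keeps more regions (more fidelity, more bookkeeping), which is the same trade-off the region bounds control.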

Age is as important as frequency

Operators often focus only on “hotness,” but DAMON also tracks how long the current access pattern has persisted. That is valuable because:

short-term quiet pages are different from truly abandoned pages,
and low-frequency-but-stable memory can matter differently from newly cold memory.

In practice, cold age thresholds are usually one of the most important tuning levers.
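
A small conversion sketch may help when reasoning about age thresholds. It assumes, as the design docs describe, that a region's age is counted in aggregation intervals; the function names are illustrative.

```python
import math


def age_seconds(age_intervals: int, aggr_us: int) -> float:
    """Convert an age expressed in aggregation intervals to seconds."""
    return age_intervals * aggr_us / 1_000_000


def min_age_intervals(cold_seconds: float, aggr_us: int) -> int:
    """Smallest number of aggregation intervals covering cold_seconds."""
    return math.ceil(cold_seconds * 1_000_000 / aggr_us)


# With a 100 ms aggregation interval, "cold for 30 s" means 300 intervals.
print(min_age_intervals(30, 100_000))
print(age_seconds(1200, 100_000))
```

The practical point is that changing the aggregation interval silently changes what a raw age value means, so thresholds are safer to reason about in seconds.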

5) The best way to think about monitoring output

A DAMON snapshot is basically telling you:

where in the address space a region lives,
how big it is,
how frequently it has been accessed during the aggregation window,
and how long that pattern has remained stable.

That is enough to estimate:

working-set shape,
hot/cold separation quality,
recency distribution,
and whether your thresholds are absurd or sane.

A good practical reading of DAMON data is not:

“this specific page is definitely cold.”

It is:

“this region has been consistently cold enough, long enough, that reclaiming or deprioritizing it is a defensible bet.”

That distinction matters.
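
To make that reading concrete, here is a minimal sketch of summarizing one snapshot. The `RegionSnapshot` type and the sample values are hypothetical; real snapshots come from DAMON's own interfaces or from DAMO.

```python
from dataclasses import dataclass
from typing import Iterable

MIB = 1 << 20


@dataclass(frozen=True)
class RegionSnapshot:
    start: int        # first byte of the region
    end: int          # one past the last byte
    nr_accesses: int  # accesses seen in the aggregation window
    age: int          # aggregation intervals the pattern has persisted

    @property
    def size(self) -> int:
        return self.end - self.start


def working_set_bytes(regions: Iterable[RegionSnapshot], min_accesses: int = 1) -> int:
    """Bytes in regions that were accessed at least min_accesses times."""
    return sum(r.size for r in regions if r.nr_accesses >= min_accesses)


def cold_bytes(regions: Iterable[RegionSnapshot], min_age: int) -> int:
    """Bytes in regions with no observed access for at least min_age intervals."""
    return sum(r.size for r in regions if r.nr_accesses == 0 and r.age >= min_age)


snapshot = [
    RegionSnapshot(0, 64 * MIB, nr_accesses=15, age=40),
    RegionSnapshot(64 * MIB, 192 * MIB, nr_accesses=0, age=900),
    RegionSnapshot(192 * MIB, 256 * MIB, nr_accesses=0, age=3),
]
print(working_set_bytes(snapshot) // MIB, cold_bytes(snapshot, min_age=600) // MIB)
```

The last region is idle but young, so it counts as neither working set nor "cold enough to touch." That middle category is where most threshold mistakes happen.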

6) DAMON_RECLAIM: where proactive reclaim fits

DAMON_RECLAIM is the operator-facing static kernel module for proactive reclamation.

The docs are very explicit about its role:

it is meant for proactive and lightweight reclamation,
especially under light memory pressure,
and it is not trying to replace normal LRU-list-based reclaim when pressure gets serious.

That is exactly how to use it.

When it shines

DAMON_RECLAIM is attractive when:

free memory is drifting down but not yet catastrophic,
cold cache or anonymous memory is accumulating,
reclaim storms hurt latency when they happen all at once,
and you want a small continuous tax instead of occasional violent stalls.

Why the watermarks are important

DAMON_RECLAIM uses three free-memory watermarks (high, mid, low). It stays inactive while free memory is above the high watermark, becomes active once free memory falls below the mid watermark, and steps back again when free memory drops below the low watermark.

That last part is subtle and important.

The low watermark behavior means:

if the box is already too tight, let normal reclaim machinery take over instead of insisting that DAMON keep doing selective background cleanup.

That is good operational design.
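
The activation logic can be sketched as a small state function. This is a simplified model based on my reading of the watermark description in the design docs; the threshold values are illustrative placeholders, expressed as free memory per thousand.

```python
def next_active(free_rate: int, active: bool, high: int, mid: int, low: int) -> bool:
    """Return whether the scheme should be active for the given free-memory rate."""
    if free_rate > high or free_rate < low:
        # Plenty of memory, or too little: step aside either way.
        return False
    if free_rate >= mid and not active:
        # Between mid and high an inactive scheme stays inactive.
        return False
    return True


HIGH, MID, LOW = 500, 400, 200  # illustrative values, not recommendations
active = False
for rate in (600, 450, 350, 450, 150):
    active = next_active(rate, active, HIGH, MID, LOW)
    print(rate, active)
```

Note the band between mid and high: in this model the scheme keeps its current state there, which avoids flapping when free memory hovers near one threshold.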

Quotas are not optional in spirit

DAMON_RECLAIM provides:

time quota (quota_ms)
size quota (quota_sz)
quota reset interval
optional PSI-based memory-pressure targeting
optional feedback-based auto-tuning

The right mental model is:

quotas prevent background reclaim from becoming the new problem,
PSI targeting helps shape effort based on observed stall pressure,
and stats like nr_quota_exceeds tell you when your policy wants to do more than you are allowing.

If you enable DAMON_RECLAIM without thinking hard about quotas, you are skipping one of its main safety rails.
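
A quick way to sanity-check a quota setting is to turn it into ceilings you can compare against the host's budget. The sketch below is plain arithmetic under the assumption that each quota applies once per reset interval; the input values are examples, not recommendations.

```python
def quota_ceilings(quota_ms: int, quota_sz: int, reset_interval_ms: int) -> dict:
    """Upper bounds implied by a time quota and a size quota per reset interval."""
    windows_per_sec = 1000 / reset_interval_ms
    return {
        # Fraction of one CPU the scheme may spend applying its action.
        "cpu_fraction": quota_ms / reset_interval_ms,
        # Most bytes the scheme may act on per second.
        "bytes_per_sec": quota_sz * windows_per_sec,
    }


limits = quota_ceilings(quota_ms=10, quota_sz=128 * 2**20, reset_interval_ms=1000)
print(limits)
```

If the bytes-per-second ceiling exceeds what your swap or storage device can absorb without hurting foreground I/O, the quota is too generous for a first rollout.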

7) DAMON_LRU_SORT: fixing LRU ordering before reclaim happens

DAMON_LRU_SORT is different from proactive reclaim. It does not primarily reclaim pages itself. It tries to make the LRU lists better reflect real access behavior by:

prioritizing hot regions, and
deprioritizing cold regions.

This is interesting because reclaim quality depends heavily on whether the kernel’s page ordering is trustworthy.

If the access signal feeding the LRU lists is stale or misleading, reclaim decisions suffer. DAMON_LRU_SORT tries to improve the ordering upstream.

When I’d favor it

I would look at DAMON_LRU_SORT when:

reclaim is happening, but it seems to pick victims badly,
active/inactive balance looks wrong for the workload,
or you want to improve reclaim readiness without immediately forcing more pageout.

Useful knobs

The kernel docs highlight knobs such as:

hot_thres_access_freq
cold_min_age
active_mem_bp
filter_young_pages
quota and watermark settings
optional auto-tuned monitoring intervals

The important operational lesson is that DAMON_LRU_SORT is a steering mechanism, not a brute-force broom.
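
How the two main thresholds divide regions can be sketched as follows. The sketch assumes hot_thres_access_freq is expressed in permil of the maximum observable access count and cold_min_age in microseconds, which is how I read the module documentation; the `classify` helper itself is hypothetical.

```python
def classify(nr_accesses: int, max_nr_accesses: int, age_us: int,
             hot_thres_permil: int, cold_min_age_us: int) -> str:
    """Label a region as 'hot', 'cold', or 'neutral' under the given thresholds."""
    freq_permil = nr_accesses * 1000 // max_nr_accesses
    if freq_permil >= hot_thres_permil:
        return "hot"
    if nr_accesses == 0 and age_us >= cold_min_age_us:
        return "cold"
    return "neutral"


# Illustrative thresholds: hot at >= 50% access frequency, cold after 120 s idle.
HOT, COLD = 500, 120_000_000
print(classify(12, 20, 0, HOT, COLD))
print(classify(0, 20, 180_000_000, HOT, COLD))
print(classify(0, 20, 5_000_000, HOT, COLD))
```

Most memory should land in the neutral band and be left alone, which is what makes this steering rather than sweeping.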

8) Observation-first rollout beats direct action-first rollout

This is the biggest practical advice.

Do not start by enabling automatic reclaim on a production-critical box just because DAMON exists.

Start in this order:

Stage 0 — Verify kernel capabilities

Confirm the system actually has the needed pieces:

CONFIG_DAMON, plus CONFIG_DAMON_VADDR and/or CONFIG_DAMON_PADDR for the operations sets you need
CONFIG_DAMON_SYSFS if you want sysfs control
and CONFIG_DAMON_RECLAIM / CONFIG_DAMON_LRU_SORT if you plan to use those static modules

Also confirm which operations sets are actually available on the running kernel.
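
A minimal sketch of the config check, assuming you can get the kernel config as text (commonly /boot/config-<release> or a decompressed /proc/config.gz, depending on the distribution). The sample text below is made up for illustration.

```python
import re

WANTED = (
    "CONFIG_DAMON", "CONFIG_DAMON_VADDR", "CONFIG_DAMON_PADDR",
    "CONFIG_DAMON_SYSFS", "CONFIG_DAMON_RECLAIM", "CONFIG_DAMON_LRU_SORT",
)


def damon_options(config_text: str) -> dict:
    """Map each wanted option to its value, or 'n' when it is not set."""
    found = dict(re.findall(r"^(CONFIG_DAMON\w*)=(\w+)$", config_text, re.M))
    return {name: found.get(name, "n") for name in WANTED}


sample = """\
CONFIG_DAMON=y
CONFIG_DAMON_VADDR=y
CONFIG_DAMON_PADDR=y
CONFIG_DAMON_SYSFS=y
CONFIG_DAMON_RECLAIM=y
# CONFIG_DAMON_LRU_SORT is not set
"""
for name, value in damon_options(sample).items():
    print(f"{name}={value}")
```

On a real host you would pass in the text of the running kernel's config file instead of `sample`.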

Stage 1 — Observe only

Use DAMON or DAMO to inspect the workload without acting on it. You want to learn:

working-set size percentiles,
hot/cold region stability,
and whether the default aggregation interval is wildly wrong for your workload.

The docs explicitly note that the default 100 ms aggregation interval is too short in many cases, especially on large systems. That is a very useful warning.

Stage 2 — Tune monitoring quality vs overhead

Adjust:

aggregation interval,
sampling interval,
min/max regions,
and target scope.

A good rule from the design docs is to set sampling interval roughly proportional to aggregation interval; the default ratio is around 1/20, and that is still the recommended baseline.
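
That ratio rule is simple enough to encode. The helper below is a hypothetical convenience, not part of any DAMON tooling; it also reports how many samples fit in one aggregation window, which is the ceiling on any region's access count.

```python
def suggest_intervals(aggr_us: int, ratio: int = 20) -> dict:
    """Derive a sampling interval from an aggregation interval at 1/ratio."""
    sample_us = max(1, aggr_us // ratio)
    return {
        "sample_us": sample_us,
        "aggr_us": aggr_us,
        "max_nr_accesses": aggr_us // sample_us,
    }


print(suggest_intervals(100_000))    # the 100 ms default
print(suggest_intervals(2_000_000))  # a 2 s window for a larger system
```

Keeping the ratio fixed means a longer aggregation window lowers sampling overhead while leaving the resolution of the access-count signal unchanged.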

Stage 3 — Define “cold enough to touch”

Before any action, define your operating meaning of cold:

not accessed for 30s?
120s?
10 minutes?
only file-backed pages?
never anonymous pages?

This is workload-dependent and nontrivial.

Stage 4 — Add tiny quotas

When you first test DAMON_RECLAIM or DAMOS pageout-like actions, set small quotas and conservative watermarks. The goal is to learn behavior, not win benchmarks on day one.

Stage 5 — Compare system outcomes, not just DAMON stats

Judge success using:

PSI memory pressure,
tail latency,
swapin/swapout behavior,
major fault rates,
refault behavior,
and reclaim noise during bursts.

The kernel feature is only useful if host behavior improves.
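
For the PSI part of that comparison, a small parser is enough to start collecting before/after numbers. It assumes the usual /proc/pressure/memory layout of a "some" line and a "full" line with avg10, avg60, avg300, and total fields; the sample text is invented.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text into {'some': {...}, 'full': {...}} with float values."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out


sample = (
    "some avg10=0.12 avg60=0.05 avg300=0.01 total=123456\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=7890\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["total"])
```

On a live host, feed it the contents of /proc/pressure/memory at a fixed cadence and compare the distributions with and without the DAMON policy enabled.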

9) DAMON sysfs is powerful, but the model matters more than the path names

The sysfs interface is rich and configurable, but it is easy to get lost in the tree.

The shape that matters is:

create a kdamond
create a context
choose operations (vaddr, paddr, etc.)
set monitoring attributes
define targets
optionally define schemes (DAMOS)
start / commit / refresh state

Useful pieces to remember:

sample_us, aggr_us, update_us
nr_regions/min, nr_regions/max
intervals_goal/* for interval auto-tuning
schemes/* for action, access pattern, quotas, watermarks, filters, and stats
refresh_ms for periodic updates of tuned parameters and scheme stats
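
The sequence above can be written down as an ordered list of sysfs writes. The sketch below only builds and prints that plan; it does not touch sysfs. Paths follow the layout in the kernel usage documentation as I understand it, and the interval and region values are illustrative placeholders.

```python
from pathlib import PurePosixPath

ADMIN = PurePosixPath("/sys/kernel/mm/damon/admin/kdamonds")


def vaddr_monitoring_plan(pid: int, sample_us: int = 5000, aggr_us: int = 100_000,
                          update_us: int = 1_000_000, min_regions: int = 10,
                          max_regions: int = 1000) -> list:
    """Return (path, value) writes, in order, to monitor one process via vaddr."""
    kd = ADMIN / "0"
    ctx = kd / "contexts" / "0"
    attrs = ctx / "monitoring_attrs"
    return [
        (ADMIN / "nr_kdamonds", "1"),
        (kd / "contexts" / "nr_contexts", "1"),
        (ctx / "operations", "vaddr"),
        (attrs / "intervals" / "sample_us", str(sample_us)),
        (attrs / "intervals" / "aggr_us", str(aggr_us)),
        (attrs / "intervals" / "update_us", str(update_us)),
        (attrs / "nr_regions" / "min", str(min_regions)),
        (attrs / "nr_regions" / "max", str(max_regions)),
        (ctx / "targets" / "nr_targets", "1"),
        (ctx / "targets" / "0" / "pid_target", str(pid)),
        (kd / "state", "on"),  # start monitoring last, after everything is set
    ]


plan = vaddr_monitoring_plan(pid=1234)
for path, value in plan:
    print(f"echo {value} > {path}")
```

Order matters: the directories for contexts and targets only appear after the corresponding nr_* file is written, and the kdamond is switched on last.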

If you need a human-friendly entry point, the official docs point to DAMO as the default userspace tool layered on top of sysfs.

10) Auto-tuning exists, but it is not permission to stop thinking

DAMON can auto-tune monitoring intervals. The design docs describe this in terms of targeting a desired ratio of observed access events to the theoretical maximum, using knobs such as:

access_bp
aggrs
min_sample_us
max_sample_us

This is useful because interval tuning is otherwise tedious.

But the important operational framing is:

auto-tuning helps find a better observation resolution,
it does not discover your business SLOs,
it does not know the cost of false-cold classification,
and it does not validate whether your action policy is safe.

Treat it as a monitoring-parameter assistant, not an autopilot.
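
To build intuition for what an access_bp goal is measuring, here is a deliberately simplified, single-snapshot estimate of the "observed versus theoretical maximum" ratio in basis points. The size-weighted formula is my assumption for illustration; the kernel's actual accounting spans several aggregations and may differ in detail.

```python
from typing import Iterable, Tuple


def observed_access_bp(regions: Iterable[Tuple[int, int]], max_nr_accesses: int) -> int:
    """Size-weighted observed accesses over the maximum possible, in basis points.

    regions: (size_bytes, nr_accesses) pairs from one snapshot.
    """
    regions = list(regions)
    total = sum(size for size, _ in regions)
    if total == 0 or max_nr_accesses == 0:
        return 0
    observed = sum(size * nr for size, nr in regions)
    return observed * 10_000 // (total * max_nr_accesses)


# 10% of memory is maximally hot, the rest is idle.
print(observed_access_bp([(100, 20), (900, 0)], max_nr_accesses=20))
```

A very low ratio suggests the window is too short to see much access at all, which is the situation interval auto-tuning tries to correct by lengthening the intervals.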

11) The most important caveats

A) Approximation error is inherent

DAMON relies on region assumptions and sampled access signals. That is fine, but it means bad thresholds can misclassify memory.

This is why “observe first” matters so much.

B) Accessed-bit interaction is real

The design docs note that DAMON uses PTE accessed-bit-based checks for virtual and physical address monitoring. That can interfere with other subsystems that care about the same signal.

The docs specifically say:

DAMON does not avoid disturbing idle page tracking,
so handling that interference is on the sysadmin,
while conflict with reclaim logic is handled using PG_idle and PG_young, similar to idle page tracking.

Translation: know what else on the box relies on page-idle-style signals before you assume all observability sources remain independent.

C) The default intervals may be wrong by a lot

This is not a minor footnote. A 100 ms aggregation interval can be far too short for many real systems, which makes the output look flatter and less informative than it should.

D) Overenthusiastic reclaim can backfire

If you reclaim memory that is merely temporarily cold, you may buy lower free-memory anxiety at the cost of:

more refaults,
more major faults,
more swap churn,
and worse tail latency.

A reclaim policy that “looks active” is not necessarily helping.
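
One way to catch this early is to diff the relevant /proc/vmstat counters across a test window. The counter names below exist on recent kernels but vary by version (older kernels expose a single workingset_refault counter, for example), so treat the list as an assumption to verify on your host. The sample text is invented.

```python
KEYS = ("workingset_refault_anon", "workingset_refault_file",
        "pgmajfault", "pswpin", "pswpout")


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines into a dict of integers."""
    pairs = (line.split() for line in text.splitlines() if line.strip())
    return {name: int(value) for name, value in pairs}


def churn_delta(before: dict, after: dict) -> dict:
    """Counter increases between two samples; missing counters count as 0."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in KEYS}


t0 = parse_vmstat("pgmajfault 100\npswpin 10\npswpout 40\nworkingset_refault_file 500\n")
t1 = parse_vmstat("pgmajfault 180\npswpin 90\npswpout 45\nworkingset_refault_file 2500\n")
print(churn_delta(t0, t1))
```

In this made-up window, refaults and swap-ins rise sharply while swap-outs barely move, which is the signature of reclaiming memory that was only temporarily cold.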

E) Whole-host tuning can hide workload asymmetry

On mixed-use machines, a single physical-memory policy can accidentally optimize for one workload while penalizing another. Per-process observation with vaddr can reveal this before paddr-level policy obscures it.

12) When to use DAMON_RECLAIM vs DAMON_LRU_SORT

Use DAMON_RECLAIM when your problem is mainly:

too much cold memory persisting,
reclaim arriving too late and too violently,
or wanting controlled, proactive cleanup under moderate pressure.

Use DAMON_LRU_SORT when your problem is mainly:

reclaim victim selection seems poor,
active/inactive list ordering does not reflect reality,
or you want to improve the quality of later reclaim decisions.

Use observation-only DAMON first when:

you do not yet know whether the issue is reclaim timing, reclaim quality, working-set size, or workload-local memory behavior.

These are complementary, not mutually exclusive, but I would almost always start with observation and only then decide which action layer fits.

13) Practical rollout checklist

Before production use, answer these:

Monitoring design

Are you observing vaddr, fvaddr, or paddr?
Is aggregation interval long enough to capture meaningful variation?
Are region bounds sensible for system size?
Do the snapshots distinguish hot from cold, or is everything mush?

Policy design

What exact age threshold defines “cold”?
Are anonymous pages included or skipped?
Which watermark band should activate the policy?
What quotas cap CPU and I/O side effects?

Validation design

Did PSI improve?
Did refaults increase?
Did swap traffic spike?
Did tail latency improve during burst pressure?
Did nr_quota_exceeds reveal an over-constrained or underpowered policy?

Safety design

What is the rollback path?
How quickly can you disable the module or set enabled=N?
Are you monitoring host-level regressions outside the memory subsystem itself?
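
The rollback path is worth scripting before you need it. The sketch below writes N to an `enabled` parameter file and reads it back; it is demonstrated against a temporary directory so it runs anywhere. On a real host the directory would be the module's parameter directory (for DAMON_RECLAIM, /sys/module/damon_reclaim/parameters) and the write needs root.

```python
import tempfile
from pathlib import Path


def disable_module(params_dir: Path) -> str:
    """Write N to the module's 'enabled' parameter and return the value read back."""
    enabled = params_dir / "enabled"
    enabled.write_text("N")
    return enabled.read_text().strip()


# Demo against a stand-in directory instead of the real sysfs path.
with tempfile.TemporaryDirectory() as tmp:
    params = Path(tmp)
    (params / "enabled").write_text("Y\n")
    print(disable_module(params))
```

Reading the value back matters: a rollback that silently failed is worse than one that was never attempted.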

14) Where I would reach for DAMON first

I would reach for DAMON first when I have a Linux host where memory pain is pattern-shaped rather than simply “we need more RAM.”

Examples:

cache-heavy services with long-lived cold tails,
dev or CI hosts that quietly accumulate stale working sets,
systems where reclaim storms are rare but brutal,
and experiments with tiered or NUMA-aware placement that need better access signals.

I would not reach for it first when the real problem is obviously:

gross underprovisioning,
memory leaks,
pathological allocator behavior,
or application-local cache policy bugs.

In those cases, DAMON may still help diagnose, but it is not the first fix.

Bottom line

DAMON is one of the more interesting Linux memory-management tools because it is neither hand-wavy observability nor blind automation. It gives you a structured way to ask:

what is actually hot,
what is cold enough to touch,
and how aggressively should the kernel act on that information?

Its power comes from three things used together:

approximate but useful access monitoring,
bounded-overhead policy control,
and operator-visible safeguards like quotas and watermarks.

The right mindset is not “turn on magic reclaim.” It is:

measure access shape, define cold conservatively, then act with quotas.

Used that way, DAMON is not just a curiosity from kernel docs. It becomes a real playbook tool for reducing reclaim chaos and making Linux memory behavior more intentional.

References

Linux kernel documentation: DAMON overview
https://docs.kernel.org/admin-guide/mm/damon/
Linux kernel documentation: Getting started with DAMON / DAMO
https://docs.kernel.org/admin-guide/mm/damon/start.html
Linux kernel documentation: Detailed DAMON usage / sysfs interface
https://docs.kernel.org/admin-guide/mm/damon/usage.html
Linux kernel documentation: DAMON design
https://docs.kernel.org/mm/damon/design.html
Linux kernel documentation: DAMON-based Reclamation
https://docs.kernel.org/admin-guide/mm/damon/reclaim.html
Linux kernel documentation: DAMON-based LRU-lists Sorting
https://docs.kernel.org/admin-guide/mm/damon/lru_sort.html