ZFS ARC usage spike causing kernel to invoke oom-killer

LarsR · October 21, 2025, 5:45pm

I think I should start following each post @winnielinnie makes and post this under them

winnielinnie · November 3, 2025, 4:51pm

pmh · November 3, 2025, 5:47pm

We are talking about short lived peaks in memory usage, right?

One could … gasp … use swap for that.

The decision to run a server system without swap space at least on the order of magnitude of its physical memory is something I will never understand.

Back in the days when servers had 256 Megabytes of memory we used to have 4x mem as swap. Then 2x, now I provision new servers with 1x or .5x the physical memory reserved for swap. But never without. My customers would roast me if their database was killed because Elasticsearch decides to grab 32 Gigabytes by default.

Have swap space. Monitor usage. Alarm an operator if necessary.
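As a rough illustration of the "monitor usage, alarm an operator" step, here is a minimal Python sketch that reads swap figures in the Linux `/proc/meminfo` format. The function names and the 50% threshold are made up for the example; a real setup would more likely hook into whatever monitoring stack is already in place.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (KiB for sized fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Fraction of swap in use, or 0.0 when no swap is configured."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - meminfo.get("SwapFree", 0)) / total


def should_alert(meminfo, threshold=0.5):
    """True when swap usage reaches the (hypothetical) alert threshold."""
    return swap_used_fraction(meminfo) >= threshold


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file; only meaningful on Linux."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

On a Linux host, `should_alert(read_meminfo())` could be run from cron or a systemd timer, with the result wired to mail or a pager.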

JkktBkkt · November 3, 2025, 8:54pm

A system on the same kernel version (just with newer official patches) and the same ZFS version as the TrueNAS 25.10 release, with swap totaling 1.5x the RAM size (half of it zram, the other half a block device), GASP had a compiler process killed just today. In a later attempt, the system instead slowed to a crawl. Unfortunately, neither of these was really about “short lived peaks”.

Restoring the max ARC size limit (at half the RAM), which I had removed for the purposes of testing this behavior, restored adequate functionality, and the system was able to fly through the same build without a hitch.
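For anyone wanting to reproduce the "half the RAM" limit, a minimal sketch of the arithmetic follows. It assumes a Linux OpenZFS system where the module parameter is exposed at `/sys/module/zfs/parameters/zfs_arc_max`. The helper names are made up for illustration, and a value written this way does not survive a reboot.

```python
ARC_MAX_PARAM = "/sys/module/zfs/parameters/zfs_arc_max"  # assumed path (Linux OpenZFS)


def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and convert KiB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def arc_max_for(meminfo_text, fraction=0.5):
    """ARC cap in bytes as a fraction of physical memory (0.5 = half the RAM)."""
    return int(mem_total_bytes(meminfo_text) * fraction)


def apply_arc_max(limit_bytes, path=ARC_MAX_PARAM):
    """Write the limit to the module parameter; requires root and a loaded zfs module."""
    with open(path, "w") as f:
        f.write(str(limit_bytes))
```

A typical use would be `apply_arc_max(arc_max_for(open("/proc/meminfo").read()))`, run as root.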

Before dismissing such a case as “probably something obscure or not recommended that the user is doing”, please note that my system performance and storage capacity are not badly mismatched, that I try to stay well within the suggested configuration limits (for example, I avoid overfilling the pool), and that the distro uses defaults for most packages, including zfs.

That said, I suppose it’s also important to point out at least this PR, which made its way into the 2.4.0 RCs and the 2.3.4 release, and might get backported into the 2.2.x branch, so this will likely still be a problem with <2.3.0 (meaning 25.04 and earlier versions) unless patched by TrueNAS.

github.com/openzfs/zfs

enforce arc_dnode_limit

master ← shodanshok:dnodelimit

opened 03:40PM - 15 Jul 25 UTC

shodanshok

+78 -11

Fix: https://github.com/openzfs/zfs/issues/17487

Linux kernel shrinker in the context of null/root memcg does not scan dentry and inode caches added by a task running in non-root memcg. For ZFS this means that dnode cache routinely overflows, evicting valuable meta/data and putting additional memory pressure on the system. This patch restores zfs_prune_aliases as fallback when the kernel shrinker does nothing, enabling zfs to actually free dnodes. Moreover, it (indirectly) calls arc_evict when dnode_size > dnode_limit.

### Motivation and Context
See above.

### Description
See above.

### How Has This Been Tested?
The patch received only minimal testing inside a virtual machine, clearly increasing performance by letting ZFS to not evict valuable metadata. It works for me, but additional review is needed.

### Types of changes
- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [X] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Quality assurance (non-breaking change which makes the code more robust against bugs)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [ ] Documentation (a change to man pages or other documentation)

### Checklist:
- [X] My code follows the OpenZFS [code style requirements](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#coding-conventions).
- [ ] I have updated the documentation accordingly.
- [X] I have read the [**contributing** document](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md).
- [ ] I have added [tests](https://github.com/openzfs/zfs/tree/master/tests) to cover my changes.
- [ ] I have run the ZFS Test Suite with this change applied.
- [X] All commit messages are properly formatted and contain [`Signed-off-by`](https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#signed-off-by).

Granted, if this is not the only issue (as my experience above would suggest), then 2.3.4+ should at least have a decreased likelihood, statistically speaking, to encounter such a problem.
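To see whether a given system is hitting the dnode overflow the PR describes, one option is to compare the dnode cache size against its limit in the ARC kstats. The sketch below assumes the Linux kstat file `/proc/spl/kstat/zfs/arcstats` with its usual two header lines and `name type data` rows, and assumes the fields are named `dnode_size` and `arc_dnode_limit`. Both the layout and the names may differ between OpenZFS versions, so treat this as a starting point only.

```python
ARCSTATS_PATH = "/proc/spl/kstat/zfs/arcstats"  # assumed location on Linux


def parse_arcstats(text):
    """Parse kstat text: skip the two header lines, keep 'name type data' rows."""
    stats = {}
    for line in text.splitlines()[2:]:
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def dnode_overflow(stats):
    """Bytes by which dnode_size exceeds arc_dnode_limit (0 when within the limit)."""
    return max(0, stats["dnode_size"] - stats["arc_dnode_limit"])


def read_arcstats(path=ARCSTATS_PATH):
    """Read and parse the live kstat file; requires the zfs module to be loaded."""
    with open(path) as f:
        return parse_arcstats(f.read())
```

A persistently non-zero `dnode_overflow(read_arcstats())` would be consistent with the behavior described in the PR, though it does not prove that it is the cause of any particular OOM kill.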

With all due respect to both zfs and truenas, this is far from the first time I have seen this behavior. I’ve been periodically testing for it (on both truenas and other systems), out of the very same wish to remove the crude manual limitation, but each time the tests meet the same fate.

Jip-Hop · January 10, 2026, 6:13pm

Not sure what to do now personally… I’m running into similar OOM errors myself since I started using the immich docker stack on TrueNAS SCALE 25.04.2.6 together with iGPU HW transcoding (Intel Quick Sync). The system has 64GB RAM and the Physical memory available graph in the TrueNAS web GUI never dips below 30GB…

2026 Jan 10 17:45:28 nas Process 384394 (immich) of user X dumped core.

The immich container enters a crash loop (OOM kill) every 3 minutes in my case. Looks like exactly the issue described in this thread.
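One thing worth checking in a case like this: the log line quoted above is a core-dump report, which is not the same as a kernel oom-killer message. The Python sketch below tells the two apart. The OOM pattern is based on the kernel's usual "Out of memory: Killed process …" and "Memory cgroup out of memory: Killed process …" wording, which is an assumption here and can vary between kernel versions. The `classify` helper is made up for illustration.

```python
import re

# Assumed kernel wording for global and cgroup OOM kills.
OOM_KILL = re.compile(
    r"(?:Memory cgroup out|Out) of memory: Killed process (\d+) \(([^)]+)\)"
)
# Matches the core-dump line format quoted in the post above.
CORE_DUMP = re.compile(r"Process (\d+) \(([^)]+)\) of user \S+ dumped core")


def classify(line):
    """Return (kind, pid, name) for a recognized log line, else None."""
    m = OOM_KILL.search(line)
    if m:
        return ("oom-kill", int(m.group(1)), m.group(2))
    m = CORE_DUMP.search(line)
    if m:
        return ("core-dump", int(m.group(1)), m.group(2))
    return None
```

Feeding the system journal through `classify` line by line would show whether a crash loop comes with actual oom-kill entries or only with core dumps.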

I’m considering adding a swap file but according to @kris this may cause more system instability than it solves?

ChapterSevenSeeds · January 10, 2026, 6:27pm

Would you be willing to try TrueNAS Scale 25.10? I remember reading that 25.10 tweaks how memory is managed. I haven’t had any RAM issues since upgrading to that version (but that could also be attributed to my increase to 128 GB RAM).

Jip-Hop · January 10, 2026, 7:09pm

I may have to try that. I disabled HW transcoding (Intel Quick Sync) in immich but that didn’t fix it. I also separately tried limiting the zfs_arc_max to 32GB and 14GB and adding resource limits/reservations in my docker compose file to no avail. I tried limits as low as 4G and as high as 16G.

Jip-Hop · January 10, 2026, 10:51pm

Unfortunately upgrading to 25.10.1 didn’t fix the OOM killing of immich for me. Tried with and without resource constraints in the docker compose stack and with HW transcoding (Intel Quick Sync) enabled.

Jip-Hop · January 11, 2026, 2:41am

I was probably on the wrong track and it seems my issue is immich specific and not related to memory usage.
