ZFS ARC usage spike causing kernel to invoke oom-killer

I think I should start following each post @winnielinnie makes and post this under them

2 Likes

It just gets weirder and weirder.

We are talking about short lived peaks in memory usage, right?

One could … gasp … use swap for that.

The decision to run a server system without swap space on at least the same order of magnitude as physical memory is something I will never understand.

Back in the days when servers had 256 megabytes of memory, we used to provision 4x the memory as swap. Then 2x; now I provision new servers with 1x or 0.5x the physical memory reserved for swap. But never without. My customers would roast me if their database were killed because Elasticsearch decided to grab 32 gigabytes by default.

Have swap space. Monitor usage. Alarm an operator if necessary.
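For anyone who wants to follow that advice, a minimal sketch of what it looks like on a generic Linux host (the 8 GiB size is a placeholder, and this assumes a non-ZFS filesystem such as ext4; TrueNAS manages its own swap layout, so treat this as illustrative only):

```
# Create and enable a swap file (size is a placeholder; adjust to your box).
fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Monitor it: sustained non-zero si/so columns in vmstat mean the host is actively paging.
free -h
vmstat 5
```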

3 Likes

A system on the same kernel version (just with newer official patches) and the same ZFS version as the 25.10 release of TrueNAS, with 1.5x the total RAM in swap (half of it zram, the other half a block device), GASP, had a compiler process get killed just today. In a later attempt, the system instead slowed to a crawl. Unfortunately, neither of these was really about “short-lived peaks”.

Restoring the max ARC size limit (at half the RAM), which I had removed for the purpose of testing this behavior, brought back adequate functionality, and the system was able to fly through the same build without a hitch.
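For reference, this is roughly how such a cap is applied on a stock OpenZFS-on-Linux system (the 32 GiB value is just an example for a 64 GiB box; TrueNAS manages this tunable through its own settings, so this is generic OpenZFS usage rather than anything TrueNAS-specific):

```
# Cap ARC at 32 GiB (value in bytes; roughly half of installed RAM).
echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max

# Confirm the new target (c_max) and the current ARC size.
awk '$1 == "c_max" || $1 == "size" {printf "%s = %.1f GiB\n", $1, $3 / 2^30}' \
    /proc/spl/kstat/zfs/arcstats

# Make it persistent across reboots on a stock distro.
echo 'options zfs zfs_arc_max=34359738368' > /etc/modprobe.d/zfs.conf
```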

Before dismissing such a case as “probably something obscure or not recommended that the user is doing”, please note that I don’t have a terrible mismatch between system performance and storage capacity, I stay away from the known configuration limits (such as overfilling the pool capacity), and the distro uses defaults for most packages, including ZFS.

That said, I suppose it’s also important to point out at least this PR, which made its way into the 2.4.0 RCs and the 2.3.4 release, and might get backported into the 2.2.x branch, so this will likely still be a problem with <2.3.4 (meaning 25.04 and earlier versions) unless patched by TrueNAS.

Granted, if this is not the only issue (as my experience above would suggest), then 2.3.4+ should at least have a decreased likelihood, statistically speaking, of encountering such a problem.

With all due respect to both ZFS and TrueNAS, this is far from the first time I’ve seen this behavior. I’ve been periodically testing for it (on both TrueNAS and other systems), with the same wish to remove the crude manual limitation, but each time the tests meet the same fate.

1 Like

Not sure what to do now personally… I’ve been running into similar OOM errors myself since I started using the immich Docker stack on TrueNAS SCALE 25.04.2.6 together with iGPU HW transcoding (Intel Quick Sync). The system has 64 GB RAM, and the “Physical memory available” graph in the TrueNAS web GUI never dips below 30 GB…

2026 Jan 10 17:45:28 nas Process 384394 (immich) of user X dumped core.

The immich container enters a crash loop (OOM kill) every 3 minutes in my case. It looks exactly like the issue described in this thread.
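In case it helps anyone debugging the same thing, here is roughly how to tell a kernel or cgroup OOM kill apart from immich simply crashing on its own (assuming systemd-coredump produced that “dumped core” line; the container name below is a placeholder for whatever the stack actually calls it):

```
# Kernel or cgroup OOM events show up in the kernel log.
journalctl -k -b | grep -iE 'out of memory|oom-kill|killed process'

# If Docker enforced a memory limit, it records the OOM kill on the container.
docker inspect --format '{{.State.OOMKilled}} exit={{.State.ExitCode}}' immich_server

# A plain crash (like the "dumped core" line above) shows up here without any OOM message.
coredumpctl list | tail
```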

I’m considering adding a swap file but according to @kris this may cause more system instability than it solves?

Would you be willing to try TrueNAS Scale 25.10? I remember reading that 25.10 tweaks how memory is managed. I haven’t had any RAM issues since upgrading to that version (but that could also be attributed to my increase to 128 GB RAM).

I may have to try that. I disabled HW transcoding (Intel Quick Sync) in immich, but that didn’t fix it. I also separately tried limiting zfs_arc_max to 32 GB and 14 GB, and adding resource limits/reservations in my Docker Compose file, to no avail. I tried limits as low as 4G and as high as 16G.
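For reference, a rough runtime equivalent of those compose limits, plus a sanity check that the ARC cap actually took effect (the container name and the 8 GiB figure are placeholders):

```
# Cap the container at 8 GiB and watch how close it actually gets before it dies.
docker update --memory 8g --memory-swap 8g immich_server
docker stats --no-stream immich_server

# Double-check that the zfs_arc_max override is in force.
awk '$1 == "c_max" {printf "ARC cap = %.1f GiB\n", $3 / 2^30}' /proc/spl/kstat/zfs/arcstats
```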

Unfortunately, upgrading to 25.10.1 didn’t fix the OOM killing of immich for me. I tried with and without resource constraints in the Docker Compose stack, and with HW transcoding (Intel Quick Sync) enabled.

I was probably on the wrong track; it seems my issue is immich-specific and not related to memory usage.