ZFS ARC usage spike causing kernel to invoke oom-killer

I think I should start following each post @winnielinnie makes and post this under them.

It just gets weirder and weirder.

We are talking about short lived peaks in memory usage, right?

One could … gasp … use swap for that.

The decision to run a server system without swap space on at least the same order of magnitude as its physical memory is something I will never understand.

Back in the days when servers had 256 megabytes of memory, we used to provision 4x the memory as swap. Then 2x; now I provision new servers with 1x or 0.5x the physical memory reserved for swap. But never without. My customers would roast me if their database was killed because Elasticsearch decided to grab 32 gigabytes by default.

Have swap space. Monitor usage. Alarm an operator if necessary.
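For what it's worth, the "monitor usage, alarm an operator" part can be sketched in a few lines. This is only a rough illustration, assuming a Linux host that exposes `SwapTotal` and `SwapFree` in `/proc/meminfo`; the function names and the 50% threshold are made up for the example, and a real setup would hand the result to whatever alerting you already run.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def swap_used_fraction(text):
    """Return the fraction of swap in use, or None if no swap is configured."""
    info = parse_meminfo(text)
    total = info.get("SwapTotal", 0)
    if total == 0:
        return None
    return (total - info.get("SwapFree", 0)) / total


def should_alert(text, threshold=0.5):
    """Alert when there is no swap at all or usage reaches the threshold.

    The 0.5 default is an arbitrary illustration, not a recommendation.
    """
    frac = swap_used_fraction(text)
    return frac is None or frac >= threshold
```

To use it, read `/proc/meminfo` (for example `open("/proc/meminfo").read()`), pass the text to `should_alert`, and page someone if it returns `True`.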

A system on the same kernel version (just with newer official patches) and the same ZFS version as the 25.10 release of TrueNAS, with swap totaling 1.5x the RAM (half of it zram, the other half a block device), GASP, had a compiler process killed just today. On a later attempt, the system instead slowed to a crawl. Unfortunately, neither of these was really about “short lived peaks”.

Restoring the max ARC size limit (at half the RAM), which I had removed to test this behavior, brought back adequate functionality, and the system was able to fly through the same build without a hitch.
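For anyone wanting to reproduce the "half the RAM" limit, here is a small sketch of the arithmetic. It assumes Linux with OpenZFS, where the cap is the `zfs_arc_max` module parameter (in bytes), settable at runtime through `/sys/module/zfs/parameters/zfs_arc_max` or at module load via a modprobe options line. The helper names are hypothetical, and this is not necessarily how the limit was set on the system described above; on an appliance like TrueNAS the supported way to persist it may differ.

```python
def mem_total_bytes(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from KiB to bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def half_ram_arc_max(meminfo_text):
    """Return half of physical memory in bytes, to be used as zfs_arc_max."""
    return mem_total_bytes(meminfo_text) // 2


def modprobe_option(arc_max_bytes):
    """Format the value as a modprobe options line for the zfs module."""
    return f"options zfs zfs_arc_max={arc_max_bytes}"
```

The resulting number can be written (as root) to `/sys/module/zfs/parameters/zfs_arc_max` to apply it without a reboot; writing `0` there goes back to the built-in default.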

Before dismissing such a case as “probably something obscure or not recommended that the user is doing”, please note that my system performance and storage capacity are not badly mismatched, that I stay well within the recommended configuration limits (for example, I don’t overfill the pool), and that the distro uses defaults for most packages, including ZFS.

That said, I suppose it’s also important to point out at least this PR, which made its way into the 2.4.0 RCs and the 2.3.4 release, and might get backported to the 2.2.x branch, so this will likely still be a problem with <2.3.0 (meaning 25.04 and earlier versions) unless patched by TrueNAS.

Granted, if this is not the only issue (as my experience above would suggest), then 2.3.4+ should at least have a decreased likelihood, statistically speaking, of encountering such a problem.

With all due respect to both ZFS and TrueNAS, this is far from the first time I have seen this behavior. I’ve been periodically testing for it (on both TrueNAS and other systems), out of the very same wish to remove the crude manual limitation, but each time the tests meet the same fate.