Over the last two weeks, there have been 4 occurrences of the ZFS ARC growing past the value in zfs_arc_max, each of which caused the kernel to invoke oom-killer and start killing my Docker containers. That apparently isn't enough, and I eventually lose networking entirely, leaving me needing to reboot the machine.
Here is a screenshot of the ZFS ARC size graph from when this occurred today. All four instances have followed a similar pattern.
At the time that both of these spikes occurred, /var/log/syslog shows that the kernel started invoking oom-killer.
I will note that I tried adding a pre-init script under System > Advanced > Init/Shutdown Scripts that does the following: echo 8589934592 >> /sys/module/zfs/parameters/zfs_arc_max. That value (8 GiB) is reflected in the screenshot of the first graph. However, the change does not seem to have helped (though to be fair, the first 3 occurrences were before I had this pre-init script in place).
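In case it's useful to anyone following along, here is roughly what I run after boot to confirm the value actually stuck (the sysfs and kstat paths below are the standard ZFS-on-Linux locations, but double-check on your build):

# Configured ceiling (0 means "use the built-in default")
cat /sys/module/zfs/parameters/zfs_arc_max

# What ZFS actually reports as its target max (c_max) vs. the current ARC size
awk '$1 == "c_max" || $1 == "size" {print $1, $3}' /proc/spl/kstat/zfs/arcstats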
Does anyone have any clues as to what is happening? Could the ZFS ARC size growth be a symptom of some other process hogging memory?
I am running TrueNAS Scale 25.04.2.4.
I have no VMs.
I am running on bare metal, not virtualized.
And yes, I have only 32 GB of RAM.
Some other info that might be useful:
CPU: Intel Core i5-10600K
I have 3 pools:
8x 20 TB in RAIDZ2
1x 1 TB as a single disk
1x 100ish GB as a single disk
Docker containers include Plex with HW transcoding via Intel QuickSync, Immich, most of the arr suite, qBittorrent with a VPN, Soulseek with a VPN, and some other miscellaneous containers.
Let me know if there is anything else I can provide.
So I didn't have a limit on how much memory qBittorrent could use until today, when I set the limit to 8 GB. Plex rarely consumes more than 1 GB. But are you saying that a runaway Docker container can cause the ARC size to grow uncontrollably? Or is the growing ARC size typical behavior?
The only thing that changed leading up to the first occurrence is the addition of Lidarr. Other than that, my main storage pool did recently grow past 60-ish terabytes.
My system does not have a dedicated GPU, just an Intel iGPU. The iGPU is evidently enabled in the BIOS, since Plex is successfully using it for hardware transcoding.
FWIW, the ARC size spiked again about 15 minutes ago, but the new limit on qBittorrent likely kept the kernel from killing processes. However, the spikes in the ARC size are still new to me.
The iGPU can cause issues because its "VRAM" is shared with your actual RAM; it does not have its own dedicated VRAM (like an Nvidia or AMD card has). Combined with an ARC that likes to grow (for good reason), it's possible to hit OOM.
It's very possible it's due to Plex and/or qBittorrent, regardless of whether they run inside a container or as software installed directly on the host.
I'm still on the fence about whether ZFS plus an unrestricted ARC has really been "solved" on Linux. On FreeBSD, the ARC can come within 1 GB of total system RAM, yet still gracefully shrink and grow when met with memory pressure from other applications and processes.
But, as for the spikes, is there anything I can do there? I’m still baffled why ZFS doesn’t respect the ARC max that I have set. Are there more things I can tune to keep it from growing to the point that the kernel steps in?
You shouldn't have to arbitrarily set any limits on your ARC. Proper memory management should handle changes in memory pressure dynamically and gracefully. Ideally, your ARC should be able to come within 1 GB of your RAM capacity (if it wants to and is able to) without risking OOM.
What might be happening is that your ARC climbs (as it should), but does not dynamically and gracefully back off when your memory comes under added pressure (Plex, qBittorrent).
When Plex transcodes, it’s very possible that it’s directly competing with your ARC, since the iGPU needs to use your actual system RAM as its “VRAM” during the transcode. A discrete video card (Nvidia, AMD) has its own dedicated VRAM, without needing to eat into your system RAM.
Please check the About page of the qBittorrent WebUI to confirm the libtorrent version. If it is a 2.x build, please switch to a 1.x build, then continue monitoring memory usage.
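If you'd rather check from a shell, the WebUI API exposes the same build info; something like this should work (host and port are placeholders, and you may need to log in first unless localhost auth bypass is enabled):

# Returns JSON including the libtorrent version, e.g. {"libtorrent": "...", ...}
curl 'http://localhost:8080/api/v2/app/buildInfo'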
I have confirmed that libtorrent 2.x can cause severe OOM and kernel panics under certain unknown conditions.
Since the 2.x series, libtorrent has defaulted to memory-mapped file I/O, handing disk caching off to the Linux kernel. libtorrent 2.x has consistently suffered from issues like out-of-memory (OOM) errors, but releases come only about once a year, leaving many issues unfixed.
Based on my experience with over 30 kernel crashes in the past year, if a NAS running qBittorrent experiences an OOM or random crash, there’s a 70% chance that libtorrent 2.0 is to blame.
If that is the case, wouldn’t I have started seeing this much sooner than 2 weeks ago? I am running the latest tag of this image, and the last time it was pushed was 3 months ago.
I'm still lost as to why it's happening now. Looking at the historical data for the ARC size, it has peaked several times at 22-ish GB over the last several weeks, yet October 5th was the first time the kernel started invoking oom-killer. Could it be that my pools finally reached a tipping point they had only been teetering on before the 5th?
If ARC is just going to climb above the max whenever it feels like it, then should I even bother?
Do you suspect a memory problem? I feel like dying RAM would manifest itself in a much more erratic and unpredictable fashion.
Right when the ARC size starts to climb, I completely lose SMB and the networking speed of my qBittorrent VPN container drops to a crawl. The system is still usable and can still reach the internet, but both SMB and qBittorrent are mostly dead until the sharp drop off a few minutes later.
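For what it's worth, next time it spikes I plan to watch it live from a shell instead of relying on the graphs afterward, roughly like this (arcstat ships with OpenZFS and the dmesg flags are standard util-linux ones, but adjust for your version):

# Sample ARC size and target every 5 seconds
arcstat 5

# In a second shell, follow the kernel log for oom-killer activity with readable timestamps
dmesg -wT | grep -iE 'oom|out of memory'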
So guys, I did a few things. Just yesterday I went and bought a 64 GB RAM kit and installed it in my system (double the original 32 GB). I then removed the 8 GB cap I had set for zfs_arc_max. The issue seems to have gone away. No more uncontrollable ARC usage spikes stressing out the kernel.
Just from watching the numbers since last night, it does seem like my ARC was much more memory-hungry than I had originally thought. Back when I only had 32 GB of RAM and wasn't limiting how much RAM qBittorrent was allowed to consume (it averaged 10-ish GB), the ARC averaged around 12 GB. Now it seems to average around a noisy 36 GB (a smooth 48 GB when qBittorrent, with its 8 GB allowance, isn't running).
So, to summarize, it just seems like my main storage pool grew to the point where my RAM was no longer sufficient to satisfy ARC. Makes me wonder about the dude who said he successfully ran 700 TB on 64 GB of RAM for years…
So guys, even with 64 GB of RAM, the kernel still felt it necessary to invoke oom-killer. The screenshot shows the timeline from October 16th; /var/log/syslog shows that the kernel killed qBittorrent right at 11:15. The timing of oom-killer and the sharp dropoff in the ARC size seems too close to be coincidental. I will note that I updated the qBittorrent container to allow it 8 GB of RAM a bit after installing the 64 GB. Perhaps that was a mistake?
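For reference, the 8 GB cap is just a standard Docker/cgroup memory limit on the container; in plain-Docker terms it's roughly equivalent to the following (the container name is a placeholder for whatever yours is called):

# Hard memory cap on a running container; processes inside it get OOM-killed if the limit is exceeded
docker update --memory 8g --memory-swap 8g qbittorrent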
So, I bit the bullet and doubled my RAM yet again. Now it typically hovers around 100 GB, but I have not yet seen the kernel invoke oom-killer. Here is some output from arc_summary.
So yeah, I'm still kind of at a loss. At this point, my best guess is that qBittorrent is stressing out the ARC. I will say that I still have OS caching enabled in qBittorrent; I tried turning it off a few times, but the performance hit was too severe to tolerate. Maybe that was also a mistake? I need to do a bit more research on tuning qBittorrent to make it friendlier to the ARC.