ZFS_ARC_MAX issue - out-of-memory errors in kernel with Scale 24.04.1.1

I very recently swapped from Core to Scale along with a hardware change.

I will start by noting this is a fresh install, and I am running no apps.
I am seeing strange behavior within Scale that causes the system to become unresponsive, or in some instances crash completely.

When performing a backup of a VM in Proxmox to an NFS share backed by an HDD dataset, with the source being a fast NVMe stripe on the same machine, there is a stage of the backup process that reliably triggers out-of-memory messages, ultimately causing middleware, nfsd, samba, etc. to be killed and the system to become unresponsive.

kernel: nfsd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0

This message is triggered by a large number of processes, not just nfsd.

My working theory, after following a few threads relating to changes in the 24.04.1.1 release, is that ARC behavior has changed, but I cannot say this wasn't an issue in prior versions of Scale as I did not run them.

I set zfs_arc_max to 50% of my RAM capacity for testing, which does sort of resolve the issue. With this parameter set, SSH remains up and barely responsive during the problematic part of the backup, and the backups will usually complete. However, monitoring free memory still shows a huge spike of at least 13 GB of RAM usage, up to the maximum available in my system (32 GB), presumably by NFS. The GUI still crashes here, but most things recover shortly afterwards, and the backup does complete successfully in Proxmox.
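For anyone wanting to try the same workaround, this is roughly how I applied the limit at runtime (a sketch: 17179869184 is 16 GiB, i.e. half of my 32 GB, and the echo does not survive a reboot unless re-applied, for example from a post-init script):

# check the current ARC ceiling in bytes (0 means the built-in default)
cat /sys/module/zfs/parameters/zfs_arc_max

# cap ARC at 16 GiB for testing; needs root and takes effect immediately
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max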

The same backup action caused no issues on Core with half of this system's available RAM. All networking is 10 GbE. The RAM is ECC and the pool is small, so 32 GB should suffice in this instance.

Something about the backup process from Proxmox, specifically the end part where it states “backup is sparse” (which I presume is where it runs compression), is where the RAM usage usually takes off, and I am guessing ZFS cannot clear the space fast enough, resulting in memory pressure?

From Proxmox:
INFO: 100% (200.0 GiB of 200.0 GiB) in 3m 21s, read: 1.3 GiB/s, write: 8.0 KiB/s
INFO: backup is sparse: 181.15 GiB (90%) total zero data
-- crash happens here without the zfs_arc_max change --

I can replicate this reliably.

Whilst I’m a Wintel engineer by profession, I am new to Linux troubleshooting, so please go easy if I’m missing something.

Any thoughts on next steps are appreciated.


Can you get us a bug ticket and debug file from the system when it starts to become unstable? With no apps, I would have to assume something is consuming memory and not releasing it (ZFS or NFS is the likely culprit). You can also kick off an htop and watch it in real time to see what is holding the memory.
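If plain htop is hard to catch in time, a couple of ways to watch the ARC specifically (a rough sketch; the arcstat utility ships with OpenZFS on Scale, and the one-second interval is just an example):

# one-second samples of ARC size, target and hit rates
arcstat 1

# or poll the raw counters: size = current, c = target, c_max = ceiling
watch -n 1 "grep -E '^(size|c|c_max) ' /proc/spl/kstat/zfs/arcstats"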


As far as the forums go, it would be useful to see the complete hardware list for the Scale system.


Will do, I’ll be free to replicate the issue and pull debug files first thing tomorrow.

Cheers.

Wondering if this is related to what I’m seeing. I’m getting similar symptoms, with the GUI and SSH going unresponsive. On my last crash I saw my ARC grow to almost my entire memory footprint.

Hard to say. This htop screenshot doesn’t show a system that has run out of memory yet; it’s barely half used. ARC is, by design, supposed to use as much RAM as is available. Unused RAM is wasted RAM. But it’s also supposed to give it back and shrink as other services reserve more.


How many VMs are you running on the server, and how much memory have you committed to them?

I have 8 VMs with 64 GB committed to them.

I think ARC using the whole of the available system memory is a good thing providing it works correctly.

If you check /var/log/messages, are you seeing the OOM killer invoked at all, killing these processes due to memory pressure?
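Something along these lines should surface any kills if they are there (a sketch; dmesg covers the case where the log has already rotated):

# search the system log for OOM killer activity
grep -iE "oom-killer|out of memory" /var/log/messages

# or check the kernel ring buffer with readable timestamps
dmesg -T | grep -i oom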

Hi kris.

I spent some time replicating the issue tonight and got a bug ticket submitted.
I had to submit it post-reboot, as replicating the issue now causes the system to crash to a GRUB prompt immediately, so I cannot obtain it mid-issue. I hope it is still of some use.

I’ve also replicated this behavior on a friend’s system running the same version of Scale. We have similar setups.

If you need anything else let me know.
I obtained an htop capture maybe 15 seconds before the full system crash / SSH disconnection:
I think ARC was shrinking somewhat, but SSH wasn’t able to update quickly.


Thanks, got the ticket here. We’ll take a look at it!

I believe I have the same issue.

Clean install, absolutely no additional apps or VMs; just the NFS service.

After a few hours, I will get OOM errors for asyncio_loop and nfsd when doing heavy writes to the pool through the NFS share.

The exact same setup worked perfectly on this system before; the issues only started after replacing the boot disk and doing a fresh install of TrueNAS Scale.

AMD Ryzen 5 3400G
48 GB RAM
60 disks of 16 TB each in a shelf + 1 x MIRROR | 2 wide | 465.76 GiB

Hope this helps.


We believe (and have some reports that also seem to confirm) that this may be fixed in the upcoming 24.04.2. If you want to give a nightly a shot to confirm in your case, you can grab this update file and install it via the UI as a manual update.

https://update.sys.truenas.net/scale/TrueNAS-SCALE-Dragonfish-Nightlies/TrueNAS-SCALE-24.04.2-MASTER-20240614-013916.update

I think I should submit a bug report as well; it appears setting the max doesn’t always hold. I had mine at ~32 GB, and a few days later it ignored what I had set both on the command line and in an init script, growing to 42+ GB, which caused an issue with launching VMs. Look up my recent post for the OOM issues.
My short research has me flushing the entire cache and letting it slowly rebuild, but it needs to do as I say and not exceed what I set (almost an Apple/Windows “we know best” scenario).
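For reference, this is roughly what that flush-and-check looks like on my end (a sketch; echo 3 to drop_caches asks the kernel to reclaim page cache and slab, which on ZFS-on-Linux also prompts the ARC to shrink, and the arcstats fields are the standard OpenZFS counters):

# ask the kernel to drop caches and shrink reclaimable memory, including ARC
echo 3 > /proc/sys/vm/drop_caches

# compare the configured ceiling (c_max) against the current ARC size
awk '/^c_max/ {print "c_max:", $3} /^size/ {print "size: ", $3}' /proc/spl/kstat/zfs/arcstats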

Fast turnaround if so! Thanks, I will see if I can get this file tested shortly.