I very recently swapped from CORE to SCALE along with a hardware change.
I will start by noting this is a fresh install, and I am running no apps.
I am seeing strange behavior within SCALE that causes the system to become unresponsive or, in some instances, crash completely.
When backing up a VM in Proxmox to an NFS share backed by an HDD dataset (the source being a fast NVMe stripe on the same machine), one stage of the backup process reliably triggers out-of-memory messages, ultimately causing middleware, nfsd, Samba, etc. to be killed and the system to become unresponsive.
kernel: nfsd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER), order=0, oom_score_adj=0
This message is triggered by a large number of processes, not just nfsd.
My working theory, after following a few threads about changes in the 04.1.1 release, is that ARC behavior has changed, but I cannot say this wasn't an issue in prior versions of SCALE as I did not run them.
I set zfs_arc_max to 50% of my RAM for testing, which does sort-of resolve the issue. With this parameter set, SSH stays up (barely responsive) during the problematic part of the backup, and the backup will usually complete. However, monitoring free memory still shows a huge spike of at least 13 GB of RAM usage, up to the maximum available in my system (32 GB), presumably from NFS. The GUI still crashes here, but most services recover shortly after, and the backup does complete successfully in Proxmox.
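For reference, this is roughly how I arrived at the value. The sysfs path is the standard Linux OpenZFS module parameter location; on SCALE I'd normally apply it as a post-init command rather than editing it live, so treat this as a sketch:

```shell
# Compute 50% of 32 GiB in bytes for zfs_arc_max
ARC_MAX=$(( 32 * 1024 * 1024 * 1024 / 2 ))
echo "$ARC_MAX"   # 17179869184

# Apply at runtime via the standard Linux OpenZFS module parameter
# (on TrueNAS SCALE this would normally go in a post-init command):
# echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```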
The same backup action caused no issues on CORE with half of this system's available RAM. All networking is 10 GbE, the RAM is ECC, and the pool is small, so 32 GB should suffice in this instance.
Something about the backup process from Proxmox, specifically the end where it reports "backup is sparse" (which I presume is where it runs compression), is where RAM usage usually takes off. I'm guessing ZFS cannot free memory fast enough, resulting in memory pressure?
From proxmox:
INFO: 100% (200.0 GiB of 200.0 GiB) in 3m 21s, read: 1.3 GiB/s, write: 8.0 KiB/s
INFO: backup is sparse: 181.15 GiB (90%) total zero data
-- crash happens here without the zfs_arc_max change --
I can replicate this reliably.
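For what it's worth, this is the rough sampling I've been running while reproducing it, to see whether the spike is ARC or something else. It assumes a standard Linux /proc; the arcstats file is the usual OpenZFS kstat location and only exists when ZFS is loaded:

```shell
# One sample of available memory and current ARC size;
# run under `watch -n 2` (or in a loop) to poll during the backup
awk '/MemAvailable/ {print "MemAvailable(kB):", $2}' /proc/meminfo

# ARC size in bytes from the OpenZFS kstats (skipped if ZFS is not loaded)
[ -r /proc/spl/kstat/zfs/arcstats ] && \
    awk '$1 == "size" {print "ARC_size(bytes):", $3}' /proc/spl/kstat/zfs/arcstats || true
```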
Whilst I'm a Wintel engineer by profession, I am new to Linux troubleshooting, so please go easy on me if I'm missing something obvious.
Any thoughts on next steps are appreciated.