Please file a bug ticket with a debug attached so we can see the state of the system when NFS starts getting killed. The common thread in these cases seems to be NFS, so we'll need to make sure there isn't a memory leak somewhere in that stack.
I am seeing the same behavior with SMB shares. There, smbd is usually the process that uses a lot of memory and is the one getting killed. I switched from SMB shares to NFS shares, thinking this might be a leak in smbd, but now I am seeing the same behavior as described above, this time with asynio_loop getting killed.
Tickets, please. We need to see what is consuming RAM on the system. I already had one report like this from a user with 8 GB RAM, 20 Apps, and a VM… no wonder all the OOMs.
^^ Same here with NFS. TrueNAS SCALE running as a VM under Proxmox with SATA in passthrough, 16 GB RAM allocated. No apps, just NFS and SMB shares.
After a reboot, usually within a day or so, there will be OOM events in the log. In fact, I can't even log into the local console of the VM, but SSH still works.
Can you provide a link to how to generate a debug log?
Here's a log capture from the GUI log screen.
Ticket submitted.
Got it!
Looks like we were able to review the debug and confirm it is the same issue we are chasing in another ticket. Thanks, that gives us additional evidence toward understanding the root cause. We'll keep chasing it.
Are you also using ZFS encryption on datasets or the entire pool?
Thank you! Let me know if I can help.
There is no ZFS encryption.
Will do! Right now it's about finding the right breadcrumbs that lead us to a reproduction case. We've not been able to trigger this internally, even though our performance team has been thrashing systems in the lab. This usually means there is some combination of factors in play.
If you can think of any other user-specific configuration knobs that have been deployed, please do pass them along!
@kris I have a daily Syncthing job on a Windows box. This is set up as push-only to the Syncthing LXC container. TrueNAS NFS shares are mapped into the Syncthing container using bind mounts (roughly as sketched below).
The job usually starts at 16:31, but TrueNAS shows some kind of kernel trace/crash around 16:34. This is not consistent, however. The pastebin above is from 20:44. I don't recall any large transfers taking place at that time other than media streaming from one of the datasets or a download to a dataset.
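For anyone unfamiliar with that kind of mapping, a bind mount of a host path into a plain LXC container config looks roughly like this (paths here are hypothetical, and the exact mechanism depends on how the container is managed; it assumes the TrueNAS NFS export is already mounted on the container host):
lxc.mount.entry = /mnt/truenas/media data/media none bind,create=dir 0 0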
I am using pool encryption. The pool is SATA SSDs passed into the VM via PCIe passthrough; the entire controller is passed into the VM. There is a second pool with spinning rust plus Optane drives for log and metadata (two separate mirrors).
The only odd thing I can think of is that I have scrub tasks configured, but they seem to never run. At least, the dashboard shows the scrubs never ran on either pool.
While I am thinking about some less radical solution, could somebody who can reliably reproduce the issue try setting the zfs_arc_shrinker_limit parameter of ZFS to zero with:
echo "0" > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
Or at least set it dramatically higher than the extremely low default of 10000 pages. It should make ZFS completely obedient to all memory-pressure requests from the kernel, which I expect will fix the OOM issues. Unfortunately, there are known cases where some of those requests are insane, which you may see in the form of dramatic ARC size reductions, but I hope it should not be as bad as before we disabled the MGLRU code in 24.04.1.
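For anyone trying this, a quick sanity check that the change took effect (same sysfs path as above) is:
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit
It should print 0 after the change. Note that the value does not survive a reboot, so it needs to be reapplied, for example from a post-init task.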
I've rebooted and then modified the parameter as per above. I will recheck in about 6 hours, after it's had some uptime and the syncthing job has run.
I added it as a post-init script; I'll see what happens.
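A minimal sketch of such a post-init script (assuming the same sysfs path as above; in the TrueNAS SCALE UI it can typically be added under System Settings > Advanced > Init/Shutdown Scripts as a post-init command):
# post-init: relax the ARC shrinker limit so ZFS yields memory under kernel pressure
echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit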
I manually triggered a few backups and /var/log/messages stayed beautifully silent. This setting seems to make things work. I'll leave it alone overnight and check again.
I need to give it at least a few days to see if this change triggers anything. But so far, this afternoon's syncthing job ran without issues. If it's still like this in 3 days, I'll be satisfied that the parameter change was successful.
I will continue testing tonight, but initial results look like this is the magic bullet!
I was able to get through 5-6 Proxmox backups without any errors being thrown in messages, and the middleware remained fully responsive, whereas previously I couldn't get through 1 or 2.
This showed up in the log just a few minutes ago. However, there are no crashes or kernel traces similar to the ones above. The local console is still accessible. Nothing was happening on the NAS at this timestamp; it was just idle.
Jun 26 22:53:25 nas1 kernel: loop0: detected capacity change from 0 to 2575752
Jun 26 22:53:25 nas1 kernel: squashfs: version 4.0 (2009/01/31) Phillip Lougher