Please file a bug ticket with a debug attached so we can see the state of the system when NFS starts getting killed. The common thread in these cases seems to be NFS, so we'll need to make sure there isn't a memory leak somewhere in that stack.
I am seeing the same behavior with SMB shares. There, smbd is usually the process that uses a lot of memory and is the one getting killed. I switched from SMB shares to NFS shares, thinking this might be a leak in smbd, but now I am seeing the same behavior as described above, this time with asynio_loop getting killed.
Tickets, please. We need to see what is consuming RAM on the system. I already had one report like this from a user with 8 GB RAM, 20 Apps, and a VM… no wonder all the OOMs.
^^ Same here with NFS. TrueNAS SCALE running as a VM under Proxmox with SATA in passthrough, 16 GB RAM allocated. No apps, just NFS and SMB shares.
After a reboot, usually within a day or so, there will be OOM events in the log. In fact, I can't even log into the local console of the VM, but SSH still works.
Can you provide a link to how to generate a debug log?
Here's a log capture from the GUI log screen.
Ticket submitted.
Got it!
Looks like we were able to review the debug and confirm it is the same issue we are chasing in another ticket. Thanks, that gives us additional evidence toward understanding the root cause. We'll keep chasing it.
Are you also using ZFS encryption on datasets or the entire pool?
Thank you! Let me know if I can help.
There is no ZFS encryption.
Will do! Right now it's about finding the right breadcrumbs that lead us to a reproduction case. We've not been able to trigger this internally, even though our performance team has been thrashing systems in the lab. This usually means there is some combination of factors in play.
If you can think of any other user-specific configuration knobs that have been deployed, please do pass them along!
@kris I have a daily Syncthing job on a Windows box. This is set up as push-only to the Syncthing LXC container. TrueNAS NFS shares are mapped into the Syncthing container using bind mounts (roughly as sketched below).
The job usually starts at 16:31, but TrueNAS shows some kind of kernel trace/crash around 16:34. This is not consistent, however. The pastebin above is from 20:44. I don't recall any large transfers taking place at that time other than media streaming from one of the datasets or a download to a dataset.
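For anyone unfamiliar with that kind of mapping, a bind mount of a host path into a plain LXC container config looks roughly like this (paths here are hypothetical, and the exact mechanism depends on how the container is managed; it assumes the TrueNAS NFS export is already mounted on the container host):
lxc.mount.entry = /mnt/truenas/media data/media none bind,create=dir 0 0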
I am using pool encryption. The pool is SATA SSDs passed into the VM via PCIe passthrough; the entire controller is passed into the VM. There is a second pool with spinning rust plus Optane drives for log and metadata (two separate mirrors).
The only odd thing I can think of is that I have scrub tasks configured, but they seem to never run. At least, the dashboard shows the scrubs never ran on either pool.
While I am thinking about some less radical solution, could somebody who can reliably reproduce the issue try setting the zfs_arc_shrinker_limit parameter of ZFS to zero with:
echo "0" > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
Or at least set it dramatically higher than the extremely low default of 10000 pages. It should make ZFS completely obedient to all memory-pressure requests from the kernel, which I expect will fix the OOM issues. Unfortunately, there are known cases where some of those requests are insane, which you may see in the form of dramatic ARC size reductions, but I hope it should not be as bad as before we disabled the MGLRU code in 24.04.1.
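For anyone trying this, a quick sanity check that the change took effect (same sysfs path as above) is:
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit
It should print 0 after the change. Note that the value does not survive a reboot, so it needs to be reapplied, for example from a post-init task.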
I've rebooted and then modified the parameter as per above. I will recheck in about 6 hours, after it's had some uptime and the syncthing job has run.
I added it as a post-init script; I'll see what happens.
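A minimal sketch of such a post-init script (assuming the same sysfs path as above; in the TrueNAS SCALE UI it can typically be added under System Settings > Advanced > Init/Shutdown Scripts as a post-init command):
# post-init: relax the ARC shrinker limit so ZFS yields memory under kernel pressure
echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit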
I manually triggered a few backups and /var/log/messages stayed beautifully silent. This setting seems to make things work. I'll leave it alone overnight and check again.
I need to give it at least a few days to see if this change triggers anything. But so far, this afternoon's syncthing job ran without issues. If it's still like this in 3 days, I'll be satisfied that the parameter change was successful.
I will continue testing tonight, but initial results look like this is the magic bullet!
I was able to get through 5-6 Proxmox backups without any errors being thrown in messages, and the middleware remained fully responsive, whereas previously I couldn't get through 1 or 2.
This showed up in the log just a few minutes ago. However, there are no crashes or kernel traces similar to the ones above. The local console is still accessible. Nothing was happening on the NAS at this timestamp; it was just idle.
Jun 26 22:53:25 nas1 kernel: loop0: detected capacity change from 0 to 2575752
Jun 26 22:53:25 nas1 kernel: squashfs: version 4.0 (2009/01/31) Phillip Lougher