Yeah, that can be ignored. Squashfs is used as part of the update system in the background; that message is normal.
I had another OOM kill 12 seconds after midnight. The only thing that was scheduled at the time was a scrub task for both pools. It looks like neither scrub task completed, as the dashboard still shows “never” for the last run. There should not have been any other pressure on the system at that time.
Scrub can be relatively memory-intensive, especially if several pools are scrubbed at the same time. You should try to spread them out. And IIRC that memory is not counted towards ARC, so it may create a spike of system-wide memory pressure, and the ARC is supposed to shrink on a signal from the OS if needed. I’d need to know more about it; a TrueNAS debug would be good.
PS: Whether the scrub actually ran, I would check directly via zpool status.
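Something like this should show the last scan/scrub line for every pool at once (the pool name “tank” below is just a placeholder):

# One line per pool: the pool name followed by its last scan result
zpool status | grep -E 'pool:|scan:'

# Or check a single pool, e.g. a hypothetical pool called tank
zpool status tank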
zpool status shows that a scan only ever ran on the boot pool. I have spread the scrub tasks apart by an hour for now and will see how that goes. It seems odd though, as the system has 64GB and is doing nothing else but sit there at that time. I downloaded a debug, happy to upload that somewhere if it helps.
It probably only ran for a few minutes then, depending on the device. I am following your journey closely, as I am not upgrading until these OOM issues can be fixed, since they disabled swap space.
I downloaded a debug, happy to upload that somewhere if it helps.
I don’t know if you can send it to me here, but you could send me a link. Or you may open a ticket, where debugs are uploaded in a way visible only to iX developers.
Still no OOM crashes here. Just some messages like below.
Jun 27 11:38:41 nas1 systemd-journald[617]: Data hash table of /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal has a fill level at 75.0 (8533 of 11377 items, 6553600 file size, 768 bytes per hash table item), suggesting rotation.
Jun 27 11:38:41 nas1 systemd-journald[617]: /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal: Journal header limits reached or header out-of-date, rotating.
Jun 27 11:38:41 nas1 systemd-journald[617]: Failed to set ACL on /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/user-1000.journal, ignoring: Operation not supported
Jun 28 00:38:18 nas1 kernel: loop0: detected capacity change from 0 to 2575752
@das1996 Do you have ARC statistics to look at for unexpected/excessive drops? To reduce possible ones, I am thinking of also setting zfs_arc_pc_percent to 200 or 300.
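If somebody wants to experiment before a release carries it, here is a rough sketch of changing the module parameter at runtime (it does not persist across reboots; on SCALE you would typically wire it up as a post-init command):

# Read the current value of the tunable
cat /sys/module/zfs/parameters/zfs_arc_pc_percent

# Try 300 at runtime; this is lost on the next reboot
echo 300 > /sys/module/zfs/parameters/zfs_arc_pc_percent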
What specifically do you want me to look at? I never really looked at ZFS stats before, as my use case doesn’t benefit all that much from cache (usually lots of sustained writing or reading).
Any erratic ARC size behavior: random drops leaving plenty of free RAM, etc.
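If it helps, this is roughly all that is needed to watch it (arcstat ships with OpenZFS, so nothing extra should need to be installed):

# Current ARC size, target and limits straight from the kernel stats
grep -E '^(size|c|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats

# Or watch it live, one sample every 10 seconds
arcstat 10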
The dip to 0 between 6/26 and 6/27 was when it was last rebooted, following zfs_arc_shrinker_limit being set to 0. Does this graph tell you anything useful?
I see a number of deep dips. If memory at those times was used by some apps, VMs, or something else useful, that is just fine. If it was free, that may be wrong, especially if you haven’t deleted anything massive that could cause legitimate ARC evictions. If it was used by the page cache, then setting zfs_arc_pc_percent=300 should reduce those, and I think that should be a reasonable default. Unless somebody has other ideas, I am going to include that in the next 24.04.2 release.
I thought it was a percent, i.e., 0 to 100?
It is a percent, but values above 100 are officially legal there. It just means that the ARC will not agree to shrink to less than 3x the file-backed portion of the page cache. And if something needs more memory, the kernel will have to shrink the page cache first, which for file-backed pages should be possible even without swap.
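A back-of-the-envelope illustration of that ratio; this only shows the relationship described above using /proc/meminfo, not the exact bookkeeping ZFS does internally:

# Sum the file-backed page cache and print the implied ARC floor at 300%
awk '/^Active\(file\)|^Inactive\(file\)/ {kb += $2}
     END {printf "file-backed page cache: %d MiB, implied ARC floor at 300%%: %d MiB\n", kb/1024, 3*kb/1024}' /proc/meminfo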
Ah, so the doc is wrong then.
It says the range is 0 to 100.
It’s been 3 days now since the zfs_arc_shrinker_limit change. No OOM crashes, and the local console remains accessible. I do see occasional messages such as
nas1 kernel: loop0: detected capacity change from 0 to 2575752
systemd-journald[617]: /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal: Journal header limits reached or header out-of-date, rotating.
Not sure why, as there are no VMs or apps running; it’s just a straightforward NAS.
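For the record, this is how I double-check that the tunable is still applied after a reboot, just reading the module parameter back:

# Should print 0 if the zfs_arc_shrinker_limit change took effect
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit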
What is the loop0 device?
I see an OOM kill every day, 11 seconds after midnight. I moved the scrub jobs to 1am and 2am, so they are no longer involved. I also created a ticket and attached the debug in case it helps.
At the same time, the description just above that table says it can exceed 100.
Confusingly written.
Hi all, the thread has gone quiet, so I assume most people are having some success with the change.
Unfortunately I am still seeing some of these issues. Under heavy NFS workloads the OOM errors have stopped with the changes provided, but there are still a lot of situations that trigger them.
Loading the audit logs from the GUI is one of them, but more importantly, trying to delete a 2.5 TB backup file via a Windows SMB share caused a slew of OOM errors again and the system ultimately crashed.
It is crashing again when trying to re-create a backup file of the same size over SMB on a 10 Gb link. This was to a 24 TB RAID-Z2 HDD pool.
I will pull logs when I can. I have limited time today, so I am focusing on adding an additional backup option, as this many crashes of core system services has me worried. If the system has triggered a lot of OOM errors, a normal shutdown sometimes takes 15+ minutes, with disk activity on all pools pinned at 100% for the entire duration.
I think in this context it would be the file-backed portion of the page cache.
Technically it’s defined as a virtual block device in Unix-like operating systems that allows files to be mounted as if they were a physical disk or partition. It creates a pseudo-device that can emulate storage media, enabling users to access and manipulate files or disk images as if they were actual devices.
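If you want to see what loop0 actually maps to on your own system, losetup from util-linux will show the backing file; a minimal check, nothing TrueNAS-specific:

# List all active loop devices and their backing files
losetup --list

# Or just the one in question
losetup /dev/loop0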