Yeah, that can be ignored. Squashfs is used as part of the update system in the background; that message is normal.
I had another OOM kill 12 seconds after midnight. The only thing that was scheduled at the time was a scrub task for both pools. It looks like neither scrub task completed, as the dashboard still shows “never” for the last run. There should not have been any other pressure on the system at that time.
Scrub can be relatively memory-intensive, especially if several pools are scrubbed at the same time. You should try to spread them out. And IIRC that memory is not counted towards ARC, so it may create a spike of system-wide memory pressure, and the ARC is supposed to shrink on a signal from the OS if needed. I’d need to know more about it; a TrueNAS debug would be good.
PS: Whether the scrub actually ran, I would check directly via zpool status.
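Something like this should show the last scan/scrub line for every pool at once (the pool name “tank” below is just a placeholder):

# One line per pool: the pool name followed by its last scan result
zpool status | grep -E 'pool:|scan:'

# Or check a single pool, e.g. a hypothetical pool called tank
zpool status tank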
zpool status shows that a scan only ever ran on the boot pool. I have spread the scrub tasks apart by an hour for now and will see how that goes. It seems odd though, as the system has 64GB and is doing nothing else but sit there at that time. I downloaded a debug, happy to upload that somewhere if it helps.
It probably only ran for a few minutes then, depending on the device. I am following your journey closely, as I am not upgrading until these OOM issues can be fixed, since they disabled swap space.
I downloaded a debug, happy to upload that somewhere if it helps.
I don’t know if you can send it to me here, but you could send me a link. Or you may open a ticket, where debugs are uploaded in a way visible only to iX developers.
Still no OOM crashes here. Just some messages like below.
Jun 27 11:38:41 nas1 systemd-journald[617]: Data hash table of /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal has a fill level at 75.0 (8533 of 11377 items, 6553600 file size, 768 bytes per hash table item), suggesting rotation.
Jun 27 11:38:41 nas1 systemd-journald[617]: /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal: Journal header limits reached or header out-of-date, rotating.
Jun 27 11:38:41 nas1 systemd-journald[617]: Failed to set ACL on /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/user-1000.journal, ignoring: Operation not supported
Jun 28 00:38:18 nas1 kernel: loop0: detected capacity change from 0 to 2575752
@das1996 Do you have ARC statistics to look at for unexpected/excessive drops? To reduce possible ones, I am thinking of also setting zfs_arc_pc_percent to 200 or 300.
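If somebody wants to experiment before a release carries it, here is a rough sketch of changing the module parameter at runtime (it does not persist across reboots; on SCALE you would typically wire it up as a post-init command):

# Read the current value of the tunable
cat /sys/module/zfs/parameters/zfs_arc_pc_percent

# Try 300 at runtime; this is lost on the next reboot
echo 300 > /sys/module/zfs/parameters/zfs_arc_pc_percent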
What specifically do you want me to look at? I never really looked at ZFS stats before, as my use case doesn’t benefit all that much from cache (usually lots of sustained writing or reading).
Any erratic ARC size behavior: random drops leaving plenty of free RAM, etc.
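If it helps, this is roughly all that is needed to watch it (arcstat ships with OpenZFS, so nothing extra should need to be installed):

# Current ARC size, target and limits straight from the kernel stats
grep -E '^(size|c|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats

# Or watch it live, one sample every 10 seconds
arcstat 10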
The dip to 0 between 6/26 and 6/27 was when it was last rebooted, following zfs_arc_shrinker_limit being set to 0. Does this graph tell you anything useful?
I see a number of deep dips. If memory at those times was used by some apps, VMs, or something else useful, that is just fine. If it was free, that may be wrong, especially if you haven’t deleted anything massive that could cause legitimate ARC evictions. If it was used by the page cache, then setting zfs_arc_pc_percent=300 should reduce those, and I think that should be a reasonable default. Unless somebody has other ideas, I am going to include that in the next 24.04.2 release.
I thought it was a percent, i.e., 0 to 100?
It is a percent, but values above 100 are officially legal there. It just means that the ARC will not agree to shrink to less than 3x the file-backed portion of the page cache. And if something needs more memory, the kernel will have to shrink the page cache first, which for file-backed pages should be possible even without swap.
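A back-of-the-envelope illustration of that ratio; this only shows the relationship described above using /proc/meminfo, not the exact bookkeeping ZFS does internally:

# Sum the file-backed page cache and print the implied ARC floor at 300%
awk '/^Active\(file\)|^Inactive\(file\)/ {kb += $2}
     END {printf "file-backed page cache: %d MiB, implied ARC floor at 300%%: %d MiB\n", kb/1024, 3*kb/1024}' /proc/meminfo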
Ah, so the doc is wrong then.
It says the range is 0 to 100.
It’s been 3 days now since the zfs_arc_shrinker_limit change. No OOM crashes, and the local console remains accessible. I do see occasional messages such as
nas1 kernel: loop0: detected capacity change from 0 to 2575752
systemd-journald[617]: /var/log/journal/6cbbb533d69b400c852bfff245b1fa40/system.journal: Journal header limits reached or header out-of-date, rotating.
Not sure why, as there are no VMs or apps running; it’s just a straightforward NAS.
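For the record, this is how I double-check that the tunable is still applied after a reboot, just reading the module parameter back:

# Should print 0 if the zfs_arc_shrinker_limit change took effect
cat /sys/module/zfs/parameters/zfs_arc_shrinker_limit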
What is the loop0 device?
I see an OOM kill every day, 11 seconds after midnight. I moved the scrub jobs to 1am and 2am, so they are no longer involved. I also created a ticket and attached the debug in case it helps.
At the same time, the description just above that table says it can exceed 100.
Confusingly written.
Hi all, the thread has gone quiet, so I assume most people are having some success with the change.
Unfortunately I am still seeing some of these issues. Under heavy NFS workloads the OOM errors have stopped with the changes provided, but there are still a lot of situations that trigger them.
Loading the audit logs from the GUI is one of them, but more importantly, trying to delete a 2.5 TB backup file via a Windows SMB share caused a slew of OOM errors again and the system ultimately crashed.
It is crashing again when trying to re-create a backup file of the same size over SMB on a 10 Gb link. This was to a 24 TB RAID-Z2 HDD pool.
I will pull logs when I can. I have limited time today, so I am focusing on adding an additional backup option, as this many crashes of core system services has me worried. If the system has triggered a lot of OOM errors, a normal shutdown sometimes takes 15+ minutes, with disk activity on all pools pinned at 100% for the entire duration.
I think in this context it would be the file-backed portion of the page cache.
Technically it’s defined as a virtual block device in Unix-like operating systems that allows files to be mounted as if they were a physical disk or partition. It creates a pseudo-device that can emulate storage media, enabling users to access and manipulate files or disk images as if they were actual devices.
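If you want to see what loop0 actually maps to on your own system, losetup from util-linux will show the backing file; a minimal check, nothing TrueNAS-specific:

# List all active loop devices and their backing files
losetup --list

# Or just the one in question
losetup /dev/loop0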