Going insane, TrueNAS webUI dies, again

Then how do we revert to 50%? Where should I do that, and how do I get there?

See above, or via this link:

Thank you, it worked


Mine is using even less now, still unmodified.

I just started seeing this on my system that I upgraded from CORE to SCALE. When I was running the release candidate this was not occurring, but after upgrading to the official release my system started to go unresponsive every night. My monitoring system (Nagios) would show CPU hitting 100% and the system processes slowly increasing until eventually it would start timing out. The WebUI and SSH would go unresponsive, and the only way to recover was to hard reset the system through the IPMI module. When it came back up there were no events in the logs to debug with, as all logging just stopped at the same time the CPU hit 100%. I’ve implemented this fix and rebooted the system. Hoping this will address it. :crossed_fingers:

In an update to my post above, I found something interesting. It seems that ZFS kernel settings are preserved across CORE-to-SCALE upgrades. So, for users who performed upgrades: check System Settings → Advanced in the Sysctl section. The upgrade carried over the vfs.zfs.arc_max setting, which for me was set to 95% of my system memory. I only found this after I used the script discussed here to set the parameter directly in the mentioned files. The system let me, but I noticed no effect until after I disabled the setting in Sysctl. Could this be a possible bug, in that these settings should NOT be preserved from CORE to SCALE? :thinking: I mention this as I’ve seen talk that a fresh install doesn’t have this issue or resolves it.
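For anyone else checking whether a carried-over sysctl is fighting the module parameter, a read-only sketch (the paths are the standard OpenZFS module-parameter locations on a SCALE box; nothing here changes any setting):

```shell
# Read the live ARC ceiling from the ZFS kernel module.
# A value of 0 means "use the built-in default", not "no limit set by you".
cat /sys/module/zfs/parameters/zfs_arc_max

# Compare against total memory to see what fraction the ceiling allows
# (MemTotal in /proc/meminfo is reported in kiB).
awk '/MemTotal/ {print $2 * 1024 " bytes total"}' /proc/meminfo
```

If the module parameter reads back a large non-zero value you never set by hand, the WebUI Sysctl entry is a likely source.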

I don’t believe this is what causes the issue, as I am running from an initial install of SCALE (upgraded from Bluefin to Cobia to Dragonfish), so nothing could have been carried over from CORE.
In any case, the default for Dragonfish is very high:

ARC size (current):                                    71.1 %   88.7 GiB
        Target size (adaptive):                        71.0 %   88.6 GiB
        Min size (hard limit):                          3.2 %    3.9 GiB
        Max size (high water):                           31:1  124.7 GiB

Seems to be an issue with swapping specific to Dragonfish, as I was running zfs_arc_max=95% and zfs_arc_sys_free=1GiB (the equivalent in bytes) on Cobia for a period of time with no issues; this was from some testing done in late 2023. That worked fine, but issues started only after the upgrade.
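For reference, both module parameters take raw byte values, so the “95%” and “1GiB” above have to be converted. A sketch of that arithmetic (the percentages are the ones quoted in this post, not recommendations):

```shell
# Total memory in bytes (MemTotal in /proc/meminfo is in kiB).
mem_bytes=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))

# zfs_arc_max at 95% of memory, zfs_arc_sys_free at 1 GiB, as bytes.
arc_max=$(( mem_bytes * 95 / 100 ))
sys_free=$(( 1024 * 1024 * 1024 ))

echo "zfs_arc_max=$arc_max zfs_arc_sys_free=$sys_free"
```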


So, just to be sure: you were running at 95% on Cobia (not just testing), it all changed on Dragonfish, and you are still at 95%? Is that correct @essinghigh? That’s interesting.

To clarify, my CORE system has it set at 95% also with no issues. What I am suggesting is that something in the latest SCALE Dragonfish release version has an issue with this setting. As I mentioned I didn’t have this issue until I moved to the official release. The Dragonfish RC didn’t have any issues at all for me.


I’d been running 95% for probably 1-2 weeks on Cobia around the time the thread started, then I switched to 90% with zfs_arc_sys_free at 8GiB and ran that consistently with no issues until the upgrade to Dragonfish. With the same config on Dragonfish, with swap enabled, I can reproduce the issue. Wasn’t able to on Cobia.

If I disable swap on Dragonfish it runs fine, so it seems to be some issue with the system swapping far too aggressively instead of resizing the ARC properly.

I think in future I’ll run without swap and with a defined zfs_arc_sys_free value, as I have more than enough memory to not run into an OOM condition (and swap is generally something to be avoided except as a safety net).

EDIT: to add some clarification, what I’m running right now to test is the Dragonfish defaults (arc_max=0 and sys_free=0) with swap disabled, and that works fine.
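For anyone wanting to reproduce the swap-off test, a sketch (assumes a root shell; on SCALE, swap can also be controlled through the system configuration rather than by hand, and a manual swapoff does not persist across reboots):

```shell
# Show currently active swap devices/files, if any.
swapon --show

# Disable all swap for the current boot only.
swapoff -a
```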

To add to what @essinghigh says, I upgraded my system from CORE to Cobia and ran it for about a week, but had all kinds of issues with the apps. The forums said that to resolve them I should update to the Dragonfish RC, which solved my issues. The system was stable for a month while I sorted out my install of Nextcloud and some VMs that I have on the system. When the official Dragonfish was released I updated and ran fine for about 3-4 days. After that is when the symptoms started for me. Now it happens to me just about every night when I replicate my datastores between my two systems for backup. The SCALE system goes dark and the CORE system just keeps on trucking. Also, only the WebUI and other core services are affected. All my VMs and Apps keep right on trucking with no issue.


To sum up what I have done to date on my SCALE Dragonfish system:

/sys/module/zfs/parameters/zfs_arc_max set to 50% of memory
/sys/module/zfs/parameters/zfs_arc_sys_free set to 16G
disabled vfs.zfs.arc_max in System Settings → Advanced → Sysctl

I’ll run this overnight and see what happens.
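The first two steps above can be sketched as a root shell session (the 50% and 16G values are the ones listed; the Sysctl removal is done in the WebUI and isn’t shown, and these writes do not persist across reboots on their own):

```shell
# 50% of total memory, in bytes (MemTotal in /proc/meminfo is in kiB).
half_mem=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 / 2 ))

# Cap the ARC at half of memory.
echo "$half_mem" > /sys/module/zfs/parameters/zfs_arc_max

# Ask ZFS to keep at least 16 GiB free for the rest of the system.
echo $(( 16 * 1024 * 1024 * 1024 )) > /sys/module/zfs/parameters/zfs_arc_sys_free
```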

Cross-posting for visibility.

We can confirm that the significant contributor to the issue that has been reported in this thread and others is related to a Linux kernel change in 6.6. While swap adjustments proposed can mitigate the issue to some degree, the multi-gen LRU code in question will be disabled by default in a subsequent Dragonfish patch release, and should resolve the problem. This can be done in advance following the directions below from @mav. Thanks to all of the community members who helped by reporting the issue which aided in reaching the root cause. Additional feedback is welcome from those who apply the change below.
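@mav’s directions aren’t quoted in this thread, but on kernels with multi-gen LRU (such as the 6.6 kernel mentioned above) the runtime toggle lives in sysfs; a sketch, run as root, which does not persist across reboots on its own:

```shell
# Check the current state: the file holds a hex bitmask of enabled
# components; 0x0000 means multi-gen LRU is fully disabled.
cat /sys/kernel/mm/lru_gen/enabled

# Disable it for the running kernel.
echo n > /sys/kernel/mm/lru_gen/enabled
```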


Thanks @essinghigh, that little script has got my box back under control. Really appreciated!

Have you tried disabling multi-gen LRU as mentioned in @mav’s post?
The script worked well for me as a break-fix; however, Dragonfish now seems to be working fine after disabling this and letting the ARC run at its defaults.

Not yet. I saw your script yesterday and was able to get that done last night. I didn’t see the multi-gen LRU one until this morning, and I don’t trust myself with a Linux shell at 6:30am on not enough coffee :wink:


I get permission denied. I also get that if I try to sudo it.

Ok, weirdly it worked when I changed to my scratch drive. Can’t explain that one. The main thing is I can still log in while transferring all my data onto the NAS for the final go-live.

Hello,
Not sure if this is the same problem, but my SCALE is also slowing down (though not blocked).


This is the output of the

top -o VIRT

command.

I can confirm that turning off lru_gen solves the excessive-swapping problem on my test system. (My prod system won’t upgrade until later…) That’s with no changes to arc_max, arc_sys_free, or swappiness.

Of course, changing (just) lru_gen also raises the free memory under pressure from 1.2GB to 6.2GB (on a 32G system), severely reducing memory pressure at the cost of ~5GB. I note that setting arc_sys_free to 6GB instead reduces the intensity of persistent swapping but does not eliminate it outright, so it looks like lru_gen itself changes the interaction between the kernel’s memory system and ZFS’s ARC.

Thanks to everybody who contributed to this thread; it got me a much better starting point on figuring this out.

Cheers
– perry
