RAM Size Guidance for Dragonfish

Adding this to the list, TrueNAS is unusable again: Going insane, truenas webUI dies, again


TrueNAS SCALE running on:
i5-13500
126GiB RAM (non-ECC)
2x RAID-Z1 with 5 disks each
1x mirror with 2 SSDs for apps
1x single disk (SSD) for temp storage
1x SSD for boot

Swap is being used at 1-2 GB even with 90 GB used for the ZFS cache; after limiting the cache to 50%, swap is still getting used.

EDIT: after a reboot with the 50% cache limit in place, swapping seems to have stopped.
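
In case it helps anyone else, this is roughly how the 50% cap can be applied: a sketch assuming the usual approach of writing the zfs_arc_max module parameter from an Init/Shutdown (post-init) command; the value is in bytes, so recalculate it for your own RAM size.

# 50% of 126 GiB is 63 GiB, i.e. 63 * 1024^3 bytes
echo 67645734912 > /sys/module/zfs/parameters/zfs_arc_max

From what I’ve seen, the change takes effect without a reboot, though the ARC can take a while to shrink down to the new cap.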

Does Dragonfish automatically create swap partitions on the boot drives when installing?

I’ve always had some swap usage, even in pre-Dragonfish days, and never had any issues with it. Not sure what was on it, but it never affected performance.
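
If you want to see where your swap actually lives and how much of it is in use, these standard Linux commands will show it; nothing SCALE-specific here:

swapon --show
free -h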

Current TrueCharts docs maintainer here. +1ing to this thread partly because I see some of the issues on my personal system as of DragonFish, but also because of the number of people on our Discord reporting both the issues described here and that limiting ZFS RAM usage to either 50% or 75% completely resolves the problem. We’re pointing people towards this thread.

Relevant SCALE specs include:

– AsRock Rack X470D4U2-2T
– AMD Ryzen 5 5600
– Nvidia Quadro P2000
– 64GB of DDR4 2400MHz ECC RAM
– LSI 9207-8i SATA/SAS Host Bus Adapter PCIe card running in IT mode
– 8x WD Ultrastar HC550
– Multiple Samsung NVMe M.2 & SATA SSDs for OS & container data storage

My SCALE system’s use case is mixed; it serves double duty as both an apps and NAS host, accessed primarily over the network via SMB.

Since upgrading from Cobia to DragonFish, I’ve observed the unresponsive GUI issues people have described, very poor responsiveness, and other odd behaviour from the system, though not to the extent that services/apps are being stopped or the system becomes unreachable.

Since DF released, the system has never gone below 2GB of available RAM. However, swap usage slowly increases over time, as seen in this album here. The missing portions in the graphs are from the Cobia → DragonFish upgrade and the reboot of my SCALE system I just did. You can also see a Dashboard view showing what typical system load looks like.

I’ve also observed since upgrading to DF that at times there is a lot of CPU usage from middlewared, as well as large amounts of iowait on the CPU, presumably from the low amount of RAM left for anything in the system that isn’t ZFS. There are entire multi-hour periods of the day when the system is completely idle, with apps and other services disengaged and me not even awake or using it, yet there are sporadically high amounts of middlewared CPU usage to the tune of 1-2 entire CPU cores.

My issues definitely aren’t as bad as other people’s; however, in general DragonFish is a large regression in overall performance and efficiency compared to Cobia. It also doesn’t help that any time you have the web interface open, CPU usage goes vertical until it’s closed again. I’m almost, not quite but almost, unable to use the Shell in the SCALE GUI after a few days of uptime because it gets that slow. If I run ‘top’ in the Shell I can actually see parts of the window refreshing before others; that’s how sluggish the GUI is on what I’d class as a modern system.

Having to constantly CTRL + F5 a browser window because of SCALE’s GUI issues is painful enough as it is.

I can see some fixes/changes for 24.04.1 are in the pipeline; here’s hoping that release fixes the issues. The issues described in this thread and, to be honest, elsewhere with DF are exactly the same ones I saw reported in various Jira tickets prior to DF’s release. Those tickets either went unanswered for days, were closed as likely being a TrueCharts issue (clearly not the case), or were otherwise ignored, which is sadly indicative of a larger trend in attitude that I’ve noticed from iX post-Bluefin. I called out some of these issues as likely needing to be resolved before DF released, lest we run into problems soon after release.

Sure enough, approaching two weeks since DF’s release, reports have been creeping up seemingly everywhere. I can’t help but think DF needed another week or two in the oven.


Pinned for a week in order to achieve better visibility.

An update to this: limiting the ARC to 50% of RAM resolved their issue. (AKA: It’s acting properly, just like it did on Cobia.)

I think we’re seeing more and more evidence that the breaking “change” between Cobia → Dragonfish is indeed the “tweak” to allow the ARC to grow as large as needed, a la “FreeBSD-style”.

Setting the ARC limit back to 50% of RAM simply reverts it to the default upstream OpenZFS parameter for ZFS on Linux.[1]

I’d wager the reason for this decision upstream is indeed the differences in memory management of Linux vs FreeBSD in regards to ZFS/ARC. Why 50%? I don’t know the exact reason, and maybe someone can prove me wrong, but my guess is that their reasoning was along the lines of: “It’s low enough to prevent issues with non-ARC memory pressure competing with ARC in RAM.” In other words: It’s a good enough “safe” value for Linux systems.
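
If you want to see what that 50% default works out to on your own box, half of physical RAM in bytes can be read straight off /proc/meminfo (a quick sketch; MemTotal is reported in kB), and the current cap lives in the module parameter:

awk '/MemTotal/ {printf "%.0f\n", $2 * 1024 / 2}' /proc/meminfo
cat /sys/module/zfs/parameters/zfs_arc_max

A value of 0 from the second command means “use the built-in default”.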


  1. zfs_arc_max default for Linux and FreeBSD ↩︎

This actually starts to make sense if people are failing on small amounts of RAM on Dragonfish: not because the small RAM itself causes the issue, but because small RAM makes the issue appear faster than a large-RAM ARC would. It usually takes me 20+ hours of continuous heavy I/O to start noticing the issue on my 1TB of RAM.

*** I want to add that I am not ONLY experiencing web freezes, but also a significant performance drop. While iperf3 is fine, CPU usage is super low, and everything else is in check, when the issue happens the web UI locks up, my sequential throughput drops from 800MB/s to 300MB/s, and random 4K and file traversal slow to a crawl, about 3 files per second. Basically unusable until a reboot. I’m very tempted to try the 50% cap, but really don’t want to. Strangely, why did no one have any issue back in 22/23 with the ARC-over-50% hack? hmmmm ***


50% was not an arbitrary value: it came from the way the Linux kernel allocates memory.
iXsystems sponsored work on OpenZFS to accommodate a larger ARC in Linux. But maybe the issue really has to be addressed from the side of the kernel itself…


They were probably setting it to something like 75% or 80%.

The default for FreeBSD is to set the “zfs_arc_max” to RAM minus 1 GiB.

Therefore, if you have 32 GiB of RAM? The “zfs_arc_max” is set to 31 GiB.

If you have 128 GiB of RAM? The “zfs_arc_max” is set to 127 GiB.

From what I understand, Dragonfish sets “zfs_arc_max” to be on par with FreeBSD, which I assume means “RAM minus 1 GiB”.

That’s radically different than someone setting it to 50% or 75% or 80%.

Only iXsystems can answer this question (since I’m too stupid and lazy to search through the source code): My guess is that with Dragonfish, it sets the module parameter zfs_arc_max at every bootup to equal total RAM minus 1 GiB.
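
One way to sanity-check that guess on a running Dragonfish box, assuming the reported figures mean what I think they do, is to compare the ARC’s reported max size against physical RAM minus 1 GiB:

arc_summary | grep "Max size"
awk '/MemTotal/ {printf "RAM minus 1 GiB = %.1f GiB\n", ($2 * 1024 - 1073741824) / 1073741824}' /proc/meminfo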

Additional data points after the system was left entirely unattended since my last post ~22 hours ago: Imgur: The magic of the Internet

In the first picture, you can see that a couple of hours after my last post here, swap usage occurred again, despite free RAM remaining at the same level when swap was engaged. In the middle of the graph in the first picture you can see that at some point a bit of swap usage went down and available RAM went back up, and then roughly 4 hours later free RAM went down and swap usage climbed higher than before.
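
For anyone who’d rather catch this live than read it off the graphs afterwards, the standard vmstat tool shows swapping as it happens; the si/so columns are KiB swapped in and out per second over each 5-second sample:

vmstat 5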

In the second picture, you can see me simply trying to load the Disks page of the web UI just now. After only a bit over 2 days of uptime, the web UI is now so slow that this page takes almost 15 seconds to load instead of the usual 2-3 seconds. This behaviour is repeated on other pages, such as the Reporting/graphs one.

The third photo shows disk I/O activity on the Samsung SATA SSD housing my SCALE install for the last 24 hours. Here we can see that right before swap usage started happening in the first photo, a ton of writes hit the SSD for roughly an hour straight.

The fourth photo shows CPU metrics for the last 24 hours. Note the sporadic/raised CPU usage from 06:00 to 07:00 despite the system being unused at that time, a spike at 11:00 with others over the course of an hour, a minor increase from 13:00 to 14:00, and then the system was completely idle until 19:00, when a ramp in CPU usage began that has persisted and is still continuing.

Throughout this time the system was unused entirely, by anyone, and I was actually asleep at 19:00 when the CPU usage started up again.

In the fifth/last photo, I ran ‘top’ in the SCALE GUI Shell and, sure enough, the top CPU culprit is middlewared, which has 4 processes doing… whatever middlewared does. This is constant and continual, has presumably been the case since 19:00 in the above photo, and I’ll now need to go and reboot SCALE once again to reclaim the quarter of my CPU that middlewared is using.
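
If anyone wants to capture the same thing without screenshotting top, a plain ps one-liner (standard procps, nothing SCALE-specific) lists the worst CPU offenders and how long they’ve been running:

ps -eo pid,pcpu,pmem,etime,comm --sort=-pcpu | head -n 10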

This may all be moot or otherwise irrelevant once the next SCALE update comes out with those aforementioned fixes in it, we’ll have to see and hope. I probably won’t post in here again since it’d just be me reposting the same repeating behaviour, but hopefully it’ll encourage others to share data from their systems as well.


Curious, what value does this command yield?

cat /sys/module/zfs/parameters/zfs_arc_max

I removed my post-init value from Cobia before updating to Dragonfish, and it reads 0 for me on Dragonfish with default settings.

What does this reveal?

arc_summary | grep "Max size"

And how much physical RAM is available to the OS?

Total available Memory is 62.7 GiB


Interesting. So this “tweak” isn’t simply changing the parameter’s value upon bootup. They must have modified the ZFS code itself for SCALE?

Because “0” is the “operating system default”, which for upstream OpenZFS for Linux is 50% of RAM. However, even though you’re using “0” for the default… it’s set to exactly 1 GiB less than physical RAM. (AKA: The “FreeBSD way”.)

@bitpushr, do you find any relief to these issues if you apply this “fix”, and then reboot?

Confirm the change is in effect (after you reboot) with this command:

arc_summary | grep "Max size"

Simply outputs “0”

Take a look.

What about this?

I know it will require a reboot, so whenever it’s convenient for you.

Have set it; there’s 64GB in my system as well, so I just copied the command from the post you linked and set it as a Post Init command. I can’t reboot the system currently, though, and will then have to observe system behaviour for at least 24 hours after the change to see if there are any differences.
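
For anyone else following along, the command is along these lines; treat it as a sketch rather than the exact text of the linked post. 50% of 64 GiB is 32 GiB, i.e. 34359738368 bytes, written to the module parameter as a Post Init command:

echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max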
