RAM Size Guidance for Dragonfish

I would “ping” the usernames that have posted such threads / replies. They might not really “notice” this thread.

EDIT: The topic title might attract more attention with something like “RAM / swap / performance issues with Dragonfish? Please share your experience in here”

I don’t know… hence we want to find real examples if they are happening. In another thread there was one report, but not well documented.


Some users of note:

@cmplieger posted this. Slowdowns when running qBittorrent after some time, until system eventually freezes. No issue when reverted back to Cobia. (Update: While in Dragonfish, issues are resolved when the ARC is limited to 50% of RAM.)

@Noks posted this. Slowdowns, swapping, freezing, and sluggish web UI with Dragonfish. They resolved their issue by setting a parameter to limit the ARC to 50%.

@anto294 posted this. Same issue, same solution: Limiting the ARC maximum size to 50% of RAM resolved the problem.

@SnowReborn posted this. Slowdowns and freeze-ups that started with Dragonfish. Temporarily resolved with reboots, until it occurs again. (User has not tried changing the parameter to limit ARC to 50% of RAM followed by a reboot to see if it resolves their issues.)
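For reference, the workaround these users applied amounts to writing half of physical RAM, in bytes, into the zfs_arc_max module parameter. A minimal shell sketch (the helper name is mine; on SCALE you’d typically run the echo line from an init/pre-init script so it survives reboots):

```shell
#!/bin/sh
# Compute 50% of physical RAM in bytes -- the value the users above
# set as their ARC cap. /proc/meminfo reports MemTotal in kiB.
half_ram_bytes() {
  awk '/^MemTotal:/ { printf "%d\n", $2 * 1024 / 2 }' /proc/meminfo
}

# To apply until the next reboot (run as root on the TrueNAS host):
#   echo "$(half_ram_bytes)" > /sys/module/zfs/parameters/zfs_arc_max
half_ram_bytes
```

Writing 0 back into the same parameter returns it to the running system’s default behaviour.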


Official Dragonfish 24.04.0 release, freshly installed; TrueNAS SCALE running under ESXi, with 1 TB RAM and 32 cores (2690v4) reserved. I have experienced the UI freezing up completely, plus significant speed and I/O throttling, 3 times in 4 days while migrating 80 TB worth of data from a Windows client to ZFS. The only correlation I see is that when the UI lockup happens, swap usage is around 15–20%, and when I stop the file transfer, swap usage goes down. The top swap user is “asyncio_loop”. Every other resource utilization seems LOW: about 5–20% average CPU usage, 15 GB RAM in services, and cool temperatures on everything. iperf3 checks out at normal bandwidth. A restart solves the issue for 20+ hours. Not sure what triggers it, other than just lots of reads and writes.
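The observation above that asyncio_loop is the top swap user can be checked by ranking processes by their VmSwap counter; a rough sketch of my own (not a TrueNAS tool), reading /proc directly:

```shell
#!/bin/sh
# List the ten processes holding the most swap, largest first.
# VmSwap in /proc/<pid>/status is reported in kiB; processes can
# exit mid-scan, hence the error suppression.
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ { n = $2 }
       /^VmSwap:/ { if ($2 + 0 > 0) print $2, n }' "$f" 2>/dev/null || true
done | sort -rn | head -10
```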


I recently went from Bluefin to Dragonfish on my primary server, and from Cobia to Dragonfish on a secondary server. The only thing these two servers have in common hardware-wise is that they are both Supermicro and both use essentially the same brands of drives. Use-wise they are both backup and file servers; the primary also runs Tailscale. Neither uses VMs or iSCSI, and both contain a few SMB datasets and shares. There is little file serving, and most of the activity is various computers and laptops sending backups to the primary server. In other words, these servers are way overpowered.

These 2 systems have been up for primary: 4 days, and secondary: 5 days after the updates to Dragonfish.

Drive Space:
Data space on the primary: usable capacity of 82.63 TiB, with about 24.51 TiB (29.7%) of available space used.
Data space on the secondary: usable capacity of 87.06 TiB, with about 27.3 TiB (31.4%) of available space used.

Memory:
The primary has 128 GiB of RAM divided between its two processors. The secondary has 64 GiB of RAM divided between two processors. Both of the original TrueNAS installs used the defaults of the install program that was current at the time of install. The primary backs up to the secondary each night via an rsync-over-SSH task created within the Data Protection tab, with data pushed to the secondary server.

What I have noticed after updating to Dragonfish is that the primary server, with its 128 GiB of RAM, actually hits and uses a small bit (779 MiB) of its 15 GiB of swap space. The secondary server’s swap utilization is 268 KiB out of 9.99 GiB. These values are as reported from the Reports tab >> Memory tab.

What the primary shows in use:
Usable: 125.8 GiB total available (ECC)
Free: 14.6 GiB
ZFS Cache: 105.5 GiB
Services: 5.7 GiB

The secondary server shows:
Usable: 62.8 GiB total available (ECC)
Free: 17.7 GiB
ZFS Cache: 33.5 GiB
Services: 11.6 GiB

Of course these usages vary some depending on what the servers may be doing at the time. But the overall distribution is pretty consistent.

I do not know if the swap usage was typical of these systems before, as it didn’t seem to be a concern with anyone and I never paid attention. The systems were previously limited to 75% and 50% of memory for cache. Any scripts to alter cache memory were removed before the upgrades to Dragonfish.

It is my opinion that on such lightly loaded and used systems, no swap should be used. To me, the occasional hit on swap means the ZFS cache’s trigger to back off and release memory to the system in Dragonfish is a bit slow to respond, which is already suspected and being looked into.

I have so far not experienced any unusual slowdowns, freeze ups, crashes of the GUI, failure of any tasks, ssh access or use through Tailscale. I don’t see any memory exhaustion.

I can see how a slow release of cache memory back to the system could cause an out-of-memory issue for VMs or running processes, and an issue of the GUI and other tasks being forced into paging to swap, grinding things to a halt at least until the system could release ZFS cache memory and get back to normal.


Adding this to the list; TrueNAS is unusable again: Going insane, truenas webUI dies, again


TrueNAS SCALE running on:
i5-13500
126 GiB RAM (non-ECC)
2x RAID-Z1 with 5 disks each
1x mirror with 2 SSDs for apps
1x single disk (SSD) for temp storage
1x SSD for boot

Swap is being used at 1–2 GB even with 90 GB used for ZFS cache; after limiting the cache to 50%, swap is still getting used.

EDIT: After a reboot with the 50% cache limit in place, swapping seems to have stopped.

Does Dragonfish automatically create swap partitions on the boot drives when installing?

I’ve always had some swap usage, even in pre-Dragonfish days, and never had any issues with it. Not sure what was on it, but it never affected performance.

Current TrueCharts docs maintainer here. +1ing this thread, partly because I see some of the issues on my personal system as of DragonFish, but also because of the number of people reporting on our Discord both the issues described here and that limiting ZFS RAM usage to either 50% or 75% completely resolves the problem. We’re pointing people towards this thread.

Relevant SCALE specs include:

– AsRock Rack X470D4U2-2T
– AMD Ryzen 5 5600
– Nvidia Quadro P2000
– 64GB of DDR4 2400 MHz ECC RAM
– LSI 9207-8i SATA/SAS Host Bus Adapter PCIe card running in IT mode
– 8x WD Ultrastar HC550
– Multiple Samsung NVMe M.2 & SATA SSDs for OS & container data storage

My SCALE system’s use case is mixed; it serves double duty as both an apps and NAS host, accessed primarily over the network via SMB.

Since upgrading from Cobia to DragonFish, I’ve observed the unresponsive GUI issues people have described, very poor responsiveness from the system, and other odd behaviour from the system but not to the extent that services/apps are being stopped or the system becomes unreachable.

Since DF released, the system has never gone below 2 GB of available RAM. However, swap usage slowly increases over time, as seen in the linked album. Missing portions in the graphs are from the Cobia → DragonFish upgrade and the reboot of my SCALE system I just did. You can also see a Dashboard view showing what typical system load looks like.

I’ve also observed since upgrading to DF that at times there is a lot of CPU usage from middlewared, as well as large amounts of iowait on the CPU, presumably from the low amount of RAM available for things in the system that aren’t ZFS. There are entire multi-hour periods of the day during which I’ve observed the system being completely idle, with apps and other services disengaged, including times I’ve not even been awake or using the system, yet with seemingly sporadic high middlewared CPU usage to the tune of 1–2 entire CPU cores.

My issues definitely aren’t as bad as other people’s, but in general DragonFish is a large regression in overall performance and efficiency compared to Cobia. It also doesn’t help that any time you have the web interface open, CPU usage automatically goes vertical until it’s closed again. I’m almost, not quite but almost, unable to use the Shell in the SCALE GUI after the system has been running for a few days, because it gets that slow. If I run ‘top’ in the Shell, I can actually see parts of the window refreshing before others; that’s how sluggish the GUI is on what I’d class as a modern system.

Having to constantly CTRL + F5 a browser window because of SCALE’s GUI issues is painful enough as it is.

I can see some fixes/changes for 24.04.1 in the pipeline; here’s hoping that release fixes the issues. The issues described in this thread and, to be honest, elsewhere with DF, are exactly the same issues I saw reported in various Jira tickets prior to DF’s release. Those tickets either went unanswered for days, were closed as likely being a TrueCharts issue (clearly not the case), or were otherwise ignored, sadly indicative of a larger overall trend in attitude that I’ve noticed on iX’s part post-Bluefin. I called out some of these issues as likely needing to be resolved prior to DF releasing, lest we run into issues soon after release.

Sure enough, approaching two weeks since DF’s release, and reports have been creeping up seemingly everywhere. I can’t help but think DF needed another week or two in the oven.


Pinned for a week in order to achieve better visibility.

An update to this: limiting the ARC to 50% of RAM resolved their issue. (AKA: It’s acting properly, just like it did on Cobia.)

I think we’re seeing more and more evidence that the breaking “change” between Cobia → Dragonfish is indeed the “tweak” to allow the ARC to grow as large as needed, a la “FreeBSD-style”.

Setting the ARC limit back to 50% of RAM simply reverts it to the default upstream OpenZFS parameter for ZFS on Linux.[1]

I’d wager the reason for this decision upstream is indeed the differences in memory management of Linux vs FreeBSD in regards to ZFS/ARC. Why 50%? I don’t know the exact reason, and maybe someone can prove me wrong, but my guess is that their reasoning was along the lines of: “It’s low enough to prevent issues with non-ARC memory pressure competing with ARC in RAM.” In other words: It’s a good enough “safe” value for Linux systems.


  1. zfs_arc_max default for Linux and FreeBSD ↩︎
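To put numbers on that: with zfs_arc_max left at its default of 0, upstream OpenZFS on Linux caps the ARC at half of RAM. A quick arithmetic sketch for a hypothetical 64 GiB machine:

```shell
#!/bin/sh
# Upstream Linux default: zfs_arc_max = 0 means "cap ARC at 50% of RAM".
ram_gib=64
default_cap=$(( ram_gib * 1024 * 1024 * 1024 / 2 ))  # bytes
echo "$default_cap"  # 34359738368 bytes = 32 GiB
```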

This is actually starting to make sense: if people are failing on small amounts of RAM on Dragonfish, it’s not because the small RAM causes the issue, but because small RAM makes the issue appear faster than a large-RAM ARC would. It usually takes me 20+ hours of continuous heavy I/O to start noticing the issue on my 1 TB of RAM.

*** I want to add that I am not ONLY experiencing web freezes, but also a significant performance drop. While iperf3 is fine, CPU usage is super low, and everything checks out, when the issue happens the web UI locks up, my sequential speed drops from 800 MB/s to 300 MB/s, and random 4K and file traversal slow to a crawl, 3 files per second. Basically unusable until a reboot. Very tempted to try the 50% cap, but I really don’t want to. Strangely, why did no one have any issues back in ’22/’23 with the ARC-over-50% hack? Hmmm ***


50% was not an arbitrary value: it came from the way the Linux kernel allocates memory.
iXsystems sponsored work on OpenZFS to accommodate a larger ARC on Linux. But maybe the issue really has to be addressed from the side of the kernel itself…


They were probably setting it to something like 75% or 80%.

The default for FreeBSD is to set the “zfs_arc_max” to RAM minus 1 GiB.

Therefore, if you have 32 GiB of RAM? The “zfs_arc_max” is set to 31 GiB.

If you have 128 GiB of RAM? The “zfs_arc_max” is set to 127 GiB.

From what I understand, Dragonfish sets “zfs_arc_max” to be on par with FreeBSD. Which I assume means “RAM minus 1 GiB”.

That’s radically different than someone setting it to 50% or 75% or 80%.

Only iXsystems can answer this question (since I’m too stupid and lazy to search through the source code): My guess is that with Dragonfish, it sets the module parameter zfs_arc_max at every bootup to equal total RAM minus 1 GiB.
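To make the contrast concrete, here is the arithmetic for a hypothetical 32 GiB machine under both policies (my own sketch of the two rules described above):

```shell
#!/bin/sh
# FreeBSD-style cap: RAM minus 1 GiB.  Upstream Linux default: RAM / 2.
gib=1073741824                  # bytes in 1 GiB
ram=$(( 32 * gib ))             # 32 GiB machine
freebsd_style=$(( ram - gib ))  # 31 GiB = 33285996544 bytes
linux_default=$(( ram / 2 ))    # 16 GiB = 17179869184 bytes
echo "FreeBSD-style: $freebsd_style  Linux default: $linux_default"
```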

Additional data points, after the system was left entirely unattended since my last post ~22 hours ago (Imgur album):

In the first picture, you can see that a couple of hours after my last post here, swap usage occurred again, despite free RAM remaining at the same level at the time swap was engaged. In the middle of the graph you can see that at some point a bit of swap usage went down and available RAM went back up; then, roughly 4 hours later, free RAM went down and swap usage was this time higher than before.

In the second picture, you can see me simply trying to load the Disks page of the web UI just now. After only a bit over 2 days of uptime, the web UI is now so slow that this page takes almost 15 seconds to load, instead of the usual 2–3 seconds it would normally take. This behaviour is repeated on other pages, such as the Reporting/graphs one.

The third photo shows Disk I/O activity on the Samsung SATA SSD housing my SCALE install, for the last 24 hours. Here we can see that right before Swap usage started happening in the first photo, a ton of writes happened to the SSD for roughly an hour straight.

The fourth photo shows CPU metrics for the last 24 hours. Note the sporadic/raised CPU usage from 06:00 to 07:00, despite the system being unused at that time; a spike at 11:00, with others over the course of an hour; a minor increase from 13:00 to 14:00; and then the system was completely idle until 19:00, when there was a ramp in CPU usage that has persisted and is still continuing.

Throughout this time the system was unused entirely, by anyone, and I was actually asleep at 19:00 when the CPU usage started up again.

In the fifth/last photo, I ran ‘top’ in the SCALE GUI Shell and, sure enough, the top CPU culprit is middlewared, which has 4 processes doing… whatever middlewared does. This is constant and continual, has presumably been the case since 19:00 in the above photo, and I’ll now need to go and reboot SCALE once again in order to reclaim the quarter of my CPU that middlewared is using.

This may all be moot or otherwise irrelevant once the next SCALE update comes out with those aforementioned fixes in it, we’ll have to see and hope. I probably won’t post in here again since it’d just be me reposting the same repeating behaviour, but hopefully it’ll encourage others to share data from their systems as well.


Curious, what value does this command yield?

cat /sys/module/zfs/parameters/zfs_arc_max

I removed my post-init value in Cobia before updating to Dragonfish, and it reads 0 for me on Dragonfish with default settings.

What does this reveal?

arc_summary | grep "Max size"

And how much physical RAM is available to the OS?

Total available Memory is 62.7 GiB


Interesting. So this “tweak” isn’t simply changing the parameter’s value upon bootup. They must have modified the ZFS code itself for SCALE?

Because “0” is the “operating system default”, which for upstream OpenZFS for Linux is 50% of RAM. However, even though you’re using “0” for the default… it’s set to exactly 1 GiB less than physical RAM. (AKA: The “FreeBSD way”.)
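For anyone who wants to gather the same data points on their own system, here is a read-only sketch combining the commands from this exchange; each step is skipped if the file or tool is absent, so it is harmless to run anywhere:

```shell
#!/bin/sh
# Report the zfs_arc_max module parameter, the effective ARC max,
# and total physical RAM, skipping whatever is unavailable.
param=/sys/module/zfs/parameters/zfs_arc_max
if [ -r "$param" ]; then
  echo "zfs_arc_max parameter: $(cat "$param")"
fi
if command -v arc_summary >/dev/null 2>&1; then
  arc_summary | grep "Max size"
fi
grep '^MemTotal:' /proc/meminfo
```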