RAM Size Guidance for Dragonfish

Additional data points after leaving the system entirely unattended since my last post ~22 hours ago (screenshots are on Imgur):

In the first picture, you can see that swap usage occurred again a couple of hours after my last post here, even though free RAM was unchanged at the moment swap was engaged. In the middle of the graph, swap usage dropped a bit and available RAM went back up; roughly 4 hours later, free RAM went down and swap usage climbed higher than before.

In the second picture, you can see me simply trying to load the Disks page of the web UI just now. After only a bit over 2 days of uptime, the web UI is now so slow that this page takes almost 15 seconds to load instead of the usual 2-3 seconds. The same behaviour is repeated on other pages, such as the Reporting/graphs one.

The third photo shows Disk I/O activity on the Samsung SATA SSD housing my SCALE install, for the last 24 hours. Here we can see that right before Swap usage started happening in the first photo, a ton of writes happened to the SSD for roughly an hour straight.

The fourth photo shows CPU metrics for the last 24 hours. Note the sporadic, raised CPU usage from 06:00 to 07:00 despite the system being unused at that time, a spike at 11:00 followed by others over the course of an hour, and a minor increase from 13:00 to 14:00. The system was then completely idle until 19:00, when CPU usage ramped up; that ramp has persisted and is still continuing.

Throughout this time the system was unused entirely, by anyone, and I was actually asleep at 19:00 when the CPU usage started up again.

In the fifth/last photo, I ran ‘top’ on the SCALE GUI Shell and, sure enough, the top CPU culprit is middlewared which has 4 processes doing… whatever middlewared does on 4 processes. This is constant, continual, presumably has been the case since 19:00 in the above photo, and I’ll now need to go and reboot SCALE once again in order to reclaim the quarter of my CPU that middlewared is using.

This may all be moot or otherwise irrelevant once the next SCALE update comes out with those aforementioned fixes in it; we’ll have to see and hope. I probably won’t post in here again since it’d just be me reposting the same repeating behaviour, but hopefully it’ll encourage others to share data from their systems as well.
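
For anyone wanting to share comparable numbers, here is a minimal sketch of a sampler that logs available RAM and swap usage from /proc/meminfo as one CSV row. The function name `meminfo_sample` and the CSV layout are my own choices for illustration, not something SCALE ships.

```shell
#!/bin/sh
# meminfo_sample [FILE]: print "epoch,MemAvailable_kB,SwapUsed_kB".
# FILE defaults to /proc/meminfo (Linux only).
meminfo_sample() {
    awk -v now="$(date +%s)" '
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { total = $2 }
        /^SwapFree:/     { free  = $2 }
        END { printf "%s,%.0f,%.0f\n", now, avail, total - free }
    ' "${1:-/proc/meminfo}"
}

# Take a single sample if the file exists on this host.
if [ -r /proc/meminfo ]; then
    meminfo_sample
fi
```

Calling it once a minute (from cron, or a loop with `sleep 60`) and appending the output to a file gives a log that can be lined up against the Reporting graphs.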

Curious, what value does this command yield?

cat /sys/module/zfs/parameters/zfs_arc_max

I removed my post-init value from Cobia before updating to Dragonfish, and it is 0 for me on Dragonfish with default settings.

What does this reveal?

arc_summary | grep "Max size"

And how much physical RAM is available to the OS?

Total available Memory is 62.7 GiB

Interesting. So this “tweak” isn’t simply changing the parameter’s value upon bootup. They must have modified the ZFS code itself for SCALE?

Because “0” is the “operating system default”, which for upstream OpenZFS for Linux is 50% of RAM. However, even though you’re using “0” for the default… it’s set to exactly 1 GiB less than physical RAM. (AKA: The “FreeBSD way”.)
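
To put numbers on the two behaviours, here is a rough sketch that computes both ceilings for the local machine. It assumes a Linux host and uses MemTotal from /proc/meminfo, which is a little below installed RAM, so the results are approximate. The function names are hypothetical.

```shell
#!/bin/sh
# mem_total_kib [FILE]: print MemTotal in KiB (FILE defaults to /proc/meminfo).
mem_total_kib() {
    awk '/^MemTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Upstream OpenZFS on Linux: "0" means half of RAM.
linux_default_arc_max() { echo $(( $1 / 2 )); }

# Behaviour described in this thread for Dragonfish: RAM minus 1 GiB.
freebsd_style_arc_max() { echo $(( $1 - 1073741824 )); }

kib=$(mem_total_kib 2>/dev/null)
ram_bytes=$(( ${kib:-0} * 1024 ))
echo "RAM (MemTotal):   $ram_bytes bytes"
echo "50% of RAM:       $(linux_default_arc_max "$ram_bytes") bytes"
echo "RAM minus 1 GiB:  $(freebsd_style_arc_max "$ram_bytes") bytes"

# Current module setting, if the ZFS module is loaded (0 = built-in default).
param=/sys/module/zfs/parameters/zfs_arc_max
if [ -r "$param" ]; then
    echo "zfs_arc_max:      $(cat "$param")"
fi
```

Comparing these figures with the "Max size" line from arc_summary shows which of the two defaults a given system is actually using.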

@bitpushr, do you find any relief to these issues if you apply this “fix”, and then reboot?

Confirm the change is in effect (after you reboot) with this command:

arc_summary | grep "Max size"

Simply outputs “0”

Take a look.

What about this?

I know it will require a reboot, so whenever it’s convenient for you.

I have set it. There’s 64GB in my system as well, so I just copied the command from the post you linked and added it as a Post Init command. I can’t reboot the system currently, and after the reboot I’ll have to observe system behaviour for at least 24hrs to see if there are any differences.

Don’t do that! The user has 128 GiB of RAM, not 64!

You need to calculate what 50% of your RAM is to use for that value.
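
A quick way to work that out, as a sketch: take the installed RAM in GiB and convert half of it to bytes, since zfs_arc_max expects a byte count. The helper name `arc_max_for_gib` is made up for this example, and the echo line in the final comment is an assumption about what such a post-init command looks like, because the linked post is not quoted here.

```shell
#!/bin/sh
# arc_max_for_gib GIB: print 50% of GIB gibibytes, expressed in bytes.
arc_max_for_gib() {
    echo $(( $1 * 1024 * 1024 * 1024 / 2 ))
}

for gib in 16 64 128; do
    echo "${gib} GiB RAM -> zfs_arc_max=$(arc_max_for_gib "$gib")"
done

# Assumed form of a post-init command for a 64 GiB system (run as root):
#   echo 34359738368 > /sys/module/zfs/parameters/zfs_arc_max
```

So a value copied from someone with 128 GiB (68719476736) would be the whole of RAM on a 64 GiB machine, not half of it.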

Ah, true, nice catch. Done. Will chime back in probably in a couple of days’ time once I’ve found a window to reboot the system and give it a day or two to observe.

Thanks for this.
This will help to see if there are any specific patterns for when the issue occurs.

Just chiming in to update: have just rebooted the SCALE system after applying the previously recommended changes.

Running arc_summary | grep "Max size" gives a result of ‘Max size (high water): 16:1 32.0 GiB’

I’ll now leave the system completely unattended, as I normally would, for at least 24hrs before looking back at it and chiming back in.

62 Truecharts apps and 256 GB RAM.

I updated to Dragonfish (DF) yesterday and I’m having terrible performance issues, pretty much the same as on Cobia. I cleaned out all my snapshots and my system is stable apart from the apps. Editing and saving an app takes a long time. Often the whole UI crashes and is stuck for 5 minutes on the login loading page.

I have migrated another server with fewer apps and 32 GB RAM, and that server is snappy and responsive.

Please, iX, hear me out: fix the rubbish GUI. It doesn’t make any sense for something to feel this broken on 256 GB and 72 cores.

Just seeing this reply to my post now. FYI it now shows as “34360000000”, can reply once the post-init workaround is removed in the future if needed.
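
To compare a raw byte value like that with the GiB figure arc_summary prints, a small conversion helps. This is only a sketch, assuming a POSIX shell with awk; `bytes_to_gib` is a hypothetical helper name.

```shell
#!/bin/sh
# bytes_to_gib BYTES: print BYTES as GiB with two decimal places.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# The value reported above.
bytes_to_gib 34360000000

# The live module parameter, if ZFS is loaded on this host.
param=/sys/module/zfs/parameters/zfs_arc_max
if [ -r "$param" ]; then
    bytes_to_gib "$(cat "$param")"
fi
```

For 34360000000 this prints 32.00, which agrees with the 32.0 GiB "Max size" reported after the reboot.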

After upgrading to Dragonfish my system would lock up completely and stop responding to network and console when running Cloud Sync jobs (even a “dry run” would cause a crash). I am syncing with Backblaze and have the “--fast-list” option checked (I don’t know if that makes a difference, though).
Limiting ARC to 50% solved this, and the system now seems to be running stably again.
This is a ten-year-old server that has been absolutely stable through all upgrades of FreeNAS/TrueNAS. It has 16GB RAM and is running a number of docker containers (mariadb, grafana, nextcloud, influx, plex, etc.).

Dropping my own thread in as it looks to be related to swap as well. The issue with the boot drive can be ignored as it’s unrelated and I think that’s my own fault for rebooting the system so readily (though it certainly was an odd issue).

As I mentioned in the most recent post, swap usage is way up compared to Cobia, so I’m playing around with ARC limits to ensure I don’t hit this. I’ve also temporarily outright disabled swap to avoid running into issues while I work :stuck_out_tongue:

@kris @Captain_Morgan

Is zram a viable alternative to outright disabling swap?

I’ve had success with it on Arch Linux (non-server, non-NAS), but I’m wondering if it would serve SCALE users well?

  • No need for a swap partition / swap-on-disk
  • Anything that needs to be “swapped” will remain in RAM in a compressed format (ZSTD)

So under ideal conditions, it never gets used. However, to prevent OOM situations, there’s a non-disk safety net that should theoretically work at the speed of your RAM.

I’m not sure if there are caveats of zram in the presence of ZFS, however.
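
For reference, this is roughly how a zstd-compressed zram swap device is set up on a generic Linux box. It is a sketch only: I have not confirmed that the SCALE kernel ships the zram module or that iX supports this, and the 4G size is an arbitrary placeholder. By default the script only prints the commands; setting DRY_RUN=0 as root would actually run them.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}        # 1 = print commands only, 0 = execute them
ZRAM_SIZE=${ZRAM_SIZE:-4G}   # hypothetical size, pick to suit your RAM

# run CMD...: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_zram_swap() {
    run modprobe zram
    if [ "$DRY_RUN" = 1 ]; then
        dev=/dev/zram0       # placeholder device name for the dry run
        echo "+ zramctl --find --size $ZRAM_SIZE --algorithm zstd"
    else
        # zramctl prints the device it allocated, e.g. /dev/zram0
        dev=$(zramctl --find --size "$ZRAM_SIZE" --algorithm zstd)
    fi
    run mkswap "$dev"
    run swapon --priority 100 "$dev"   # prefer zram over any disk swap
}

setup_zram_swap
```

The high swap priority means the kernel fills the zram device before touching any on-disk swap that is still enabled. How compressed pages held in RAM interact with ARC sizing is exactly the open question above.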