Dragonfish swap usage/high memory/memory leak issues

In retrospect (per my previous post), hopefully this results in iX changing some QA parameters for future SCALE releases. Presumably there’s a handful of iX machines reserved for QA use, so changing the setup on some of those to reflect how things are done on DIY/Generic SCALE systems would help avoid this in the future.

I get only wanting to do first-party testing on hardware and software combinations that iX actually sells, but if I had to guess I’d say the vast majority of SCALE’s userbase is on DIY/Generic systems, so how those are set up, run, and maintained should ultimately be accounted for as part of the QA or validation process.

Alternatively, align SCALE’s minimum/recommended specs with those of hardware on which iX can reliably ensure performance. (This would still have left responsibility for this issue with iX QA, since there wouldn’t have been a scenario where adequate RAM wasn’t available to the system, given that ZFS appeared to just be using all available system RAM minus 1GB.) Or (hot take) drop DIY/Generic SCALE support, or branch it out into “SCALE for DIY”, if it’s simply too much work for iX to validate against.

2 Likes

I don’t want to derail the discussion, but maybe swap should be disabled by default for everyone? It could be optionally enabled by users who don’t provision enough RAM, and then you don’t have to provide support for them, since enabling it would flag the configuration as unsupported. I don’t see much benefit in degrading performance so badly that the system is completely unusable (with swap) versus the OOM killer kicking in to free memory (no swap).

For users with plenty of memory, the pattern shows up as a huge ARC with swap 100% used. I could see something was wrong when my system with 128 GB RAM had 80-90 GB of ARC but swap was 100% used and the system was very slow. Hindsight is 20/20 but just a pattern to consider for the future.
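For anyone who wants to watch for that pattern on their own box, a rough sketch (assuming shell access; free and the OpenZFS arcstats kstat file are both present on SCALE):

  # Swap usage at a glance
  free -h
  # Current ARC size in GiB, read from the OpenZFS kstats
  awk '$1 == "size" {printf "%.1f GiB of ARC\n", $3 / 2^30}' /proc/spl/kstat/zfs/arcstats

A huge ARC paired with a full Swap row in free is the combination to watch for.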

With the above LRU fix, the “swap” issue is no longer an issue.

Swapping was a symptom, not a cause, of the underlying issue.

That’s why this combination makes the most sense for everyone, and does not hinder performance:

  • No 50% ARC limit
  • LRU fix
  • swappiness = 1

It also allows swap to be used in those dire situations where your RAM cannot handle extreme/peak memory demands.

Besides, even in this discussion there are those who argue in favor of swapping in general, even if your RAM is not hitting its limits.
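Applied by hand, that combination looks roughly like this (a sketch, run as root; the lru_gen toggle is the same command quoted at the end of this thread, and none of these persist across reboots on their own):

  # Remove any explicit ARC cap (0 = fall back to the ZFS default)
  echo 0 > /sys/module/zfs/parameters/zfs_arc_max
  # Disable the kernel’s multi-gen LRU (the “LRU fix” workaround)
  echo n > /sys/kernel/mm/lru_gen/enabled
  # Only swap as a last resort
  sysctl vm.swappiness=1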

3 Likes

I understand that, but the point was brought up that problem reports were discounted by iX because swap was involved, while they explicitly disable it on their own systems. If there is a common, supported configuration without swap, that issue becomes irrelevant.

It would have been pretty obvious something was very wrong if no swap was configured, 90 GB out of 128 GB total RAM was used for ARC and the system was repeatedly invoking the OOM killer, right?

They could optionally enable swap, and iX could then selectively ignore their problem reports.

I think there’s a little bit of “trying to have it both ways” going on here on iX’s part. On the one hand, they don’t want to raise the minimum spec too much, to avoid scaring off prospective users[1]. But on the other hand, there seems to be some (understandable) dismissiveness of inadequately-resourced systems.


  1. We haven’t seen it too much in recent years (because this is 2024, and 8 GB just isn’t that much any more), but I remember the hue and cry when the “minimum requirement” for FreeNAS was increased to 8 GB. ↩︎

2 Likes

The fix is being tested right now for .1, but fundamentally we had two different issues here, which users were hitting in a variety of ways.

First, we had the bad ARC/kernel LRU behaviors, where memory was being pushed out into swap space too aggressively, causing poor system performance and stability issues.

Second, swap in general was another issue by itself. The issue there was twofold.

  1. We don’t test with swap on enterprise hardware; it is disabled there.

  2. Swap, the way it was originally implemented back in the day, gave users the option to put a swap partition on boot devices. The ARC issue only exposed one aspect of this design problem. What happens in general if you have swap usage thrashing your boot device? Turns out the answer is all sorts of no-good things, including performance and stability issues and other random, undefined, exotic behaviors that mask the real issue. Never mind that many systems don’t use quality boot devices, and all that constant extra write load can wear them out too quickly.

After consideration of both sets of real-world problems, in .1 we will disable swap as a default AND enable the LRU / swappiness fixes. Users who want that safety cushion of on-disk space can re-enable it if they so desire. When re-enabled, the ARC LRU / swappiness settings will be in effect and will hopefully prevent swap from being used too aggressively, but remember: once you do start to swap too much, you are opening the door to a whole host of other unpredictable behaviors.

While ARC brought the issue to the forefront, we still have too many other reports of general instability where swap is heavily utilized. Apps / VMs being used more often on SCALE I’m sure helps drive this behavior. Turns out you generally don’t want to deploy 20+ applications on a system with only 16GB of memory. :wink:
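If you want to confirm what your own system is doing after the update, a quick read-only check works (this is plain Linux tooling rather than anything SCALE-specific, so treat it as a sketch):

  # No output here means no swap device is currently active
  swapon --show
  # Current swappiness value
  cat /proc/sys/vm/swappiness
  # Multi-gen LRU state; 0x0000 means it is disabled
  cat /sys/kernel/mm/lru_gen/enabled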

4 Likes

@Kris If a system is currently configured with swap (as it is/was the default during install) and a person wishes to no longer have swap going forward, what would be the least intrusive and safest (data-wise) path to disabling swap and removing the swap partitions?

At the moment for .1, I’d say we first give it some soak time. The partitions being present won’t necessarily hurt anything on the system or boot device; they just don’t get utilized unless the user explicitly re-enables them. Longer term, we may either auto-remove unused ones or provide a clear guide on how to remove them safely, once we’ve written up and tested proper procedures for doing so.
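In the meantime, if you just want to see what’s sitting there without touching anything, a read-only look along these lines is safe (assuming the standard util-linux tools, which SCALE ships):

  # Flag any partitions formatted as swap
  lsblk -o NAME,SIZE,FSTYPE | grep -i swap
  # Confirm whether any of them are actually in use
  swapon --show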

2 Likes

To confirm: an already existing install will have swap disabled after installing 24.04.1, i.e. it’s not limited to fresh installs from the .iso?
(the partitions will stick around, but that’s a minor issue)

That sounds like a good path to work forward.
Thanks!

Yes, both upgrades and fresh installs will default to disabled, until the user explicitly enables it.

4 Likes

We are having it both ways - deliberately

What we sell and support is professional or enterprise quality. RAM costs are not an issue; 32GB is the minimum we use.

However, we understand that in personal/home lab applications, TrueNAS is used on lower cost or 2nd hand hardware. We verify 8GB works for basic storage. However, it is limited for VMs and Apps and won’t perform well with a lot of drives. The guidelines are reasonable.

3 Likes

Perhaps SCALE could pop up an alert when the user attempts to open the Apps or VM page, or otherwise initialise/start using Kubernetes, on a system with <16GB of RAM. It doesn’t have to be anything drastic, just a note that such a limited amount of RAM, given everything SCALE is trying to do, could result in performance issues. I think it’s pretty reasonable to assume almost any system with less than 16GB of RAM is running 8GB, except for that group of people running SCALE on triple-channel RAM platforms with 12GB.

The problem is that SCALE, arguably unlike CORE, isn’t only focused on serving basic storage needs. Buried way down in the Scale Hardware Guide are notes about needing to add more RAM beyond basic storage needs, but your average personal/homelab user isn’t going to find that, and SCALE is presently seeing what I’d assume is a large number of users migrating from CORE (traditionally a basic storage-centric platform) to SCALE (something with a lot more flexibility).

Re: the guidelines, I have to dig down into the above-linked part of the SCALE docs to find recommendations for deciding how much RAM to throw SCALE’s way. The guidelines are, technically, reasonable, but they’re not easily communicated to someone who can just go to iX’s website, download SCALE, and start using it. I’d be much more inclined to think it through if the minimum RAM requirements instead said “8GB for basic storage functions, 16GB+ when adding apps/VMs”.

Tying back into the above, it might generally be nice for iX/SCALE to add some additional basic considerations for the audience that isn’t in a professional or enterprise environment (and for those who think they’re professionals just because they homelab, but aren’t): amending the minimum system requirements to briefly account for the various things you can stack on top of basic NAS functions with SCALE, some GUI pointers, etc.

These considerations are written from the perspective of someone who always ends up needing to edit TrueCharts documentation to add a degree of hand-holding far beyond what I’d imagine any user savvy enough to run SCALE and install apps on it would need.

When I was trying things out, it certainly did pop up a warning about memory when I looked at installing apps. I quickly backed out…

2 Likes

Ah, fair enough then. It’s been years (since Angelfish) since I’ve had to set up apps.

I am one such user: with 10GB of memory, 5x 4TB drives, a few apps using (say) 1GB of memory, and no VMs, I still get a 99.5% ARC hit ratio!! I guess there is enough memory to cache the ZFS metadata and a lot of the Plex metadata, and the media streams benefit from read-ahead.
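If you want to check your own ratio, this is roughly how I read that number, using the cumulative hit/miss counters from the OpenZFS kstats (arc_summary reports the same figure where it’s installed):

  # Lifetime ARC hit ratio since boot
  awk '$1 == "hits" {h = $3} $1 == "misses" {m = $3} END {printf "%.1f%%\n", 100 * h / (h + m)}' /proc/spl/kstat/zfs/arcstats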

So, I would agree that a minimum memory of 8GB will still likely be sufficient for reasonable performance from a relatively small amount of disk space - and I think that @kris and @Captain_Morgan are giving us good guidelines.

4 Likes

I just did a “sudo swapoff -a” and (as expected) swap went from 0.69GB to zero, and cache went from 4.45GB to 3.95GB (almost as expected).

I will report back as to whether this has a noticeable impact on the ARC hit ratio once I have had enough usage to tell.
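For that comparison I’ll snapshot the raw counters now and diff them after a few days of use, since they are cumulative since boot; roughly like this, with the same kstat file as above (the output filename is just an example):

  # Save the cumulative hit/miss counters now; diff against a later copy
  # to get the hit ratio for just the interval in between
  awk '$1 == "hits" || $1 == "misses" {print $1, $3}' /proc/spl/kstat/zfs/arcstats > ~/arcstats-before.txt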

Thanks to @everyone for the feedback so far. I appreciate all the fixes and good work that ixSystems has put into 24.04.1 to correct the biggest issues, but I am still unclear on what issues remain and whether it will be stable enough for me to upgrade, or whether I should wait for 24.04.2…

Software Status - TrueNAS Roadmap - Open Source NAS Software

You can generally follow the guidance here for “Conservative” if that adjective more closely matches your risk tolerance.

1 Like

Thanks for posting this. I just migrated to Dragonfish 24.04.1.1 and came upon this memory usage issue.

To be clear, if I run the shell command you posted, this resolves the issue 100% for now, correct?

# Disable the kernel’s multi-gen LRU (run as root; the setting does not persist across a reboot)
echo n >/sys/kernel/mm/lru_gen/enabled