RAM Size Guidance for Dragonfish

Just chiming in with an update: I've just rebooted the SCALE system after applying the previously recommended changes.

Running "arc_summary | grep “Max size” gives a result of ‘Max size (high water): 16:1 32.0 GiB’

I'll now leave the system completely unattended, as I normally would, for at least 24 hours before looking back at it and chiming back in.

1 Like

62 TrueCharts apps and 256 GB RAM.

Updated to DF yesterday and I'm having terrible performance issues, pretty much the same as it was on Cobia. I cleaned up all my snapshots and my system is stable apart from the apps. Editing and saving an app takes a very long time. Often the whole UI crashes and gets stuck for 5 minutes on the login loading page.

I have migrated another server with fewer apps and 32 GB RAM, and that server is snappy and responsive.

Please, iX, hear me out: fix the rubbish GUI. It doesn't make any sense that a system with 256 GB and 72 cores feels this broken.

Just seeing this reply to my post now. FYI, it now shows as "34360000000"; I can reply again once the post-init workaround is removed in the future, if needed.
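For anyone who wants to sanity-check a raw value like that, one way (assuming the standard OpenZFS-on-Linux module parameter) is to read it directly and convert the byte count:

```sh
# Read the current ARC maximum in bytes (OpenZFS module parameter)
cat /sys/module/zfs/parameters/zfs_arc_max

# 34360000000 bytes is roughly 32 GiB
numfmt --to=iec 34360000000
```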

After upgrading to Dragonfish my system would lock up completely and stop responding to network and console when running Cloud Sync jobs (even a "dry run" would cause a crash). I am syncing with Backblaze and have the "--fast-list" option checked (I don't know if that makes a difference, though).
Limiting ARC to 50% solved this, and the system now seems to be running stably again.
This is a ten-year-old server that has been absolutely stable through all upgrades of FreeNAS/TrueNAS. It has 16 GB RAM and is running a number of Docker containers (mariadb, grafana, nextcloud, influx, plex, etc.).
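For reference, here's a minimal sketch of what capping ARC at 50% on a 16 GB box looks like on Linux OpenZFS; the exact mechanism is an assumption on my part, and on SCALE this would normally go into a post-init script so it survives reboots:

```sh
# Cap ARC at 8 GiB (50% of 16 GiB); the value is in bytes
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max

# Confirm the new ceiling took effect
arc_summary | grep "Max size"
```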

1 Like

Dropping my own thread in as it looks to be related to swap as well. The issue with the boot drive can be ignored as it’s unrelated and I think that’s my own fault for rebooting the system so readily (though it certainly was an odd issue).

As I mentioned in the most recent post, swap usage is way up compared to Cobia, so I'm playing around with ARC limits to make sure I don't hit this. I've also temporarily disabled swap outright to avoid running into issues while I work :stuck_out_tongue:
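In case it helps anyone else experimenting, temporarily disabling swap is just the usual commands (it comes back on reboot unless the swap partition itself is removed):

```sh
# Deactivate all swap devices for the current boot
swapoff -a

# Verify nothing is listed any more and check memory headroom
swapon --show
free -h
```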

@kris @Captain_Morgan

Is zram a viable alternative to outright disabling swap?

I've had success with it on Arch Linux (non-server, non-NAS), but would it serve SCALE users well?

  • No need for a swap partition / swap-on-disk
  • Anything that needs to be “swapped” will remain in RAM in a compressed format (ZSTD)

So under ideal conditions, it never gets used. However, to prevent OOM situations, there’s a non-disk safety net that should theoretically work at the speed of your RAM.

I'm not sure whether there are caveats to zram in the presence of ZFS, however.
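For anyone curious what this looks like in practice, here's a minimal manual sketch of zram-as-swap on a stock Linux box (device name, size, and zstd support are assumptions; distros usually wrap this in zram-generator or a similar service):

```sh
# Load the zram module and create a compressed block device
modprobe zram

# "--find" picks the next free /dev/zramN; 8G is the *uncompressed* cap,
# actual RAM use grows only as pages are swapped out and compressed
zramctl --find --size 8G --algorithm zstd

# Format and enable it as swap with a higher priority than any disk swap
mkswap /dev/zram0
swapon --priority 100 /dev/zram0
```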

2 Likes

Love zram. You'd still have to define where Linux swap space is, but it could use zram, I suppose. It's very, very fast. It's a souped-up tmpfs of sorts.

Interesting!

1 Like

So. Basically a RAM doubler :wink:

1 Like

I'm not familiar with zram. Does that mean you are essentially allocating space from RAM as a RAM disk and then asking Linux to swap to that RAM disk? I think I've done the same in Windows, where I placed a page file under PrimoCache's L1 cache and deferred-write partition. A long time ago PrimoCache had an issue with looped swap ending in a BSOD crash, but they fixed it and it apparently works without any issue now. I guess the same could apply here, if my interpretation is correct.

It takes up 0% extra RAM if nothing is being swapped. And if anything needs to be swapped, it's kept in a compressed format (still in RAM) to make room for other data/services that require RAM.

Therefore, nothing is preallocated in RAM. Whatever you set the zram "size" to, it always starts at 0% and can grow or shrink dynamically, based on whether things are being "swapped" in or out.

Think of it as dynamic swap that stays in RAM (compressed), where nothing is swapped to disk.
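You can see that behaviour directly: zramctl reports both the logical amount "swapped" (DATA) and what it actually occupies in RAM after compression (COMPR):

```sh
# List zram devices with their compressed vs. uncompressed usage
zramctl

# Swap devices currently in use, including any zram entries
cat /proc/swaps
```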

1 Like

Interesting… I see, so it's more or less like a dynamically sized RAM disk; except that if what we're dealing with is a memory leak, then the zram would grow non-stop? :grimacing:

1 Like

If you’re dealing with a bad memory leak, then you’re screwed either way. :wink:

1 Like

Current focus is on trying to find a quick fix to the RAM issues in 24.04.0.

We've had one potential case in RC.1, but tens of cases in .0 (out of 20,000). I assume it's just that they are more complex and heavier workloads, with more use of swap.

We’ll get some test results soon.

2 Likes

No, you do configure a maximum size, so it won't grow endlessly.

1 Like

It's worth noting that none of the TrueNAS Enterprise systems exhibit these issues under extreme load testing. However, there are notable differences:

  1. All systems have adequate RAM for their workloads
  2. Swap is disabled (because there is adequate RAM)
  3. We don’t run unverified Apps on Enterprise appliances

We are reviewing which of these is most important. Betting is allowed, but it would have to be on another web site.

2 Likes

I would only comment that swap is normal and used widely in Debian and derivatives. If swap itself is actually causing the issue, then there's still a bug somewhere in TrueNAS. I seriously doubt there is a bug in the swapping code, though there certainly has been in the past, and obviously new versions of things are on SCALE and could introduce one. But then, outside of SCALE, tons of people around the world not using TrueNAS would be reporting this, I would think. More likely there is a memory leak somewhere, maybe even in ZFS, unless iX modified the Linux code. That's the normal reason for swap usage and thrashing. The only problem with that is that disabling swap thus far seems to resolve it, but I'd love to see some arc_summary outputs posted while the problem is occurring, along with memory reporting at the time. I don't envy you guys, Captain!

My understanding is that OpenZFS rewrote at least some of the ARC code and how it adapts; I believe I saw that last year. If that version is being used (and it should be), there could obviously be issues there in some rare cases. There are so many possibilities!

So, of the 3 choices: I believe we saw this happen with SnowReborn, or whoever had the 1 TB of memory? So many posts. I would say rule out #1 (me guessing). #2 does seem to resolve it thus far; I haven't heard anyone say the problem came back. I believe that if that is the answer then there's still another bug somewhere, but it would still qualify as the answer. I don't think it's #3, so I vote #2. Though, as noted, I believe that should not happen, and the actual problem (not the symptom) lies elsewhere. Glad it seems to be rare.

1 Like

I think it's clear that if someone wants to repro, then swap should be enabled when trying to repro.

And then look for excessive swap partition disk utilization.
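A simple way to spot that while reproducing (assuming standard tools; iostat needs the sysstat package) is to watch swap-in/swap-out rates and the I/O on the disk holding the swap partition:

```sh
# si/so columns show pages swapped in/out per second; sustained non-zero
# values during a UI hang point at swap thrashing
vmstat 1

# Per-device I/O statistics, including the disk holding the swap partition
iostat -x 1
```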

1 Like

Evidence seems to be that there are two safe modes:

  1. ARC=50%… Swap can be enabled
  2. ARC = 9x%… Swap should be disabled

The 3rd mode (ARC = 9x%, swap enabled) seems to be sensitive to the applications' use of memory.

This makes some sense… ZFS ARC hogs the RAM & forces apps and middleware into Swap space.

3 Likes

What about this, which we're still waiting to hear some experiences with:

  1. ARC Max = “FreeBSD-like” (Dragonfish default, RAM - 1 GiB) + swap enabled + “swappiness” set to “1”

EDIT: Or maybe not. It’s already not looking like a viable option.
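For anyone who still wants to experiment with that combination, the swappiness part is just a sysctl (not persistent across reboots unless written to a sysctl config):

```sh
# Tell the kernel to strongly prefer reclaiming page cache over swapping
sysctl vm.swappiness=1

# Verify
cat /proc/sys/vm/swappiness
```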

I have swap enabled and ARC at 70% or so (memory fails me); the "safe modes" would be system dependent of course, but I get your point. That's how it's always been done outside of TrueNAS: the admin picks a number. But you wanted to improve on that and let it handle itself. Without swap, though, I don't see how you can have zero OOM errors under memory pressure with ARC filling RAM. You must have a way to avoid that; it was always a problem in the old days.

ZFS does hog the RAM, of course. If ZFS takes it first and something else needs it (which in the past was rare, except for things like VMs), ZFS can't evict fast enough, which means swap. If your swap space is on an HDD, then it's not fast at all. I know they made changes such as this and others with the ARC:

There are still lots of OOMs happening in OpenZFS. A partial list of interesting ones is below (not the ones about corruption, as that is different), but the proof, of course, is your distribution base, and if swapoff resolves all of the issues without OOMs, then that's OK, I guess. You have allowed ARC to grow to almost the full memory size, so this is much different from the way people have typically run ZFS. Time will tell for sure!

While your enterprise customers have no issues and all have plenty of memory, my concern would be for the armchair Plex / converted ten-year-old home computer guys with very little memory. They can be the ones with plenty of overuse. I guess guidelines can be changed, etc. If swap is simply incompatible now with a ZFS that is allowed to fill RAM with ARC, then your installer shouldn't be adding it or asking to add it anymore. But other than telling people here, what about the people who don't use the forums? How will they know to disable swap? Or maybe you can make it part of an update?

2 Likes