Dragonfish swap usage/high memory/memory leak issues

:point_up:

This would be the best long-term solution.

It’s good to allow the system to swap in those extreme situations where memory demands are too high and a safety cushion is needed to prevent a system crash or an OOM kill.

Even though I’m on Core (FreeBSD), I don’t disable swap, even if it’s never (or rarely) needed. The cost is inconsequential, yet it could save my system in those rare (but possible) situations.

Best of both worlds for SCALE: ARC behaves as it was meant to, without restriction and without causing system instability, yet there is a swap available if it’s ever needed.

@mav: To complement this fix (disabling lru_gen), wouldn’t it also make sense to set swappiness=1 as the default for SCALE?

3 Likes

The swappiness=1 setting is quite orthogonal to this issue. It balances page cache eviction between swapping out anonymous pages and writing back file-backed pages, while the problem here is the balance between the page cache in general and ARC, which this tunable does not affect. That is why it did nothing for this issue by itself when tried.
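(For reference, swappiness is an ordinary sysctl, so a quick sketch for checking and lowering it looks like the following; the change is non-persistent and resets on reboot.)

# Read the current value (the Linux default is 60)
sysctl vm.swappiness
# Lower it to 1 for the current boot only
sysctl -w vm.swappiness=1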

2 Likes

Hence my suggesting it as something additional, to complement the changes/fixes you’ve mentioned. (Since even @kris would go so far as to disable swap entirely, regardless of anything else.)

I think swappiness=1 would make for an ideal default[1], with the above lru_gen fix.


  1. As opposed to leaving it at the default of 60, which I don’t think anyone considers appropriate for a NAS. ↩︎

Formally, I could say that some never-used memory may be worth swapping out to free memory for better use. But I have no objections.

2 Likes
root@titan[~]# cat /sys/kernel/mm/lru_gen/enabled
0x0007
root@titan[~]# echo n >/sys/kernel/mm/lru_gen/enabled
root@titan[~]# cat /sys/kernel/mm/lru_gen/enabled    
0x0000

Does that look right?

And I assume this should actually be done as a pre-init script?

EDIT: reading the docs,

https://docs.kernel.org/next/admin-guide/mm/multigen_lru.html

It appears that “echo 0” (zero) would be more correct.
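If it helps, here’s a minimal sketch of what that pre-init step could look like (assuming SCALE’s Init/Shutdown Scripts facility under System Settings > Advanced, run as a pre-init script):

#!/bin/sh
# Disable the multi-generational LRU. Writing 0 clears all feature bits;
# per the kernel docs, [yYnN] applies to all the components, so "n" has
# the same effect (hence the 0x0000 readback above).
echo 0 > /sys/kernel/mm/lru_gen/enabled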

1 Like

I rebuilt a VM this morning with a fresh install of 24.04, restored my config, moved the HBA over with my datasets, and did this.

echo n >/sys/kernel/mm/lru_gen/enabled

All up and running fine at present. Before, it would fail in a matter of hours, so I should be able to tell pretty quickly. With swap turned off it worked fine over the last few days, so I’m interested to see how this goes. Will report back.

Thanks
CC

4 Likes

It’s survived 6 hours, and current performance is great. The GUI is responsive during large transfers, which it was not before, and htop is not showing any swap being used.
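(For anyone who wants to double-check swap usage without htop, a quick sketch with standard tools:)

# Total memory and swap at a glance
free -h
# Per-device swap usage, if any swap is active
swapon --show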

So far so good

CC

5 Likes

Has anyone else noticed that, since trying these fixes, Dragonfish still won’t use up to the ARC max? When I downgrade to Cobia it quite happily uses the whole 32GB (50%) and sticks to it, but when I either disable swap in Dragonfish or disable lru_gen, it starts high and then becomes very reluctant to use ARC at all.

This is on a machine with 64GB of RAM

Another example

The sections with the flat line are when the system is running Cobia; the sections with ARC all over the place are Dragonfish.
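For anyone comparing the two releases, a small sketch for reading ARC’s current size against its ceiling straight from the ZFS kstats (field positions assume the standard three-column arcstats layout):

# Print the ARC's current size and configured maximum, in GiB
awk '/^(size|c_max) / { printf "%s: %.1f GiB\n", $1, $3 / (1024^3) }' /proc/spl/kstat/zfs/arcstats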

2 Likes

Have applied this to my SCALE system just now after a fresh reboot; will see how it goes. In case it’s of any relevance: after seeing the latest developments in this thread, I was in the process of posting back to the other thread and removing the previous ZFS RAM usage limit I’d put in place, but disabling swap had seemingly resolved the issue entirely for my system, which has been stable for roughly 5 days now without issue.

I’ve since re-enabled swap, double-checked there’s no ZFS RAM usage limits in place anymore, and will see how the system goes with this latest fix over the coming days.

I’m hoping that, from a retrospective/process-improvement standpoint, this issue results in more thorough testing/QA or more investigation from iX prior to future releases, especially for major SCALE releases that involve things like large Linux kernel version jumps. I won’t dwell on it since it looks like a fix is incoming, but I would like to reiterate that these issues were reported by people in multiple Jira tickets in the days (and in one case, weeks) prior to DF’s release.

Some of these tickets were, IMO, ignorantly closed as being a TrueCharts issue when that wasn’t the case. DF’s release should’ve been pushed back by 1-2 weeks while investigations occurred. This, coupled with Intel Arc GPU drivers ultimately not being included in the release (a large factor in many people’s desire to upgrade or migrate to DF), has marred what could’ve been the best SCALE release to date.

Here’s hoping ElectricEel is an electrifying one.

7 Likes

Very well said, @bitpushr. :clap:

Got any links to those Jira tickets where this was reported before we publicly released 24.04.0?

After more thinking, we’ve included both in the 24.04.1 release. They should be in the next Dragonfish nightly build.

4 Likes

Hindsight is always 20/20. This isn’t criticism of anyone, nor iXsystems in general.

I think the lesson is that there’s probably good reason to investigate further when performance issues arise after upgrading from one version to the next, especially across a major kernel version upgrade.

It’s hard for iX staff to reproduce, since a handful of people cannot emulate the different workloads or “eyes” of the community users.

Just as an example, what might alert someone to a problem is they notice their drives are “noisier” than they remember, or their CPU is more “active” than normal. This can be brushed aside by someone else as “Oh, that’s normal. You only noticed it recently.”



Looking back at these now, it’s likely they were interrelated to the RAM/Swap/ARC/LRU issue that was introduced with the newer Linux kernel.

No one’s a psychic.

It’s just that when someone is “familiar” with the behavior of their own system, it might be a canary in a coal mine for someone with more expertise to investigate further, especially if there’s a common denominator (such as Dragonfish RC1) and it’s more than just one person reporting it, across different types of systems.

2 Likes

Twenty minutes to import your pools? Must be TrueCharts. At least they didn’t close my ticket on that.

:upside_down_face: ahh yes, constructive criticism can’t be given without destructive criticism I guess.

BTW, my boot disk has cooled down since disabling swap. The CPU fan no longer kicks in to cool it down; with the heatsink, it was kicking in periodically (as opposed to non-stop).

Heh. In other words, the first symptom I had was TrueNAS throwing critical drive temp alerts; I just assumed the feature to throw them was new in Dragonfish.

1 Like

100% agree with this, and cheers for linking those tickets, winnie; they were most of the ones I’d come across and (having just checked now) were in my browser’s history.

There was also this one. Initially I think Caleb was right to suggest multiple additional troubleshooting steps; the user then replied, and the ticket went unanswered for a week before Caleb responded again.

This time the response started by asking the user if they’d done any troubleshooting, which was fair, since the user didn’t indicate the results of what Caleb had asked them to do. But the reply then moved on to the comment I left on the ticket, treating it as anecdotal (we had multiple users reporting the DF performance issues in the TrueCharts Discord less than a day after DF’s release, and we started directing them to open tickets with iX) and dismissing it because “we have >7k installs of RC1 in the wild…”. Whilst that might have been true, in hindsight it didn’t really matter; these issues came about after DF’s release despite there being thousands of RC1 installs in the wild.
I’d wager it’s pretty safe to assume that in our Discord of over 13k members, if a wave of people start posting with a common issue across varying hardware setups, with the DF update as the common factor amongst them, my comment (or the comment of someone else saying the same thing) shouldn’t have been dismissed. In hindsight, I could’ve linked to messages/posts from those users to provide additional context, which I’m happy to reflect on.

The closing comment, “it also doesn’t help that your debug was incomplete because the system is so slow to respond”, should have at least prompted further investigation from Caleb/iX’s side. The ticket has since gone unanswered and thus been closed; it’s ultimately only supplementary to the ones winnie linked above, which are the primary pointers.

As I said before, I’m not going to bang on about it or anything; winnie already worded it perfectly, and I’m glad a fix has been found and will be included in .1’s release. We’re all ultimately here because we like the work that’s being done (or at least most of us do) and want TrueNAS to grow better every day.

Onwards and upwards.

2 Likes

You (plural, i.e., iX) are spring-loaded to blame TrueCharts, even for issues which manifestly don’t involve them; my ticket is one example among many of this. If you choose to dismiss that as “destructive criticism,” I guess that’s your call.

And most interactions (not all) I’ve had with TrueCharts users are spring-loaded to speak “matter-of-factly” on subjects while condescending to developers/iXsystems. And this isn’t anecdotal: go to the TrueCharts Discord and you’ll see an environment that seems to actively encourage disparaging TrueNAS’s/iXsystems’ hard work. You also seem to be REALLY hung up on the fact that your zpool now takes an inordinate amount of time to import after upgrading to DF. We have your ticket; it’s assigned to an engineer, it’s still open, and we will hopefully investigate the issue. We haven’t blamed anything on TrueCharts for your particular use case. But hey, if an open ticket, assigned to an engineer for investigation, somehow gets interpreted by you as a “TrueCharts issue”, I guess that’s your call.


4 Likes