Dragonfish swap usage/high memory/memory leak issues

Got any links to those Jira tickets where this was reported before we publicly released 24.04.0?

After further thought, we’ve included both in the 24.04.1 release. They should be in the next Dragonfish nightly build.

4 Likes

Hindsight is always 20/20. This isn’t criticism of anyone, nor iXsystems in general.

I think the lesson is that there’s probably good reason to investigate further when performance issues arise after upgrading from one version to the next, especially when it involves a major kernel version upgrade.

It’s hard for iX staff to reproduce, since a handful of people cannot emulate the different workloads or “eyes” of the community users.

Just as an example, what might alert someone to a problem is noticing that their drives are “noisier” than they remember, or that their CPU is more “active” than normal. This can be brushed aside by someone else as “Oh, that’s normal. You only noticed it recently.”



Looking back at these now, it’s likely they were related to the RAM/Swap/ARC/LRU issue introduced with the newer Linux kernel.

No one’s a psychic.

It’s just that when someone is “familiar” with the behavior of their own system, it might be a canary in a coal mine for someone with more expertise to investigate further. Especially if there’s a common denominator (such as Dragonfish RC1) and it’s more than just one person reporting it, across different types of systems.

2 Likes

Twenty minutes to import your pools? Must be TrueCharts. At least they didn’t close my ticket on that.

:upside_down_face: ahh yes, constructive criticism can’t be given without destructive criticism I guess.

BTW, my boot disk has cooled down since disabling swap. The CPU fan no longer kicks in to cool it down; with the heatsink it was kicking in periodically (as opposed to non-stop).
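For anyone wanting to try the same, a minimal sketch using standard util-linux commands (on SCALE the middleware manages swap, so this is temporary and reverts on reboot):

```sh
# Show current swap devices and how much is in use
swapon --show
free -h

# Temporarily disable all swap devices (reverts on reboot)
swapoff -a
```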

Heh. In other words, the first symptom I had was TrueNAS throwing critical drive-temperature alerts; I just assumed the alerts were a new feature in DragonFish.

1 Like

100% agree with this, and cheers for linking those tickets, winnie; they were most of the ones I’d come across and (having just checked) are still in my browser’s history.

There was also this one. Initially, I think Caleb was right to suggest multiple additional troubleshooting steps; the user then replied, and the ticket went unanswered for a week before Caleb responded again.

This time the response started by asking the user if they’d done any troubleshooting (fair, since the user hadn’t indicated the results of what Caleb asked them to do), but the reply then treated the comment I’d left on the ticket as anecdotal and dismissed it because “we have >7k installs of RC1 in the wild…”. We’d had multiple users reporting the DF performance issues in the TrueCharts Discord less than a day after DF’s release, and we started directing them to open tickets with iX. While the installs figure may have been true, in hindsight it didn’t really matter; these issues surfaced after DF’s release despite there being thousands of RC1 installs in the wild.
I’d wager it’s pretty safe to assume that in our Discord of over 13k members, if a wave of people start posting about a common issue across varying hardware setups, with the DF update as the common factor among them, my comment (or the same comment from anyone else) shouldn’t have been dismissed. In hindsight I could have linked to the messages/posts from those users to provide additional context, which I’m happy to reflect on.

The closing comment, “it also doesn’t help that your debug was incomplete because the system is so slow to respond”, should at least have prompted further investigation from Caleb/iX’s side. The ticket has since gone unanswered and thus been closed, and it’s ultimately only supplementary to the ones winnie linked above, which are the primary pointers.

As I said before, I’m not going to bang on about it; winnie already worded it perfectly, and I’m glad a fix has been found and will be included in .1’s release. We’re all ultimately here because we like the work that’s being done (or at least most of us do) and want TrueNAS to get better every day.

Onwards and upwards.

2 Likes

You (plural, i.e., iX) are spring-loaded to blame TrueCharts, even for issues which manifestly don’t involve them; my ticket is one example among many of this. If you choose to dismiss that as “destructive criticism,” I guess that’s your call.

And most interactions (not all) I’ve had with TrueCharts users are spring-loaded to speak “matter-of-factly” in ways that condescend to developers/iXsystems. And this isn’t anecdotal; go to the TrueCharts Discord and you’ll see an environment that seems to actively encourage disparaging TrueNAS’s/iXsystems’ hard work. You also seem to be REALLY hung up on the fact that your zpool now takes an inordinate amount of time to import after upgrading to DF. We have your ticket, it’s assigned to an engineer, it’s still open, and we will hopefully investigate the issue. We haven’t blamed anything on TrueCharts for your particular use case. But hey, if an open ticket, assigned to an engineer for investigation, somehow gets interpreted by you as a “TrueCharts issue”, I guess that’s your call.

[Image: bitpushr-psychic]

4 Likes

I don’t think my concern there is unreasonable (and I note I’m not the only one affected by whatever’s causing it), particularly when that appears to be the root cause of the apps’ problems[1],[2]. And I’m aware that it’s assigned, and I presume more is happening behind the scenes than is noted on the ticket. I’m referring more to the first several comments on that ticket, which went straight to “TrueCharts” even though I thought the pool import issue was pretty clear even then[3].

And yes, I’m also aware that at least some on the TC side don’t play well with iX. I think it’d be in everybody’s best interests (especially the users’) that iX and TC work better together, and I suspect there’s background that I’m not aware of, so I’m not assigning blame anywhere on that question. But I’d very much like to see the situation improve.


  1. And I’ll freely admit I’m not an expert in the inner workings of these things, but “apps pool isn’t imported when the system finishes booting” and “k3s can’t start because the relevant datasets aren’t available” sure seem like they’d be related. ↩︎

  2. That TrueCharts have decided that only Dragonfish will be a supported platform is an aggravating factor, but definitely not your fault. ↩︎

  3. Again, I could be wrong, but if the thinking is that I am wrong, I think addressing that would have been helpful: “I see you’re concerned about the time to import your pool, but I don’t think that’s the fundamental issue here because…” As I don’t have any such explanation, I’m left thinking that’s where the issue is. ↩︎

This was enlightening, and I understand your point of view a bit better. Your ticket is assigned to someone else, but I’ll keep an eye on it. Your particular problem (zpool import taking > 20 mins; I looked at the logs on the ticket) is perplexing, and to make matters worse, there really aren’t many people reporting the same problem (unless I’m missing something). However, the other user reporting similar issues also seems to have differently sized vdevs in a zpool. I have 0 clue if that’s even remotely related, but it’s an obvious similarity that I’ve seen.

1 Like

That seems to be working. Thanks.

Thanks to everyone for the guidance in addressing these performance issues in Dragonfish. I read the release notes before upgrading from Cobia and noticed the removal of the ZFS ARC 50% RAM limit. After upgrading to Dragonfish and noticing extremely slow performance after a couple of days, with swap full, I had a suspicion that something went horribly wrong with the ARC limit removal. Sure enough, setting the limit back to 50% solved my performance issues.
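For anyone wanting to do the same, a rough sketch assuming the standard OpenZFS zfs_arc_max module parameter (runtime-only; it reverts on reboot):

```sh
# Cap the ZFS ARC at 50% of physical RAM, in bytes.
# MemTotal in /proc/meminfo is reported in kB.
half_ram=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 / 2 ))
echo "$half_ram" > /sys/module/zfs/parameters/zfs_arc_max

# Verify the new cap
cat /sys/module/zfs/parameters/zfs_arc_max
```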

It’s good to hear that the multi-gen LRU changes in the 6.6 kernel seem to be the cause here. I’ll try with swap re-enabled, the 50% ARC size limit removed, and multi-gen LRU disabled. It seems like that is the preferred resolution to these issues, correct?

2 Likes

Yes.

1 Like

Also sysctl -w vm.swappiness=1 if you want to mirror the DF nightly changes.
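Putting both tweaks together, a sketch of applying them by hand (the lru_gen sysfs path is the stock kernel 6.x multi-gen LRU interface; both settings are runtime-only, so a post-init script would be needed to persist them across reboots):

```sh
# Disable the multi-generational LRU (mainline kernel 6.1+ interface)
echo n > /sys/kernel/mm/lru_gen/enabled

# Bias reclaim strongly toward file pages rather than swapping anon pages
sysctl -w vm.swappiness=1

# Verify: enabled reads back as a bitmask, 0x0000 when disabled
cat /sys/kernel/mm/lru_gen/enabled
sysctl vm.swappiness
```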

5 Likes

I know it’s off-topic for this thread, but just to respond to your point: DragonFish is the only version of SCALE currently eligible for support from us, but our apps are still working fine on Cobia at present.

Prior to DF’s release, we made a legacy, Cobia-specific branch of TrueCharts’ apps available for people staying on Cobia longer-term. Nobody has been forced to move to DragonFish yet to keep apps updating etc., but once apps stop working on Cobia, users wanting to remain there longer-term (for whatever reason) can move to the legacy apps branch.

I’m not personally encouraging people to migrate to DF until the .1 update drops and resolves the already-discussed issues with it.

2 Likes

True, they are.

Just to be clear, this involves removing and re-adding the TC catalog, right? I don’t see a way to edit the existing catalog to use a different branch.

1 Like

3 days and all fine here; in fact, rather than going back to Core, I’m planning to stay with Scale. Writes appear to be faster, and iSCSI connections are more solid. Pretty happy with Scale after the initial problems.

Thanks for everyone’s help and work on this; I learned quite a bit too.

CC

4 Likes

What is the reason for setting vm.swappiness=1?

This would bias memory reclaim against anonymous pages and towards file pages. Isn’t this the exact opposite of what we would want for a NAS use case?

IMHO, for a NAS under memory pressure, it would be better to reclaim unused anonymous pages instead of file pages that could hold data that might soon be needed.

Ref: In defence of swap: common misconceptions
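To make the anonymous-vs-file distinction concrete, here’s a quick way to eyeball both pools on a running box using standard /proc/meminfo fields (note that on ZFS systems the ARC is accounted separately from the regular page cache):

```sh
# Anonymous pages (process heaps etc.) vs. file-backed page cache
grep -E '^(AnonPages|Cached|Buffers|SwapTotal|SwapFree)' /proc/meminfo

# Current reclaim bias: low values avoid swapping anonymous pages
cat /proc/sys/vm/swappiness
```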

2 Likes