Going insane, TrueNAS web UI dies, again

Thank you, I’m having similar issues after recently upgrading from Cobia to Dragonfish. From day one the UI has been unresponsive, to the point that most of the time I can’t log in to it at all. The terminal is fine.

Also receiving warning emails multiple times a day:

Failed to check for alert ScrubPaused: Failed connection handshake
Failed to check for alert ZpoolCapacity: Failed connection handshake

All new since Dragonfish.
The server has only 16 GB of RAM, but it doesn’t serve apps; only ZFS and SMB are actively used (i.e. it’s just a storage server).
Looking at htop on Dragonfish, RAM usage was 15.6 GB plus 900 MB of swap.
I’ve applied the zfs_arc_max change to 50% of RAM as a pre-init script and now it’s running much better (the UI is responsive, no issues logging in), and the alert emails have also stopped.
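
In case it helps anyone else, this is roughly what that pre-init script looks like (a minimal sketch, assuming the standard OpenZFS zfs_arc_max module parameter and the usual /proc and /sys paths):

    #!/bin/sh
    # Cap the ZFS ARC at 50% of installed RAM (zfs_arc_max is in bytes).
    total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $(( total_kib * 1024 / 2 )) > /sys/module/zfs/parameters/zfs_arc_max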

4 Likes

We’re seeing a pattern here. :point_up:


Also, somewhat related (or maybe not), I can’t help but notice this, still being on Core myself:

The fact that any swap is being used at all means that memory is not being handled gracefully. But levels of 1 to 2 GiB of swap for a simple server with only ZFS + SMB? :no_mouth:

No one’s gaming or doing intensive number crunching or editing 4K videos. This is a server…

Now the question is: is it the swap in itself that breaks things, or is the free-memory buffer that ARC leaves behind too small?

Ask the upstream ZFS developers why they set the default maximum ARC size for FreeBSD at “1 GiB less than total RAM”, whereas for Linux it is “50% of total RAM”. :wink:

Swap being used in these Dragonfish cases is likely only a symptom of, or correlated with, the flaky memory management when the ARC is not restricted.

In my setup I didn’t set up any swap. I use 16 GB of DDR4.

The only issue I noticed in the recent TrueNAS build was some sort of Java memory error when I ran htop :thinking:


I have 20 active Docker containers.

The UI slowdown seems to be gone for me. I had done a fresh reinstall of TrueNAS.

I was wondering if I needed to upgrade to 2x16 GB for 32 GB of DDR4 in dual channel, but things seem to just work on my setup, so I held off on that.

For Docker containers, I add a hard memory limit to some of the apps. If you set the limit too low, some of these containers will crash, so it takes some fine-tuning to know what limit to set.
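
To illustrate the general idea with plain docker run syntax (the container and image names here are made up, and the 512 MiB figure is just an example to tune per app):

    # Hard-cap a container at 512 MiB of RAM; setting --memory-swap equal to it
    # also stops the container from dipping into swap.
    # If the limit is too low, the app inside gets OOM-killed, so watch it and adjust.
    docker run -d --name example-app --memory=512m --memory-swap=512m example-image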

1 Like

There’s another pattern emerging. This is the 4th or 5th report I’ve read between here and Reddit where the problem is claimed to be resolved by a fresh install (something I like to do anyway). I still think there are several issues here, not one.

It would be interesting to hear back as time passes: is it still OK after a few days or a week? Did they also change the ARC limit?

1 Like

Thanks for the tip! I knew this from Lawrence’s system guide, but I was afraid to change it on 24 for fear of unintended consequences. My friend also suggested just turning off swap completely with sudo swapoff -a, but I am not ready to test things on my live server just yet. :S That said, reverting to 50% doesn’t “seem” like it would cause any issues, and I am indeed tempted to try it. But why 50% specifically? Because it’s what 23 uses by default? What if you manually set it to 90% or 85%? That way we could isolate whether Dragonfish is going insane because more than 50% is allocated to ARC, or whether it’s Dragonfish’s dynamic ARC allocator that is at fault.
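
In the meantime, I figure I can at least look at what the ARC is doing without changing anything (read-only, assuming the standard OpenZFS stats locations are present on the build):

    # Current ARC size and ceiling, in bytes.
    awk '$1 == "size" || $1 == "c_max" {print $1, $3}' /proc/spl/kstat/zfs/arcstats
    # The configured module parameter; 0 means "use the built-in default".
    cat /sys/module/zfs/parameters/zfs_arc_max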

Mine is a fresh install with basically all default settings and I am experiencing the issue. Yes, they changed the ARC setting. Back in 23 the default behavior on Debian was to use 50% of system memory as ARC; now they’ve upped it to around 90-95%.

This makes sense but is also strange. Have we heard of anyone using this tweak back on 22 / 23, setting it to more than 50%, who experienced similar issues? What if you just set it to 85% or 90% manually? Could it be that Dragonfish’s “dynamic allocation” is at fault? If so, maybe it doesn’t have to be 50%; maybe any percentage works as long as it’s manually fixed. I am too afraid to test on my live server, so if you want to, it wouldn’t be a bad idea to give it a try.
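
For anyone who does try: my understanding is that zfs_arc_max can be changed (and reverted) at runtime without a reboot, so testing an intermediate value like 85% doesn’t have to be a one-way door. A rough sketch, assuming the standard module parameter path:

    # Set the ARC ceiling to ~85% of RAM (value in bytes); echo 0 to fall back to the default.
    total_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    echo $(( total_kib * 1024 * 85 / 100 )) > /sys/module/zfs/parameters/zfs_arc_max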

I am 1000% aware that they (iX) changed the default. I was speaking of the USER changing the ARC settings, not iX. My ARC has been set to 70% for a very long time without issue (I’d say 6 months, though on Cobia), and many testers of Dragonfish did not have any issue. There are competing data points: we have people who had trouble and a fresh install fixed it, and we have people who did no fresh install but an ARC change seemed to fix it. Those do not point to the same conclusion. I personally have 4 VMs, 19 apps now, a system busy virtually 24/7, only 64 GB of RAM, 70% ARC, and no issues, but on Cobia. A decent number of users too. The difference there is that I came up with the 70% based on my specific workload, with a human (me) looking at the data, versus a system trying to adjust it on its own, on Linux, which does not have the world’s (shall we say) leading memory management. Systems running OpenZFS often just default to the 50% limit, so it’s up to YOU to change it to what you want, based on the system admin’s skill.

The reason for the 50% as it was before isn’t iX; they just use OpenZFS, and OpenZFS decided on the 50% limit for Linux for good reasons long ago. However, since that time many things have improved, and the question today is whether 50% is really still important, or whether the issues that originally caused the problem have been mostly or entirely resolved. It’s not a static decision; just because it was bad many years ago doesn’t mean it still is (despite the very valid original reasons). It would appear thus far that there are systems where it may indeed still matter, but it’s also not all systems.

The fact that a fresh install did not work for some doesn’t mean it doesn’t solve it for others. The fact that some didn’t change the ARC but re-installed and have no issues also doesn’t mean it didn’t work for them.

The problem for iX to solve is whether they have a way to determine what causes the issue on some systems but not others. I’m not sure they can, but I am hoping so. In another thread there is a guy with 1 TB of RAM who has the issue, and his memory wasn’t even remotely close to full; what about him? And there is also the bug report of a memory leak in the middleware that causes the issue, etc. Tough problem.

Then there is the issue of how much memory systems have. It could generally be said that for smaller-RAM systems 50% is possibly too high, and for huge-memory systems it’s likely way too low. I mean, on an 8 GB RAM system, is 50% (4 GB) really appropriate? It’s possible, I guess, but it seems high. But on a 1 TB RAM system, do you really want, say, 400 GB of memory wasted and not being used at all? There are outliers too. Man, it’s 4 AM and I seem to be making all sorts of typos; hopefully I got all the facts right.

I’m not sure I’ve ever read of someone on Cobia who increased the ARC limit and had issues. I don’t recall one, but maybe there was, as I certainly don’t read every TrueNAS post; I primarily follow OpenZFS. I believe a bunch of people tried upping it (like me) and it worked, so with testing going mostly well (although the TrueCharts docs guy had this issue with the beta), iX went for it (my take). I had long run >50% ZFS ARC pre-TrueNAS on Debian without issue, on many systems at many companies. It’s a tough call really, as they want to have no limit: there were many, many complaints about the 50% limit from people who don’t want to change it themselves.

I’ve always been able to trigger these issues by rsync’ing a large number of files from one pool to another. I have 512 GB of RAM and there’s always 25-30 GB free. Swap still gets used, which I don’t understand, but I haven’t got around to removing it. I found some ‘failed connection’ and ‘failed page allocation’ errors in the logs. I’m wondering if there are multiple causes for this issue.
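
For anyone who wants to compare notes, this is roughly how I went digging for those errors (read-only, standard kernel log tools; the patterns are just what I grepped for):

    # Kernel-side complaints around a heavy rsync run.
    dmesg -T | grep -iE 'page allocation|out of memory|blocked for more than'
    # Same idea via the journal, limited to kernel messages from the last hour.
    journalctl -k --since "1 hour ago" | grep -i 'page allocation'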

1 Like

It’s probably the dynamic adjustment that breaks things. Indeed, setting it manually to 70%, which you know leaves enough space for your apps, works fine.

Perhaps a feature where it starts at 50% and then re-evaluates after a week and adjusts the setting up/down would be interesting, instead of trying to dynamically adjust.

The ARC’s “high-water” under FreeBSD remains at “RAM minus 1 GiB” at all times. It never “adjusts” this parameter on its own. The only “adjusting” is the ARC “target size”, which will always be less than (or equal to) the high-water.

For FreeBSD, this is no problem. (Cue endless argument of Linux vs FreeBSD memory management, especially in regards to ZFS systems.)

For Linux? This is a problem. Hence, the high-water is limited to “50% of RAM” by default, upstream. It will still dynamically adjust the ARC “target size”. So there’s lots of “adjusting” happening in real-time. The only difference is the maximum target size allowed, which was previously restricted (intentionally) by upstream OpenZFS for Linux.

Some people on Linux can get away with 50%. Some at 75%. Some maybe even at 90%. However, this is a non-issue for FreeBSD systems, which default to “RAM minus 1 GiB”, a very, very high ceiling.

For example, on a system with 128 GiB of RAM, a high-water (“max allowed”) value of 127 GiB is 99.21% of total RAM. (It’s around 98.5% for systems with 64 GiB of RAM.)

Trying to “handle it like FreeBSD does” means the equivalent of increasing the high-water to 98% or 99% for many systems. I REALLY DOUBT that anyone using SCALE Cobia ever did this. They probably set it to around 75% or even as high as 90%. I never heard of anyone boosting it to 99%.

So these anecdotes of “Well, I know some people that overrode this parameter on Cobia, and they didn’t have any issues like we’re seeing on Dragonfish.” Let me ask you this: Did they override the parameter to allow up to 99% of RAM for the ARC? If not, then no, it’s not the same thing as what Dragonfish introduced.

Dynamic adjustment is what OpenZFS does, not TrueNAS. I was answering SnowReborn’s question.

@winnielinnie is correct about FreeBSD. But setting it to 99%, or RAM minus 1 GB, would never mean it actually used 99% on FreeBSD either. Other things will use far more than 1 GB, at least on SCALE, so the ARC would never grow to that size. It does mean FreeBSD will manage it automatically without fail, though. The main difference then becomes: if “essentially no limit” is impossible on Linux, one has to manually figure out the top limit on Linux (as always in the past) versus not worrying about it on Core. Core would therefore undoubtedly have a slightly higher maximum ARC usage in that case, as it could use every last ounce of memory versus some preset limit.

Regarding my comment about raising my limit, you must not have read my whole post. SnowReborn ASKED if anyone had done this (adjusted to >50%), so I responded yes, with details. That was not a comment saying there is no issue on Dragonfish at all! It was merely responding to his question, and the answer is correct; it’s not an anecdote as you presented it. I specifically stated, in extreme detail, that it’s not the same thing. That is, if you were speaking of me. If not, sorry.

I don’t believe that’s the problem we’re seeing. I believe there’s a “threshold” where ARC vs. non-ARC pressures are in conflict, which cannot be handled gracefully by Linux’s memory management, and hence more reliance on swap (even in the presence of “free” memory), as well as other less predictable issues we’re seeing manifest: I/O eventually comes to a crawl, sluggish web UI, system lock-ups, etc.

The seemingly unrelated things, such as the sluggish web UI, the I/O issues, and the lock-ups, have been resolved by multiple users who simply overrode the parameter to a value that is approximately 50% of RAM. (I’d wager they’d see the same improvements if they chose 75% of RAM as well.)

There’s a certain threshold (as this value approaches 99%) where simply having a high-water ceiling of “X amount” is enough to make Linux + ZFS trip over itself under certain workloads… even if such RAM is never outright consumed.

It’s odd that their systems are swapping 1 - 2 GiB at any time. Why would a 128 GiB system even swap at all? You almost never saw this with Core.

1 Like

Cross posting this here to see if anybody has a reproduction case they want us to help take a look at.

2 Likes

You keep misunderstanding what I am saying (the comment you quoted); perhaps I am missing some words you want to see. I agree about the graceful handling and all that. I was merely pointing out that on FreeBSD, using Core, the ARC would never actually grow to 99%, even though that’s theoretically possible since it’s the (default) maximum. That’s it! Unless everything other than ZFS could fit within the 1 GB on Core, which, if you move that concept over to SCALE, would not be possible. Not sure how else to say it.

I am not saying there isn’t a memory pressure issue; I never have (intentionally). And yes, again, my post clearly stated (to me at least) that manually picking the limit based on knowledge of your system works just fine. And I stated that this is not at all the same as the system deciding to raise the limit to whatever it thinks it can use, as on Core and now SCALE. ZFS seems slow to release memory on Linux; it’s often behind. That’s exactly why I keep swap space and think it’s wise to.

Every Linux system I’ve run has consumed some swap space, but we never actually see it doing much, if any, swapping in or out. Adjusting swappiness can help with that on Linux and prevent needless swapping out of things that likely shouldn’t be swapped. My Cobia system, and the system before it, all used swap and still do, without performance issues, so I presume whatever is being swapped out must not be terribly useful (active stuff), as I’ve never seen any swapping back in. Other systems may never swap out at all, as you most likely see on FreeBSD.

Here’s the definition:

“This control is used to define how aggressive (sic) the kernel will swap memory pages. Higher values will increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone.”
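
For anyone who wants to experiment with it, a minimal sketch of checking and lowering swappiness (the runtime change is lost on reboot; on SCALE the cleaner route is probably a sysctl tunable in the UI, and the sysctl.d line below assumes that path is writable on your install):

    # Show the current value (the Linux default is typically 60).
    sysctl vm.swappiness
    # Lower it for this boot only; smaller values make the kernel less eager to swap.
    sysctl -w vm.swappiness=10
    # Persist across reboots.
    echo 'vm.swappiness=10' > /etc/sysctl.d/99-swappiness.conf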

1 Like

I have the power of anime and mods on my side! I don’t have to take that attitude from you!

I can act as nasty and mean as I want, and the good mods here will have my back! :muscle:

(One second. Just received a private message.)

Okay. So ignore what I just said.


In seriousness, I agree with what you wrote. And let this be a lesson about being careful when upgrading to versions that end in .0. :wink:

2 Likes

:rofl:

I never updated macOS to x.0, never updated Debian to x.0, never update iOS to x.0, never update SCALE to x.0, never updated FreeBSD to x.0; you get the picture. To me, this is very wise logic. If it were somehow necessary, I would use a test system first, with good test scripts and cases. Heck, I don’t even update apps (in my case custom apps) automatically, as again they may be a .0 release.

On SCALE, I monitor issues like this and will never update until they are resolved, all the key things. For me, an x.0 release is a little better than a beta.

2 Likes