Persistent recurring alert after upgrade to 24.04.0 "Failed to check for alert ScrubPaused: Failed connection handshake"

Ever since my upgrade to 24.04.0, I’ve been getting email alerts several times a day. They usually clear within a minute, but I don’t like getting used to ignoring alerts and want to know if there’s something I can do to correct it. The alert is:

Failed to check for alert ScrubPaused: Failed connection handshake

I searched, and found some results of people experiencing this transiently, but haven’t seen a solution at all.

I’m seeing the exact same thing. I’m considering completely reinstalling the OS and restoring from a backup config, because I’m also having a strange issue where I cannot log in when there is a moderate load on the network card. This happens even though my “storage traffic” and my UI/management traffic are on completely separate NICs. I never had the issue before the upgrade.

I have also had this error since I upgraded to 24.04.0.

This error (and those like it [failed to check for … connection handshake… ]) is a symptom of resource exhaustion of some kind. Given the versions mentioned above, and several related JIRA tickets that have been logged, it is possible that this would be resolved by upgrading to SCALE-24.04.1 or by applying the workarounds for the swap/memory condition. If anyone has seen this issue after upgrading to .1, that would be useful to know.
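
For anyone who wants to see whether memory or swap is actually tight when the alert fires, here is a minimal sketch that reads /proc/meminfo, assuming a Linux system such as SCALE. The function names `parse_meminfo` and `memory_summary` are made up for this illustration and are not part of TrueNAS or its middleware.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return available memory and used swap, both in MiB."""
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "mem_available_mib": info.get("MemAvailable", 0) // 1024,
        "swap_used_mib": swap_used_kb // 1024,
    }


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(memory_summary(parse_meminfo(f.read())))
    except FileNotFoundError:
        print("/proc/meminfo not found; this sketch assumes Linux")
```

Running it a few times around the moment an alert arrives would show whether swap use is climbing while plenty of memory is still reported as available.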

Can you expand on those workarounds? I recently upgraded my NAS from 16GB->64GB RAM (this did not correlate with the alerts, which started after my upgrade to 24.04.0). I’d be interested in trying them, but I’m not sure what I’m looking for.

I’ll see if I can do the .1 upgrade shortly.

My thread about another issue had some sysctl values you can change to improve this. It is resource exhaustion. I don’t remember which one exactly.

I’m not seeing how this could be resource exhaustion for me. I have a dual-socket system with 128 GB of RAM, and only 30ish GB is used for services. Everything runs off SATA or NVMe SSDs. I have yet to upgrade to .1.

Only reason I even notice the error is because of email alerts. The error clears within 1 minute of it triggering. Seems like a timeout is too short imo.

Upgraded to 24.04.1, gonna let it sit for a couple days to monitor.

I have a 64-core CPU and 512 GB of RAM with only 40 GB used for services, apps all running from a 10 x Optane pool, and I still ran into sysctl limits. But maybe if you do nothing the problem will fix itself.

Since going to 24.04.1 this hasn’t happened again; for me this is SOLVED.

In 24.04.0 lru_gen and zfs_arc have a fight over free memory and swap. Your system loses.

24.04.1 disables lru_gen, zfs_arc wins. Yay.
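
If you want to check which state your own system is in, here is a rough sketch. It assumes the kernel exposes the multi-gen LRU switch at /sys/kernel/mm/lru_gen/enabled as a hex bitmask with bit 0 as the main switch, which is what the upstream Linux kernel documentation describes. I have not confirmed that every SCALE release exposes it the same way, and `lru_gen_enabled` is a name made up for this example.

```python
LRU_GEN_PATH = "/sys/kernel/mm/lru_gen/enabled"  # assumed path, per upstream kernel docs


def lru_gen_enabled(raw):
    """Return True if the main multi-gen LRU switch (bit 0) is set.

    `raw` is the file content, a hex bitmask such as "0x0007".
    """
    return bool(int(raw.strip(), 16) & 0x1)


if __name__ == "__main__":
    try:
        with open(LRU_GEN_PATH) as f:
            raw = f.read()
        print("lru_gen mask:", raw.strip(), "enabled:", lru_gen_enabled(raw))
    except FileNotFoundError:
        print("No lru_gen control file; this kernel may not include multi-gen LRU")
```

This only reads the setting and changes nothing, so it should be safe to run on a production box.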

This problem is present for me in 25.04.1, with 128 GB of RAM and dual Xeon 6132s (and typically tens of GB of RAM free, because SCALE still doesn’t keep ARC in RAM the way CORE did). Is this a new issue, or is it that the old one hasn’t been fixed?

I would expect this to have been fixed, so you should try filing a Bug Report.
Maybe they ‘unfixed’ it.

https://ixsystems.atlassian.net/browse/NAS-136556

I just saw this for the first time on my system last night. Specifically:
Failed to check for alert Quota: Failed connection handshake

My system also typically has large amounts of spare RAM, usually at least 20-30 GB.

I’ve had these failed connection handshake alerts for various services five times now over the last week.

Did you take a look at the JIRA ticket @dan had posted above? Something about a container with a lot of python zombie processes, in that case.

I have also started to see this recently. I don’t seem to have any zombie processes or child processes however.

You might want to use Report A Bug in the TrueNAS GUI. Smile icon on upper right for Feedback / Report A Bug. You could also create a new thread. This one is marked Solution and your setup may be different. Please include details on your system, if you open a new one.

I just got this alert on my system while it was running a scrub. The alert cleared after 1 minute, and the scrub is continuing to run without issues.
I am not sure whether I have any zombie processes. What should I look for to tell if a process is a zombie?

My system: 45labs HL15 (1.0)
128 GB of RAM
NVIDIA RTX A400
Xeon Silver 4216
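
On the zombie question: a zombie is a process that has exited but has not yet been reaped by its parent. On Linux it shows state Z (and usually `<defunct>`) in `ps` output. As a minimal sketch, assuming a Linux /proc filesystem, the script below scans /proc/<pid>/stat for processes in state Z and prints each one with its parent PID. The helper names `parse_stat` and `find_zombies` are made up for this example.

```python
import os


def parse_stat(stat_text):
    """Return (pid, name, state, ppid) from the contents of /proc/<pid>/stat.

    The process name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    name = stat_text[left + 1:right]
    fields = stat_text[right + 1:].split()
    return pid, name, fields[0], int(fields[1])


def find_zombies(proc_root="/proc"):
    """Return a list of (pid, name, ppid) for every process in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, name, state, ppid = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # the process went away, or we may not read it
        if state == "Z":
            zombies.append((pid, name, ppid))
    return zombies


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, name, ppid in find_zombies():
            print(f"zombie pid={pid} name={name} parent={ppid}")
    else:
        print("/proc not found; this sketch assumes Linux")
```

A handful of short-lived zombies is normal. A large and growing number under one parent PID is the sort of thing the JIRA ticket mentioned above seems to describe, and the parent PID tells you which process is failing to reap them.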