Persistent recurring alert after upgrade to 24.04.0 "Failed to check for alert ScrubPaused: Failed connection handshake"

Ever since my upgrade to 24.04.0, I’ve been getting email alerts several times a day. They usually clear within a minute, but I don’t like getting used to ignoring alerts and want to know if there’s something I can do to correct it. The alert is:

Failed to check for alert ScrubPaused: Failed connection handshake

I searched and found some reports of people experiencing this transiently, but I haven’t seen a solution anywhere.

I’m seeing the exact same thing. I’m considering completely reinstalling the OS and restoring from a backup config, because I’m also having some strange issues where I cannot log in when there is a moderate load on the network card. This happens even though my “storage traffic” and my UI/management traffic are on completely separate NICs. I never had the issue before the upgrade.

I have also had this error since I upgraded to 24.04.0.

This error (and those like it [failed to check for … connection handshake…]) is a symptom of resource exhaustion of some kind. Given the version(s) mentioned above and in several related JIRA tickets that have been logged, it is possible this issue would be resolved by upgrading to SCALE-24.04.1 or by applying the workarounds for the swap/memory condition. If anyone has seen this issue after upgrading to .1, that would be useful to know.
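
For anyone who wants to see whether swap is actually in play when the alert fires, here is a minimal Python sketch. It assumes a Linux host that exposes /proc/meminfo (SCALE is Linux-based, so it should). The function names are made up for illustration, and this only reads counters; it is not one of the workarounds.

```python
from pathlib import Path

MEMINFO_PATH = Path("/proc/meminfo")  # standard Linux location


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo text into {field name: integer value}.

    Most fields are reported in kB; a few are plain counts.
    """
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(info: dict) -> int:
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__":
    try:
        info = parse_meminfo(MEMINFO_PATH.read_text())
    except OSError as exc:
        print(f"could not read {MEMINFO_PATH}: {exc}")
    else:
        print("MemAvailable kB:", info.get("MemAvailable"))
        print("Swap used kB:   ", swap_used_kb(info))
```

Running it a few times around the moment an alert arrives would show whether available memory dips or swap usage climbs, which is the kind of evidence that would support or rule out the resource exhaustion theory on a given box.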

Can you expand on those workarounds? I recently upgraded my NAS from 16GB->64GB RAM (did not correlate with the alerts, this happened after my upgrade to 24.04.0). I’d be interested in trying that, but I’m not sure what I’m looking for.

I’ll see if I can do the .1 upgrade shortly.

My thread about another issue had some sysctl values you can change to improve this. It is resource exhaustion. I don’t remember which one exactly.

I’m not seeing how this could be resource exhaustion for me. I have a dual-socket system with 128 GB of RAM, and only 30ish GB is used for services. Everything runs off SATA or NVMe SSDs. I have yet to upgrade to .1.

Only reason I even notice the error is because of email alerts. The error clears within 1 minute of it triggering. Seems like a timeout is too short imo.

Upgraded to 24.04.1, gonna let it sit for a couple days to monitor.

I have a 64-core CPU and 512 GB of RAM with only 40 GB used for services, apps all running from a 10 x Optane pool, and I still ran into sysctl limits. But maybe if you do nothing the problem will fix itself.

Since going to 24.04.1 this hasn’t happened again; for me this is SOLVED.

In 24.04.0 lru_gen and zfs_arc have a fight over free memory and swap. Your system loses.

24.04.1 disables lru_gen, so zfs_arc wins. Yay.
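
If you want to confirm which state your own system is in, here is a rough Python sketch. It assumes the kernel was built with multi-gen LRU support, in which case the state is exposed at /sys/kernel/mm/lru_gen/enabled as a hex bitmask where bit 0x0001 is the main switch. The function names are just for illustration, and whether the file exists on your particular build is an assumption.

```python
from pathlib import Path

# Kernel sysfs knob for multi-gen LRU; absent if the kernel lacks the feature.
LRU_GEN_PATH = Path("/sys/kernel/mm/lru_gen/enabled")


def parse_lru_gen(text: str) -> bool:
    """Return True if the main lru_gen switch (bit 0x0001) is set.

    The file holds a hex bitmask such as "0x0007" or "0x0000".
    """
    return bool(int(text.strip(), 16) & 0x0001)


def lru_gen_enabled(path: Path = LRU_GEN_PATH):
    """Return True/False for the current state, or None if it cannot be read."""
    try:
        return parse_lru_gen(path.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    state = lru_gen_enabled()
    if state is None:
        print("lru_gen state unavailable (file missing or unreadable)")
    else:
        print("lru_gen is", "enabled" if state else "disabled")
```

On a box that has taken the .1 update, the expectation from the post above would be "disabled", but treat that as something to verify rather than a given.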