Dragonfish 24.04.1.1 GUI goes unresponsive

I’m not sure whether this is related to the memory leak, but after updating to Dragonfish 24.04.1.1 and applying the fix that disables lru_gen, my system goes into la la land about every 24 hours. The main admin GUI goes unresponsive and I can’t log in or SSH to the main system, but all my virtual machines stay running without issue. Only the admin GUI and my apps go unresponsive. The only way to get back control is to remote power cycle the server. Is anyone else seeing this?

Have you tried any sort of debugging/investigation using the local console or IPMI when it’s inaccessible?

Supermicro 64GB SSD DOM boot drive

Are you able to check the wear-leveling or total writes pushed to these? Based on the vintage of the rest of the components in your system, the boot device might be worn out.
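For anyone wanting to check this from a shell, here is a minimal Python sketch that picks the wear-related rows out of `smartctl -A` output. The attribute names it matches (`WEAR_HINTS`) and the example device path `/dev/sda` are assumptions; vendors name these attributes differently, so look at the full table if nothing matches.

```python
import subprocess

# Assumption: substrings that commonly appear in wear/write attribute names.
# This list is illustrative, not exhaustive, and varies by vendor.
WEAR_HINTS = ("wear", "lifetime", "percent_life", "total_lbas_written", "host_writes")


def wear_attributes(smartctl_output: str) -> dict[str, str]:
    """Return {attribute_name: raw_value} for wear-related rows of `smartctl -A` text."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() check.
        if len(fields) >= 10 and fields[0].isdigit():
            name = fields[1]
            if any(hint in name.lower() for hint in WEAR_HINTS):
                found[name] = " ".join(fields[9:])
    return found


def read_smart_attributes(device: str = "/dev/sda") -> str:
    """Run `smartctl -A` on a device (placeholder path) and return its text output."""
    # smartctl uses a bitmask exit status, so a non-zero code is not treated as fatal here.
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Usage would be something like `print(wear_attributes(read_smart_attributes("/dev/sda")))`, run as root and with the device path swapped for the actual boot device.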

Not really, as the old IPMI console doesn’t work any longer; it requires a Java console, which has been deprecated for some time.

What about using a keyboard and monitor to investigate?

I have not. And yes, while it’s old hardware, the SSD DOM drives were fairly new and performed without issue up until I moved to the release version of Dragonfish. My other system running CORE is identical and running without issue. Previously this system ran the RC of Dragonfish without issue. It only started having problems after moving to the release version, as with other folks.

Checked that as well. Nothing on the console but the normal number menu of options.

I’m talking about doing things like getting a shell and checking whether it is responsive, investigating networking issues, etc.

I have actually not tried the shell from the console, only remotely via SSH or the web GUI. From what I’ve seen, the network stack is up and working: I can ping the system from others and get a response, but when I try to log in it just sits. The web GUI also just spins, so the connection is there, as it doesn’t say not responding. I have an SSH session going with htop on the advice of one of the admins here, and have opened a debug ticket with my debug file.

Locked up again this weekend. I checked the shell from the console and it was unresponsive, i.e. I selected the shell option from the menu and it just sat there and would never give me a prompt to do anything. The htop I was running in an SSH session appears to have died and kicked me out, but it would not give me a prompt either; it just sits there.

image

Managed to grab a copy of my htop before it crashed again. Looks like the ARC is filling up.

I replied elsewhere, but that screenshot shows a LOT of free memory still. Your previous message shows that you can’t even launch htop when it goes catatonic. I’m looking for something pointing to 100% memory utilization. I am a tad concerned about the SATA DOM you are using. Those are notorious for failures, and it could be that the boot device is also stopping IO for entirely different reasons, which would better fit the symptoms described so far.
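One way to capture that kind of evidence is to log memory and ARC size at intervals so the last entries before a lockup can be read afterwards. This is a minimal sketch, assuming a Linux system where `/proc/meminfo` and the OpenZFS kstat file `/proc/spl/kstat/zfs/arcstats` are readable; the function names and the one-line summary format are made up for illustration. As far as I know, ARC is not counted in `MemAvailable`, so a large ARC shows up as "used" here even though it is normally reclaimable.

```python
def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {field: value}; most values are in KiB."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def parse_arc_size(text: str) -> int:
    """Return the ARC `size` value in bytes from arcstats text, or 0 if absent."""
    for line in text.splitlines():
        fields = line.split()
        # Data rows have three columns: name, type, value.
        if len(fields) == 3 and fields[0] == "size" and fields[2].isdigit():
            return int(fields[2])
    return 0


def memory_summary(meminfo_text: str, arcstats_text: str) -> str:
    """Build a one-line summary: percent of RAM not available, and ARC size in GiB."""
    mem = parse_meminfo(meminfo_text)
    total_kib = mem["MemTotal"]
    available_kib = mem.get("MemAvailable", 0)
    used_pct = 100.0 * (total_kib - available_kib) / total_kib
    arc_gib = parse_arc_size(arcstats_text) / 2**30
    return f"used={used_pct:.1f}% arc={arc_gib:.2f}GiB"


def read_text(path: str) -> str:
    """Read a small text file such as /proc/meminfo."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Calling `memory_summary(read_text("/proc/meminfo"), read_text("/proc/spl/kstat/zfs/arcstats"))` once a minute and appending the result, with a timestamp, to a file on a data pool rather than the boot device would leave a trail that survives the power cycle.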

If it was the SATA DOM, wouldn’t other services lock up besides just the GUI and SSH? All the other services on the box stay running and are not impacted.

No. Services already up and running in memory can stay running for quite some time. The symptom you usually see is that when you SSH in and try to run commands that are not already in the memory cache, they tend to hang. Middleware / WebUI are usually the first victims because they load specific things on-demand and periodically, so if boot goes catatonic they will halt.
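A rough way to test for this from a session that is already open is to time a small write-and-fsync against the boot device and flag it if it never returns. The sketch below is only an illustration: `probe` and `write_and_sync` are hypothetical helper names, and the target path is something you would have to choose yourself (it must live on the boot pool and be writable). A plain read would not be a good probe, since it can be served from cache without touching the disk.

```python
import os
import threading
import time


def probe(action, timeout_s: float) -> tuple[str, float]:
    """Run action() in a daemon thread; return ('ok' | 'error' | 'timeout', elapsed seconds)."""
    outcome = {"status": "timeout"}

    def worker():
        try:
            action()
            outcome["status"] = "ok"
        except Exception:
            outcome["status"] = "error"

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    elapsed = time.monotonic() - start
    # If the worker is still blocked after the timeout, report a timeout
    # regardless of what it may write into `outcome` later.
    status = "timeout" if thread.is_alive() else outcome["status"]
    return status, elapsed


def write_and_sync(path: str) -> None:
    """Write a few bytes and force them to disk so the call must reach the device."""
    with open(path, "wb") as handle:
        handle.write(b"probe\n")
        handle.flush()
        os.fsync(handle.fileno())
```

Something like `probe(lambda: write_and_sync(target_path), 10.0)` run every minute, with results printed to the SSH session, would show whether boot-device IO stalls before the GUI does. Note that a thread stuck in blocked IO cannot be killed; the helper only reports that it did not come back.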

@kris thanks for this, it helps. I’ve got a brand new SATA DOM that I can give a shot, but if I do it will be a fresh install, and that might skew the debugging if the problem goes away and it is actually tied to the upgrade path from CORE to Dragonfish RC to Dragonfish release.

Try adding one SATA device as a mirror of the other. This can be done in the GUI.

Then if you want you can remove the original device. Or not.

Wow… so what is iXsystems using these days as a boot drive in their Minis?

My MiniXL shipped with a 16GB SATADOM drive, IIRC. I have since upgraded to a mirrored pool of 64GB SM branded SATADOMs, likely twins to the ones used by @Spunky17

I’d be velcroing a proper SSD somewhere into the case.

Also, you may find this interesting:

https://ixsystems.atlassian.net/issues/NAS-100168

That’s a bit discouraging, especially since neither CORE nor SCALE GUI allow anything beyond a 2-way mirror. The CLI allows more, IIRC.

Not middlewared on SCALE when I tried it.

But unplug one. Then replace the removed device. And now you have a physical backup :wink:
