Dragonfish 24.04.1.1 hangs during boot after upgrade from 24.04.0

nco · June 4, 2024, 2:08pm

I have a Proxmox hypervisor (8.2.2) running a single VM for TrueNAS. Disks are on a SATA controller that is passed through in hardware. It was on Cobia until a month or so ago, and then upgraded to Dragonfish 24.04.0. All running fine.

This afternoon, I tried the upgrade to 24.04.1.1, and it hung after rebooting, just after the Finished zfs-mount.service line. CPU usage on the VM was pegged at 67%; the VM has 3 CPU cores allocated, so this is 2 cores at 100%. It stayed this way for 30 mins, unresponsive. Can’t SSH in, no shell to interact with. Tried to gracefully shut down / reboot using Proxmox controls and nothing worked.
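As a quick sanity check on that figure, here is a minimal sketch of the arithmetic. The function name is hypothetical, and it assumes the hypervisor reports usage as a percentage of all allocated vCPUs combined:

```python
def saturated_cores(aggregate_percent: float, vcpus: int) -> float:
    """Convert an aggregate CPU percentage into an equivalent number of fully busy cores.

    Assumes the percentage is relative to all allocated vCPUs together.
    """
    return aggregate_percent / 100.0 * vcpus


# 67% of 3 vCPUs is roughly 2 cores running flat out.
busy = saturated_cores(67, 3)
print(round(busy, 2))
```

Two of three cores at 100% would read as about 66.7%, which matches the 67% shown.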

I’ve rebooted and started up using the 24.04.0 boot environment and all is ok again. I’ve checked /var/log/syslog and kern.log, but neither has any record of the failed boot. Both show the shutdown before the upgrade and the boot after reverting the boot environment. If anyone can point me in the direction of alternate logs, I’ll post them.

This seems to be very similar to this report; however, my VM has 48GB RAM assigned.

Stux · June 4, 2024, 9:45pm

Yeah. Sounds like the issue I was having when TNS was the hypervisor.

I spent three days trying to make it work.

I needed to use 32GB to import an essentially empty pool. And I’m not even sure if that was reliable.

Only solution I could find to boot with the same RAM requirements as 24.04.0 was to use 24.04.0

Try booting without the controller passed through, just to see if you can get to the console.

sfatula · June 5, 2024, 4:50am

Any reason to not submit a bug report? I suspect iX would want to find the cause and fix it, given that everything is fixing to change with Electric Eel and more folks might be wanting to try it out or use the VM of Eel to migrate their stuff and test it before live updating.

Stux · June 5, 2024, 5:03am

Was working on it

https://ixsystems.atlassian.net/browse/NAS-129406

nco · June 5, 2024, 9:47pm

I tried… but I got shot down without them even reading past the first two lines.

Which is especially annoying when @HoneyBadger was only extolling the virtues of VMs a few days ago and the linked blog post emphasises how much their team use them too.

I was happy to help work to find the bug that’s obviously been introduced in the last update, but it’s frustrating when it’s ignored.

sfatula · June 5, 2024, 11:33pm

Maybe Stux will have better success, after all, he’s an MVP!

Stux · June 5, 2024, 11:36pm

Didn’t stop my other two regression reports being shot down with excuses

Reports of 24.04.1.1 having unstable guest networking and hanging on boot are building… check Reddit too.

I guess eventually there will be a critical mass and then a fire drill.

That’s usually the way it works.

kris · June 6, 2024, 4:27pm

Both issues are interesting and we are trying to keep tabs on how often / where they are reported. As always, details matter, so let’s gather up as much as we can and see if anything stands out as an important clue. We did investigate and have a resolution to the one issue with super long boot times, but it is likely unrelated:

https://ixsystems.atlassian.net/browse/NAS-128561

I’m wondering if the kernel update in 24.04.1 (to fix SATA port multipliers) broke something. We’ve just merged in a further update to the kernel that will be in Dragonfish nightly images soon (tomorrow). Would those with a reproduction case be willing to give it a whirl to see if behavior changes?

kris · June 6, 2024, 4:35pm

Also have to check if the LRU memory change contributed to this somehow. Long shot, but worth looking into.

HoneyBadger · June 6, 2024, 6:31pm

We tend to use XCP-NG and VMware for our TrueNAS VM testing, and I’ll note personally that I’ve got several 24.04.1.1 VMs going without issue.

It’s possible that there’s a KVM-specific problem here, possibly introduced as a result of the kernel changes in the .1 release.

sfatula · June 6, 2024, 7:14pm

My concern is we have a lot of users who may want to actually test Eel before deploying. Due to getting rid of Kubernetes, probably more than any other release. I sure will, and I never tested before. So, if this remains a problem on Eel, that may be a barrier to adoption of Eel. There is no way I will ever migrate to Eel if I can’t get a test system up and running to potentially modify my stuff. And I am not buying another server just to do that, as many home users will not do either.

HoneyBadger · June 6, 2024, 7:25pm

Hmm. Fresh install of 24.04.1.1 did its first boot clean, but subsequently hung up on middlewared load. Host is a 24.04.1.1 machine as well.

kris · June 6, 2024, 8:15pm

Great, since we have repro cases now I expect we’ll make some forward progress on this soon!

Stux · June 6, 2024, 9:52pm

Yes. I noticed a few potentially interesting commits in the delta.

Was about to try bisecting kernels backwards.

I figured I could use nightlies

But the nightlies don’t track all the kernels

So, if TrueNAS is an appliance… in order to bisect kernel bugs from upstream we need to be able to:

drop in test kernels
have test kernels to test.

So, what I’m getting at is we need not just a nightly train but a kernel bleeding edge train too.

Is it possible to have an automated “the nightly but with a bleeding edge kernel” build?
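For anyone unfamiliar with bisecting, the idea can be sketched roughly as below. This is only an illustration: `first_bad` and `boots_ok` are hypothetical names, the kernel labels are placeholders, and in practice `boots_ok` would be a manual step (install the test kernel, reboot the VM, see whether it hangs). It assumes the oldest kernel in the list is known good, the newest is known bad, and everything after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], boots_ok: Callable[[str], bool]) -> str:
    """Binary-search an ordered list of builds for the first one that fails.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    every build after the first bad one is bad too.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots_ok(versions[mid]):
            lo = mid  # regression is somewhere after mid
        else:
            hi = mid  # regression is at mid or earlier
    return versions[hi]


# Placeholder labels, not real kernel versions.
kernels = ["k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"]
culprit = first_bad(kernels, lambda k: int(k[1:]) < 5)
print(culprit)
```

With eight candidate kernels this needs only three test boots, which is why having every intermediate kernel available to drop in matters so much.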

Stux · June 6, 2024, 10:01pm

I don’t think it’s the lru_gen disabling, but it might be the swap disabling

IIRC, memory usage can spike when importing a pool

Stux · June 7, 2024, 3:10am

I will as long as I can safely switch back to my previous boot environment. Which I think I should be able to.

kris · June 7, 2024, 1:31pm

Ok, looks like a nightly image built which should have the newer 6.6.32 kernel now, updated from 6.6.29:

ISO Image:
https://download.sys.truenas.net/truenas-scale-dragonfish-nightly/TrueNAS-SCALE-24.04.2-MASTER-20240607-013916.iso

Manual Update File:
https://update.sys.truenas.net/scale/TrueNAS-SCALE-Dragonfish-Nightlies/TrueNAS-SCALE-24.04.2-MASTER-20240607-013916.update

These can both be used for testing to see if the issue has been resolved or behaves in any new manner.

EDIT: Had wrong links by mistake, updated to the proper link with 6.6.32.
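For testers who want to confirm the VM really booted the newer kernel before reporting results, a small check along these lines may help. It is a sketch only: the function names are hypothetical, and it assumes the release string starts with a plain `major.minor.patch` prefix (whatever suffix the TrueNAS build adds is ignored).

```python
import platform
import re


def kernel_version(release: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the start of a kernel release string."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = (int(p) for p in m.groups())
    return (major, minor, patch)


def has_newer_kernel(release: str, minimum: tuple[int, int, int] = (6, 6, 32)) -> bool:
    """True if the release is at least the 6.6.32 kernel mentioned above."""
    return kernel_version(release) >= minimum


def running_kernel_is_new() -> bool:
    """Check the kernel of the machine this runs on."""
    return has_newer_kernel(platform.release())
```

Running `running_kernel_is_new()` on the test VM should return True on the nightly and False on 24.04.1.1 with 6.6.29.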

HoneyBadger · June 7, 2024, 2:52pm

Success with the same config (24.04.1.1 host) and the nightly (24.04.2-20240607)

Stux · June 8, 2024, 4:08am

I’ve been able to successfully boot a clean install of the nightly 6x in a row…

will test further

nco · June 20, 2024, 11:47am

I’ve just been able to test; the manual update works fine, so that looks like a fix. Thanks, Kris.