Dragonfish 24.04.1.1 hangs during boot after upgrade from 24.04.0

I have a Proxmox hypervisor (8.2.2) running a single VM for TrueNAS. The disks are on a SATA controller that is passed through in hardware. It was on Cobia until a month or so ago, and was then upgraded to Dragonfish 24.04.0. All running fine.

This afternoon, I tried the upgrade to 24.04.1.1, and it hung after rebooting, just after the Finished zfs-mount.service line. CPU usage on the VM was pegged at 67%; the VM has 3 CPU cores allocated, so this is 2 cores at 100%. It stayed this way for 30 mins, unresponsive. Can't SSH in, no shell to interact with. Tried to gracefully shut down / reboot using the Proxmox controls and nothing worked.
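For anyone checking the same arithmetic on their own VM, here is a minimal sketch of how an aggregate CPU percentage maps onto busy cores. The function name `busy_cores` is made up for illustration, and it assumes the hypervisor reports usage as a percentage of all allocated cores combined.

```python
def busy_cores(vm_cpu_percent: float, allocated_cores: int) -> float:
    """Convert an aggregate VM CPU percentage into an equivalent
    number of fully busy cores (assumes the percentage is measured
    across all allocated cores together)."""
    return vm_cpu_percent / 100.0 * allocated_cores


if __name__ == "__main__":
    # 67% of a 3-core VM is roughly two cores spinning at 100%.
    print(round(busy_cores(67, 3), 2))
```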


I've rebooted and started up using the 24.04.0 boot environment and all is OK again. I've checked /var/log/syslog and kern.log, but neither has any record of the failed boot. Both show the shutdown before the upgrade and the boot after reverting the boot environment. If anyone can point me in the direction of alternate logs, I'll post them.

This seems to be very similar to this report; however, my VM has 48GB RAM assigned.

Yeah. Sounds like the issue I was having when TNS was the hypervisor.

I spent three days trying to make it work.

I needed to use 32GB to import an essentially empty pool. And I'm not even sure if that was reliable.

The only solution I could find to boot with the same RAM requirements as 24.04.0 was to use 24.04.0.

Try booting without the controller passed through, just to see if you can get to the console.

Any reason to not submit a bug report? I suspect IX would want to find the cause and fix it, given that everything is fixing to change with Electric Eel and more folks might be wanting to try it out or use the VM of Eel to migrate their stuff and test it before live updating.

Was working on it

https://ixsystems.atlassian.net/browse/NAS-129406


I tried… but I got shot down without them even reading past the first two lines.

Which is especially annoying when @HoneyBadger was only extolling the virtues of VMs a few days ago and the linked blog post emphasises how much their team use them too.

I was happy to help work to find the bug that's obviously been introduced in the last update, but it's frustrating when it's ignored. :man_shrugging:


Maybe Stux will have better success, after all, he's an MVP!

Didn't stop my other two regression reports being shot down with excuses

Reports of 24.04.1.1 having unstable guest networking and hanging on boot are building… check Reddit too.

I guess eventually there will be a critical mass and then a fire drill.

That's usually the way it works.

Both issues are interesting and we are trying to keep tabs on how often / where they are reported. As always, details matter, so let's gather up as much as we can and see if anything stands out as an important clue. We did investigate and have a resolution to the one issue with super long boot times, but it is likely unrelated:

https://ixsystems.atlassian.net/browse/NAS-128561

I'm wondering if the kernel update in 24.04.1 (to fix SATA port multipliers) broke something. We've just merged in a further kernel update that will be in Dragonfish nightly images soon (tomorrow). Would those with a reproduction case be willing to give it a whirl to see if behavior changes?

Also have to check if the LRU memory change contributed to this somehow. Long shot, but worth looking :slight_smile:

We tend to use XCP-NG and VMware for our TrueNAS VM testing, and I'll note personally that I've got several 24.04.1.1 VMs going without issue.

It's possible that there's a KVM-specific problem here, possibly introduced as a result of the kernel changes in the .1 release.


My concern is we have a lot of users who may want to actually test Eel before deploying. Due to getting rid of Kubernetes, probably more than any other release. I sure will, and I never tested before. So, if this remains a problem on Eel, that may be a barrier to adoption of Eel. There is no way I will ever migrate to Eel if I can't get a test system up and running to potentially modify my stuff. And I am not buying another server just to do that, and many home users will not either.


Hmm. Fresh install of 24.04.1.1 did its first boot clean, but subsequently hung up on middlewared load. Host is a 24.04.1.1 machine as well.


Great, since we have repro cases now I expect we'll make some forward progress on this soon!


Yes. I noticed a few potentially interesting commits in the delta.

Was about to try bisecting kernels backwards.

I figured I could use nightlies

But the nightlies don't track all the kernels :slight_smile:

So, if TrueNAS is an appliance… in order to bisect kernel bugs from upstream we need to be able to:

  1. drop in test kernels
  2. have test kernels to test.

So, what I'm getting at is we need not just a nightly train but a kernel bleeding edge train too.

Is it possible to have an automated "the nightly but with a bleeding edge kernel" build?
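To make the bisecting idea concrete, here is a rough sketch of the search itself. Everything in it is hypothetical: the kernel labels `k0`..`k7` stand in for whatever test kernels would exist, and `boots_ok` stands in for the manual step of installing a kernel and seeing whether the VM gets past boot. It assumes the list is ordered oldest to newest, the oldest boots and the newest hangs.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], boots_ok: Callable[[str], bool]) -> str:
    """Binary-search for the first version that fails to boot.

    Assumes versions are ordered oldest to newest, versions[0] is
    known good and versions[-1] is known bad.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if boots_ok(versions[mid]):
            lo = mid   # regression is later than mid
        else:
            hi = mid   # regression is at or before mid
    return versions[hi]


if __name__ == "__main__":
    # Hypothetical test kernels; in practice each check is a manual boot test.
    kernels = [f"k{n}" for n in range(8)]

    def simulated(kernel: str) -> bool:
        # Stand-in predicate: pretend the regression landed in k5.
        return int(kernel[1:]) < 5

    print(first_bad(kernels, simulated))
```

With eight candidates this needs three boot tests rather than six, which is the whole appeal when every test means an install and a reboot.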

I don't think it's the lru_gen disabling, but it might be the swap disabling

Iirc, memory usage can spike when importing a pool

I will as long as I can safely switch back to my previous boot environment. Which I think I should be able to.

Ok, looks like a nightly image built which should have the newer 6.6.32 kernel now, updated from 6.6.29:

ISO Image:
https://download.sys.truenas.net/truenas-scale-dragonfish-nightly/TrueNAS-SCALE-24.04.2-MASTER-20240607-013916.iso

Manual Update File:
https://update.sys.truenas.net/scale/TrueNAS-SCALE-Dragonfish-Nightlies/TrueNAS-SCALE-24.04.2-MASTER-20240607-013916.update

These can both be used for testing to see if the issue has been resolved or behaves in any new manner.

EDIT: Had wrong links by mistake, updated to the proper link with 6.6.32.
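Since the links were swapped once already, a quick way to double-check which nightly a link points at is to pull the version and build stamp out of the file name. This is only a sketch: the helper name `parse_nightly` is made up, and reading `20240607-013916` as a `YYYYMMDD-HHMMSS` build time is an assumption based on how the name looks, not something documented here.

```python
import re
from datetime import datetime

# Pattern inferred from the two file names posted above (an assumption).
NIGHTLY_RE = re.compile(
    r"TrueNAS-SCALE-(?P<version>\d+(?:\.\d+)+)-MASTER-"
    r"(?P<stamp>\d{8}-\d{6})\.(?P<kind>iso|update)$"
)


def parse_nightly(url: str):
    """Return (version, build_time, kind) parsed from a nightly URL."""
    m = NIGHTLY_RE.search(url)
    if m is None:
        raise ValueError(f"not a recognised nightly file name: {url}")
    # Assumed layout of the stamp: date, dash, time of day.
    built = datetime.strptime(m["stamp"], "%Y%m%d-%H%M%S")
    return m["version"], built, m["kind"]


if __name__ == "__main__":
    url = (
        "https://update.sys.truenas.net/scale/TrueNAS-SCALE-Dragonfish-Nightlies/"
        "TrueNAS-SCALE-24.04.2-MASTER-20240607-013916.update"
    )
    print(parse_nightly(url))
```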


Success with the same config (24.04.1.1 host) and the nightly (24.04.2-20240607)


I've been able to successfully boot a clean install of the nightly 6x in a row…

:champagne:

will test further


I've just been able to test: the manual update works fine, so that looks like a fix :+1: Thanks Kris.
