Added NVMe drives, now can't boot

I get the error "alloc magic is broken at 0x25296fe0: 0" when I try to boot. If I remove the new drives, it boots fine again.

It seems to be renumbering the device paths: my boot device, previously nvme1n1, is now listed as nvme2n1.

This seems to be a GRUB issue, but I have no idea where to go from here.

Based on the very limited data you provided:

Use the BIOS to change the boot device order. You should also be able to select a boot device during the BIOS checks, probably with a function key or Esc.

Have you swapped the two NVMe drives?

What do I mean by limited data?

What new drives? NVMe, SSD, spinning rust? My crystal ball has a crack in it, so it is not reliable. Be very descriptive about what is happening. Do not assume we know what is going on.

Sorry, I guess I could have put that in the post and not just the title.

I added two NVMe drives.
When I boot, even when manually selecting my boot device (an M.2 NVMe SSD), I get that error.

With the new drives added, the PCIe mapping of the boot device changes. I confirmed this by booting from a TrueNAS installer USB stick and running lsblk.
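lsblk works for this. Another way to see which physical drive got which kernel name is /dev/disk/by-id, whose symlinks are named after each drive's model and serial number and therefore stay the same when the nvmeXnY numbering shifts. A minimal Python sketch, assuming a Linux system where udev populates /dev/disk/by-id; the function names nvme_disks and read_by_id are made up for this example:

```python
import os
import re

BY_ID_DIR = "/dev/disk/by-id"
# Partition links look like "<disk-id>-part1"; skip them to list whole disks only.
PARTITION_SUFFIX = re.compile(r"-part\d+$")


def nvme_disks(entries):
    """Map stable NVMe by-id names to the kernel name each one currently points at.

    `entries` is an iterable of (link_name, link_target) pairs, for example
    ("nvme-ExampleModel_SERIAL123", "../../nvme2n1").
    """
    mapping = {}
    for link_name, link_target in entries:
        if not link_name.startswith("nvme-") or PARTITION_SUFFIX.search(link_name):
            continue
        # The symlink target is relative ("../../nvme2n1"); keep only the device name.
        mapping[link_name] = os.path.basename(link_target)
    return mapping


def read_by_id(directory=BY_ID_DIR):
    """Yield (link_name, link_target) for every symlink in the by-id directory."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            yield name, os.readlink(path)


if __name__ == "__main__":
    if os.path.isdir(BY_ID_DIR):
        disks = nvme_disks(read_by_id())
        # Print sorted by kernel name so the current nvmeXnY order is easy to read.
        for stable_name, kernel_name in sorted(disks.items(), key=lambda kv: kv[1]):
            print(f"{kernel_name:<10} {stable_name}")
    else:
        print(f"{BY_ID_DIR} not found; is udev running?")
```

Running it once with and once without the new drives installed should show the boot drive's serial-based name unchanged while the nvmeXnY name it points at changes, which is the renumbering described above.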

Is your hardware as listed in your ‘Specs’ correct?

Yes. I am adding these two U.2 drives with a PCIe U.2 carrier card and using the motherboard's bifurcation.

With bifurcation disabled, I can boot, but I only see one of the new drives.
With bifurcation enabled and both drives installed, I get the alloc error and can't boot.
With bifurcation enabled and only one drive installed, in either slot, I can boot.
With bifurcation enabled, one drive in the carrier and the other connected to an M.2 adapter, I can boot.

I even reinstalled TrueNAS on my boot drive and restored my config with bifurcation enabled and both drives in the carrier, and I could see both drives during the install. I figured this would get past any kind of GRUB nonsense. However, as soon as I reboot, I am greeted with the same alloc error. At this point I can only guess it is some sort of motherboard or PCIe card issue, but the card is just a basic bifurcation card with no PLX switch.

At this point I am at a complete loss.

Is the BIOS up to date? I suspect so, but I need to ask.
Are you running TrueNAS on bare metal?

Specifically, what are the part numbers (make/model)?
Which slot are you using?

A few things to try. I have no idea if this will help, but reading about the error online makes it sound like a ZFS drive gone bad.

This is what I would do. I’m not saying some of these steps couldn’t be skipped, but again, this is what I would do. I like to eliminate causes and then start adding back hardware and features.

These are just troubleshooting steps, you can revert back to normal at any time:

  1. Back up your working TrueNAS configuration; you may need it later.
  2. Power Down.
  3. Disconnect your hard drives.
  4. Do not have your NVMe adapter or drives installed.
  5. If you have a spare drive that you can use as a boot drive, connect it. If not, use your original boot drive.
  6. Wipe the boot drive. You can use the Ultimate Boot CD (UBCD), which has several utilities on it, or you could wipe the drive in another computer. Wipe it, don’t just format it. Why? To ensure nothing is left behind to mess things up.
  7. Install a fresh copy of TrueNAS 24.10 to your wiped boot drive. NOTE: If you get an early failure, try 25.04.1, but you are using 24.04, so let’s start here.
  8. Ensure TrueNAS boots up without issue. You can configure the network manually at this point so you can use a web browser. Do not restore your config files.
  9. Does everything seem to be working, and do reboots work fine? If yes, continue. I don’t expect any problems; we are just verifying that the installation works.
  10. Power down and pull the power plug from the wall so nothing inside the computer is powered.
  11. With both NVMe drives installed in the adapter card, install the card into the computer.
  12. Plug in and power up the system.
  13. Is the system responding normally or do you have a GRUB error again?
  14. If you have a GRUB error, you now have a minimal system to troubleshoot with, and your main pool is not put at risk while you power up, power down, and reboot. You can add those drives back once you figure out this problem.
  15. If you do not have a GRUB error, run a SMART long test on each drive: smartctl -t long /dev/nvme0 and smartctl -t long /dev/nvme1. These typically take less than 20 minutes. Once they are done, check the results with smartctl -a /dev/nvme0 and smartctl -a /dev/nvme1, and make sure the extended test completed with no errors. That covers the drive health check.
  16. So far we have ruled out the NVMe drives, including their interface to the computer, as faulty.
  17. Create a pool with the two NVMe drives; this is just for testing.
  18. Once it is created, power down the system. Wait at least 20 seconds, then power up. Ensure TrueNAS boots without issues and the pool you created is there.
  19. Power down again.
  20. Now that you have confirmed you have a good system with the two NVMe drives, let’s move on to connecting the other drives in your system.
  21. Connect up your 24TB Toshiba drives. Power up.
  22. At each point where you change the hardware configuration, make sure the system boots. We are not importing any pools at this point. You have one pool and one TrueNAS OS running. We are looking for hardware conflicts now.
  23. If you find a problem, roll back the last change and verify that it was the cause.
  24. If the four drives cause the failure, power down, disconnect one drive, power up, and rinse and repeat until you find the offending drive. When you find it, move it to a different data cable/port to find out whether the fault is the drive or the HBA (motherboard).
  25. Keep adding drives until all drives are connected again. Is the system working?
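The result check in step 15 can be scripted once the long tests have finished. A minimal Python sketch that reads smartctl -a output and reports whether the most recent self-test was an Extended test that finished cleanly. It assumes the NVMe self-test log layout printed by recent smartmontools releases (a "Self-test Log" heading, rows numbered from 0 = newest, status text "Completed without error"); that layout, the function name extended_test_passed, and the /dev/nvme0 and /dev/nvme1 device names are assumptions, so compare against your real output before trusting it:

```python
import shutil
import subprocess


def extended_test_passed(smartctl_output):
    """Return True if the newest NVMe self-test log entry is an Extended test
    that completed without error.

    Assumes the log starts with a line beginning "Self-test Log" and that each
    entry row begins with its index, where 0 is the most recent test.
    """
    in_log = False
    for line in smartctl_output.splitlines():
        if line.startswith("Self-test Log"):
            in_log = True
            continue
        fields = line.split()
        # Only the newest entry (index 0) matters; header lines never start with "0".
        if in_log and fields and fields[0] == "0":
            return len(fields) > 1 and fields[1] == "Extended" and "Completed without error" in line
    return False


if __name__ == "__main__":
    # Assumption: the new drives are /dev/nvme0 and /dev/nvme1, as in step 15.
    devices = ["/dev/nvme0", "/dev/nvme1"]
    if shutil.which("smartctl") is None:
        print("smartctl not found")
    else:
        for dev in devices:
            result = subprocess.run(["smartctl", "-a", dev], capture_output=True, text=True)
            verdict = "passed" if extended_test_passed(result.stdout) else "check manually"
            print(f"{dev}: {verdict}")
```

If it prints "check manually", read the self-test section of smartctl -a yourself; the parser only recognizes the one layout it was written for.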

So these are the first steps to take. I hope you will find a hardware-related failure. That can point us in the right direction for more troubleshooting.
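The rinse-and-repeat in step 24 is a linear elimination search. It is sketched in Python below, with a hypothetical boots_ok callback standing in for each manual power-down, disconnect, power-up cycle; nothing here talks to real hardware, and find_offending_drive is a name made up for this example:

```python
def find_offending_drive(drives, boots_ok):
    """Disconnect drives one at a time (cumulatively) until the system boots.

    `boots_ok(connected)` is hypothetical: it stands for a manual power-up test
    with exactly the drives in `connected` attached, returning True on a clean boot.
    Returns the drive whose removal made the boot succeed, or None if the full
    set already boots or the system still fails with every drive removed.
    """
    connected = list(drives)
    if boots_ok(connected):
        return None  # No failure to isolate.
    while connected:
        suspect = connected.pop()  # Disconnect one more drive.
        if boots_ok(connected):
            return suspect  # Removing this drive cleared the fault.
    return None  # Fails even with all of these drives disconnected.
```

The returned drive is only the last trigger, not proof of a bad drive, which is why step 24 then moves the suspect to a different port. It will also blame a single drive for a fault that only appears with a particular combination, like the two U.2 drives behaving fine alone but not together, so confirm before replacing anything.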

Good hunting!
