TrueNAS Scale - Pool randomly corrupted after 24.10.1 update

Hello,

I am a very novice user when it comes to NAS/self-hosted builds and have been running a Scale setup for about 4 months (details at bottom). Last night I updated to 24.10.1 after upgrading my router, as a first step toward some other upgrades. The update seems to have caused some issues with my apps, as I had to reconfigure my Plex server because it was no longer usable.

I was in another room streaming a video off my NAS when it suddenly stopped and I could not get it to reload. When I logged into my VM, it was stuck in a boot loop ending at “Job ix-zfs.service/start running 50s / 15min 34s”. I can hear the drives working, then they suddenly halt, and the system reboots after a brief hang.

This setup has been working for a few months now without any issues. I did recently replace a drive, but it resilvered fine and all was healthy.

I spun up another VM with a fresh install of 24.10.1. It booted fine and I passed the drives through without issue. When I attempt to import my drive pool, the import starts, but after 10 seconds or so I can see the VM reboot in the Proxmox CLI and the web UI disconnects until everything comes back up.

I don’t want to mess with anything and lose the pool. I do have a backup, but it’s not the most recent, and there are personal items in the pool that I have been organizing and haven’t had a chance to back up again.

Does anyone have any ideas? I have done some searching and see things about rolling back, but cannot seem to get anything going when I boot to the advanced options GRUB menu. I’ve actually had no luck entering any usable commands within that terminal outside of “help” :rofl:

I figured that an entirely new instance of TrueNAS Scale would resolve any configuration issues, but now it seems as if the issue lies within the drive pool and I have no idea what could have happened or how to fix it.

Build Details:
Motherboard: ASUS PRIME B760M-A AX LGA 1700
Processor: Intel Core i5-12600K
RAM: Kingston FURY Beast RGB 64GB KF552C40BBAK2-64
Data Drives: 8x WD Ultrastar DC HC530 14TB SATA 6Gb/s
Host Bus Adapter: LSI SAS 9300-16i in IT mode
Drive Pool Configuration: RAIDZ1
Machine OS: Proxmox VE 8.3.2
NAS OS: TrueNAS Scale 24.10.1

Have you passed through the HBA to the VM?

Have you blacklisted the HBA in Proxmox?

If not, then it is quite likely that Proxmox is importing the same pool at the same time as TrueNAS, leading to pool corruption.
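For context, “blacklisting” just means telling the Proxmox host kernel never to load the driver for that HBA, so only the TrueNAS VM can touch the disks. Assuming the 9300-16i is using the usual mpt3sas driver, the host-side config is roughly this (run on the Proxmox host, not inside TrueNAS):

echo "blacklist mpt3sas" > /etc/modprobe.d/blacklist-hba.conf   # stop the host from binding the SAS driver
update-initramfs -u -k all                                      # rebuild the initramfs so the blacklist applies at boot
reboot

After the reboot, ‘lspci -k’ on the host should show the HBA with no kernel driver in use (or only vfio-pci once it is passed through).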

The HBA has been passed through, along with the drives, for months without issue.

I do not know what you mean about blacklisting the HBA in Proxmox.

The pool was made within TrueNAS only - I do not have any pools created within Proxmox.

‘zpool import’ shows the pool, along with all disks, as ‘ONLINE’.
(I cannot upload media - maybe a new member limitation?)

I was able to remove the passed-through drives from the Proxmox > TrueNAS VM and successfully boot into my original TrueNAS instance. I could see my pool, and the drives were shown as ‘exported’.

With TrueNAS running, I passed the drives back, and from the TrueNAS CLI within Proxmox I ran ‘zpool import’, which showed all of the drives and the pool as ‘ONLINE’.

When I ran ‘zpool import [POOL NAME]’ it started the import process, halted, and rebooted.

Do you get any sort of stack trace in the console environment just before reboot?

Sorry for the bump updates. I am actively looking for a solution by reading through forums and pages.

I came across this post on the TrueNAS forums ( pool-cant-be-imported-after-exported-via-export-disconnect-solved.111475/ ) and wondered about post #4, which states:

“Did you use PCIe passthrough in ESXi to give an entire HBA/controller to TrueNAS to which the disks were connected or did you pass the disks to the VM? If the latter this perfectly explains why you cannot import the pool. Boot ESXi, reconstruct your VM settings, import the pool, copy all your data, build the new system without ESXi and create a new pool, then copy back your data.”

I only passed the disks using /sbin/qm set [VM #] -virtio[drive #] /dev/disk/by-id/[drive ID] and not by passing the entire HBA card.
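For what it’s worth, from what I have read, passing the whole HBA instead of individual disks would look something like this (the PCI address and VM ID below are just examples, not my actual values):

lspci | grep -i lsi                   # find the HBA’s PCI address on the Proxmox host, e.g. 01:00.0
qm set 100 -hostpci0 0000:01:00.0     # 100 = VM ID; hands the entire controller (and every disk on it) to the VM

rather than the per-disk -virtio mapping I used.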

I sure wish I was allowed to upload images. I am not sure what you mean by “stack trace”.

I see a bunch of initialization stuff:
Mounted /boot/grub
Load kernel modules (several lines)
ZFS Pool Import Target
Wait for ZFS Volume (zvol) links in /dev
ZFS Volumes are ready
Mount ZFS Filesystems
TrueNAS Middleware
Sync Disk Cache Table
Generate TrueNAS /etc files
Setup TrueNAS network
then
Job ix-zfs.service/start running (XXs / XXmin XXs) is where it hangs.

If you do NOT blacklist the HBA inside Proxmox, then something (e.g. a Proxmox system update) can cause Proxmox to import and mount the same pool simultaneously with TrueNAS, and two systems simultaneously mounting and changing a pool does not end well.
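A quick way to check is to run these on the Proxmox host itself (not inside the TrueNAS VM):

zpool list              # should only list pools Proxmox itself owns (if any) - your data pool must not appear here
lsmod | grep mpt3sas    # if this prints anything, the host kernel has loaded the SAS HBA driver and can see the disks
zpool import            # lists pools the host can see but has not imported - ideally your pool should not show up here either

If the host can see the pool at all, blacklisting the driver (or passing through the whole HBA rather than individual disks) is the fix.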

See below - I’m not familiar at all with Proxmox, so no idea if this would make any meaningful difference:

Are we able to test an import on something like a bare metal install of TNS (i.e. booting from a USB just to confirm the pool can import)?

If it’s still crashing at that point it’s probably safe to assume there’s some sort of corruption and we’ll need to start looking at recovery options (potentially importing with zfs_recover=1 to ignore the panic entirely, though this should be a last resort).
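For reference (and only as that last resort), zfs_recover is a module tunable rather than an import flag - a rough sketch, with “poolname” as a placeholder:

echo 1 > /sys/module/zfs/parameters/zfs_recover    # tell ZFS to log and continue past certain fatal errors instead of panicking
zpool import -f -o readonly=on poolname            # then attempt the import, ideally read-only

But let’s see how the bare-metal test goes first.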

Well, I’m stumped and it seems like possibly not good news for me.

Installed TrueNAS on another M.2 and booted everything up. Fresh install without any additional configuration, apps, or anything like that.

I was hopeful because I could see the drives and the pool (shown as exported) within TrueNAS, but when I try to import, it does the same thing… runs for around 30 seconds to 1 minute, halts, and reboots the system.

When I enter ‘zpool status’ in the CLI, it does not show the pool at all, just like within my virtual TrueNAS.

When running ‘zpool import’ in the CLI, I am able to see the pool. The state shows as ‘ONLINE’ and the action field says ‘The pool can be imported using its name or numeric identifier.’

Everything shows ‘ONLINE’, from the pool name through the RAID configuration (raidz1-0) and all 8 drives.

However, this is the same result I had last night on the virtualized TrueNAS, and when I attempted to import at this point, it ended the same way as importing within the GUI - a moment of activity and then a sudden reboot.

I attempted it again just now to verify everything is the same, and it says:

mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
sd 0:0:3:0: Power-on or device reset occurred
sd 0:0:3:0: Power-on or device reset occurred
mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
sd 0:0:3:0: Power-on or device reset occurred
sd 0:0:3:0: Power-on or device reset occurred

There are numbers in brackets to the left of all of this - if it helps with troubleshooting, please let me know and I will retype this all again.

Now that the computer has reset, TrueNAS is failing to start and shows

Job middlewared.service/start running (XXs / Xmin XXs)
Job middlewared.service/start running (XXs / Xmin XXs)
sd 0:0:4:0: Power-on or device reset occurred
Job zfs-import-cache.service/start running (XXs / no limit)
mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
mpt3sas_cm0: log_info(0x31110d01): originator(PL), code(0x11), sub_code(0x0d01)
Job zfs-import-cache.service/start running (XXs / no limit)
sd 0:0:4:0: Power-on or device reset occurred
sd 0:0:4:0: Power-on or device reset occurred

Apologies for the delay in response, have been in the office today.
The fact you’re still hitting what seems like a panic as soon as you import the pool, even on a bare-metal fresh install, is not a great sign.

Do you have backups? And if not, do you have the capacity to create a backup of what you need in the event you have to rebuild the pool?
We can try importing read-only with recovery enabled, but obviously suppressing panics could cause even more issues.

I had a similar problem in a completely different hardware environment: an attempt to import a pool was resulting in a kernel panic (not a reboot). This started happening after I upgraded the TrueNAS Scale version and rebooted…

The only way forward for me was importing the pool read-only (-o readonly=on), copying everything over to a different TrueNAS Scale system, and redoing the pool…

Importing read-only takes only about 3 seconds (~60TB RAIDZ2) and always worked. I now believe the problems started because of a bad stick of RAM (I was using non-ECC memory)…

Yup, exactly what I was going to suggest.

I’ve seen this same behaviour a couple of times with metaslab/spacemap corruption - though without an actual stack trace it’s very difficult to pin down exactly what the root cause is (and the system seems to halt before ever spitting it out to console!).
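(If you did want to try to catch the trace, one option - just a sketch, and only if you’re comfortable with it - is the kernel’s netconsole module, which mirrors kernel messages over UDP to a second machine so you still see them even when the box resets immediately:

modprobe netconsole netconsole=@/,@192.168.1.50/    # on the crashing box; 192.168.1.50 is an example IP of another machine on the LAN
nc -u -l 6666                                       # on that other machine; listens on netconsole’s default UDP port (some netcat builds want ‘nc -u -l -p 6666’)

Then trigger the import and watch the listener.)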

General process should be as follows:
zpool import -Fn poolname (dry run of the rewind recovery - reports whether the pool can be made importable without actually changing anything)
zpool import -f -o readonly=on -R /mnt poolname (force the import read-only under /mnt; if this fails, it may be necessary to add -F for a rewind to the last good transaction group, i.e. -Ff)

Assuming no crash, copy important data off of the pool and rebuild.
Of course, if you don’t have the capacity to actually move that data to another pool, things become a bit more difficult…
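As a rough example of the copy-off step (pool, dataset, and destination names below are placeholders), once the read-only import under /mnt succeeds:

rsync -avh --progress /mnt/poolname/important-dataset/ /mnt/otherpool/rescue/    # a plain file copy works fine from a read-only pool

zfs send of an existing snapshot also works from a read-only pool, but you cannot create new snapshots on it, so rsync is usually the simpler route.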

On the old TrueNAS forum there was advice to boot from a live Ubuntu USB/CD, import/mount the pool, export it, then boot back into TrueNAS normally and try the import again. The idea being: Ubuntu uses a different kernel and it might help…

It never worked for me - importing in Ubuntu always failed (and did not work by pool name, only by id)…
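For anyone attempting the same thing: plain ‘zpool import’ prints an “id:” line for each pool it finds, and that number can be used in place of the name (the id below is just an example), optionally pointing the command at the by-id device links:

zpool import                                                    # note the numeric id in the output
zpool import -d /dev/disk/by-id -f -o readonly=on 1234567890    # import by id instead of by name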