Boot TrueNAS Scale without zPool (kernel panic)

Hello,

I know this was possible under Core via the Advanced options in GRUB, but how can I force my TrueNAS Scale install to boot without trying to import or mount the zPool?

I have some NVMe disks that have failed, which not only causes the pool to be degraded/offline but also makes the machine hang while booting, resulting in a kernel panic.

I would like to get into the TrueNAS install and then try to import the pool read-only with a combination of disks, to recover as much data as possible.
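Roughly what I have in mind once the system comes up again is something like this (a read-only import, with -N so the datasets are not mounted straight away; zPool1 is my data pool):

zpool import -o readonly=on -N zPool1
zpool status -v zPool1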

It was a 4-disk pool consisting of a stripe of 2 mirrored vdevs, so normally 1 disk per vdev could fail. I guess I got lucky (sarcastically speaking) and two disks in the same vdev failed.

I learned my lesson about using consumer-grade disks… I will replace them with either SATA/SAS enterprise SSDs with power-loss protection, plus an NVMe LOG vdev (U.2 or U.3) and an NVMe cache, or outright 4 enterprise NVMe disks without a LOG.

I was using this appliance as a SAN for my VMs, with the storage mounted via iSCSI to a VMware cluster over 2× 40 Gbit NICs…

Any other ideas/suggestions regarding the data recovery are also welcome.

EDIT: I know it is the disks, because I am able to boot into TrueNAS with only the 2 good disks installed. I also verified the slots; it really is the disks.

I’m not sure why this would help, but

BUT, losing both drives of the same vdev in a striped-mirror configuration means total loss of data, including the data that is on the other drives.

You might have some luck recovering some data with https://www.klennet.com/ (USD 399 for recovery, free to just see if there’s even a chance).

Many thanks for the suggestions. I was hoping to recover one of the drives, just long enough that I could copy the important data or replicate the pool.

I did have replication to another machine in a different location, but it seems it stopped working a few months back for some reason and I had not noticed.

I have backups of the VMs etc. from 6 months ago, so in the worst case I’ll just go that route and update the needed config, but I was hoping to get a bit closer to the last config they were running…

It doesn’t seem to be worth the hassle, especially as I can perfectly rebuild the VMs; it’s just a lot of work.

OK, I got the system booted by finding the bad drive, removing it from the system, booting without that drive, exporting the pool, and then rebooting with the less bad of the two bad drives (if that makes sense) inserted.

Then I proceeded to import the pool read-only, which allowed me to do some troubleshooting.
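In commands, the sequence was roughly the following; a sketch from memory rather than the exact shell history:

# with only the two good disks attached
zpool export zPool1
# power off, insert the less bad of the two failed drives, boot again
zpool import -o readonly=on zPool1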

My suspicion was right: 2 disks of the same mirror failed.

root@truenas[/home/admin]# zpool status
pool: boot-pool
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 00:00:36 with 0 errors on Sat Sep 7 03:45:37 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sda3      ONLINE       0     0     0

errors: No known data errors

pool: zPool1
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:06:58 with 0 errors on Thu Aug 8 06:41:54 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    zPool1                                    DEGRADED     0     0     0
      mirror-0                                DEGRADED     0     0     0
        aca08f45-06bd-4c3c-957f-1e045967d7e6  ONLINE       0     0     0
        f0f3391f-394c-4975-9d13-8558b44024c3  REMOVED      0     0     0
      mirror-1                                ONLINE       0     0     0
        42a8f2ee-7838-4a43-a551-e169f3f08eff  ONLINE       0     0     0
        1f6b885c-59fd-47b0-8572-ec8c1eccf6cf  ONLINE       0     0     0
    logs
      6eda03f6-2298-41ed-a96e-bdb387760130    ONLINE       0     0     0

errors: 8 data errors, use '-v' for a list

The REMOVED one makes the system hang even without importing the pool; the one that is marked as ONLINE gives a kernel panic as soon as I import the pool as RW.
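While the pool is imported read-only, something like this should at least list which files the 8 data errors landed on, and the kernel log should show whether the remaining bad disk keeps throwing NVMe errors while I copy things off:

zpool status -v zPool1
dmesg | grep -i nvme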

My idea is to order an identical disk, insert only those 2 into a system, and dd the bad one onto the new one. Hopefully, with 3 disks, I should then be able to re-import the pool without issues…
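A sketch of what I mean, with example device nodes only (double-check with lsblk before running anything); ddrescue is probably safer than plain dd for a disk that is already throwing read errors, since it skips bad areas and retries them later:

dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=1M conv=noerror,sync status=progress
# or, likely the better tool for a failing source disk:
ddrescue -f /dev/nvme0n1 /dev/nvme1n1 rescue.map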

I will then proceed to buy some better disks and replicate the whole zpool to a spare chassis I have lying around, or just take a backup to another NAS and restore it onto the new disks…
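Assuming the pool can be imported writable again after the clone, the replication side is basically a recursive snapshot plus zfs send/receive; the target host and pool names here are just placeholders:

zfs snapshot -r zPool1@rescue
zfs send -R zPool1@rescue | ssh backup-nas zfs recv -F sparePool/zPool1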

I also found that my particular disks, the Lexar NM790, have a lot of issues with Linux that were apparently only fixed in kernel 6.10. It is really a garbage-quality SSD… I had barely 50 TBW on the most-used disks.