zfs pool cannot be imported after reboot

I rebooted my TrueNAS Scale VM and upon rebooting, my zfs pool cannot be imported any longer.

The pool is called tank and consists of 3x mirrors. All 4 disks are connected and recognized. They are connected via an LSI HBA in IT mode with exclusive PCI passthrough.

However, they don’t show the usual Apple ZFS partition.

zpool list does not show the pool.

zpool import -f -m 15783679572798201752 or
zpool import -f -m tank yields cannot import 'tank': no such pool available.

Running sudo zpool import -d /dev/disk/by-id tank yields cannot import 'tank': pool was previously in use from another system.

Using -f gives cannot import 'tank': one or more devices is currently unavailable. -m has no effect.

-o readonly=on does not work.

zdb largely shows correct output, except for two disks:

➜  by-partuuid sudo zdb -l /dev/sdd
failed to unpack label 0
failed to unpack label 1
------------------------------------
LABEL 2 (Bad label cksum)
------------------------------------
    version: 5000
    name: 'tank'
    state: 0
    txg: 11069250
    pool_guid: 15783679572798201752
    errata: 0
    hostid: 1828591293
    hostname: 'nas'
    top_guid: 7507722210272055471
    guid: 14463960403990065330
    vdev_children: 3
    vdev_tree:
        type: 'mirror'
        id: 1
        guid: 7507722210272055471
        metaslab_array: 73
        metaslab_shift: 34
        ashift: 12
        asize: 11997984063488
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 14463960403990065330
            path: '/dev/disk/by-partuuid/1c613add-a83d-411c-ad43-cb812423404f'
            whole_disk: 0
            DTL: 990
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 6570285282307898179
            path: '/dev/disk/by-partuuid/509ec09f-fee6-4ee1-aa77-62ef99ea855d'
            whole_disk: 0
            DTL: 400
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
        com.klarasystems:vdev_zaps_v2
    labels = 2

These partuuids do not exist on the system. Manually symlinking them to the block devices had no effect.

The 2nd disk seems corrupted:

sudo zdb -l /dev/sde
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3

I am getting mildly desperate here.

Three mirror vDevs should be at least 6 disks, not 4. This doesn’t make any sense.

Please give a detailed description of what you expect the vDev configuration to be for this pool. Data vdevs? Log vDevs? (And if it doesn’t have an SLOG vdev, why did you attempt -m?)

Please run the following commands and post the output here (with the output of each command in a separate </> box):

  • lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
  • lspci
  • sudo sas2flash -list
  • sudo sas3flash -list
  • sudo zpool import

Oh boy, this looks like another data loss issue caused by Proxmox importing your drives before/while TrueNAS also tried to import them. There have been a slew of similar posts this fall.

You would have avoided this by blacklisting your HBA’s kernel driver in Proxmox, preventing the host from attaching to the card and reading the disks.
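For anyone finding this thread later, a minimal sketch of what that blacklisting looks like (the file name is hypothetical; the mpt2sas/mpt3sas driver names cover common LSI SAS2/SAS3 HBAs, and newer kernels fold mpt2sas into mpt3sas - check which driver your card uses with `lspci -k`):

```
# /etc/modprobe.d/blacklist-hba.conf  (hypothetical file name)
# Stop the Proxmox host kernel from binding a driver to the LSI HBA,
# so only the TrueNAS guest ever touches these disks.
blacklist mpt3sas
blacklist mpt2sas
```

After editing, run `update-initramfs -u` and reboot so the change applies before the host can scan the disks.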

We need to know what VM Hypervisor software you are using, and version.

Proxmox?
VMWare?
Other, if so what?

There are lots of gotchas that, if not set up right, can cause data loss.

Ah yes - I failed to spot the critical 2 letters “V” and “M” in close succession.

There needs to be a warning given during installation and boot when TrueNAS is being installed or run virtually and a BIG WARNING IN FLASHING NEON RED LETTERS if it is detected that the controller and / or disks are not passed through in the way needed.

Here is my feature request that only received one vote (mine).

I just upvoted it.

Thank you all for the responses. I posted this relatively late, so apologies for the missing info.

Yes, it’s running in Proxmox. The HBA is passed through. This setup has been running fine for several years.

And it’s of course 6 disks, not 4, with a single data vdev (yes, with mixed capacity).

admin@nas[~]$ lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME   MODEL                 ROTA PTTYPE TYPE      START           SIZE PARTTYPENAME             PARTUUID
sda    QEMU HARDDISK            0 gpt    disk               68719476736                          
├─sda1                          0 gpt    part       4096        1048576 BIOS boot                c6d2b0af-a2e3-4717-834d-36027df7172c
├─sda2                          0 gpt    part       6144      536870912 EFI System               21aafa5d-b906-45bd-8292-d435d7b098aa
└─sda3                          0 gpt    part    1054720    68179443200 Solaris /usr & Apple ZFS 6d8316da-2dc6-4a67-b27a-8a8b97f12a5b
sdb    WDC WD40EFZX-68AWUN0     1        disk             4000787030016                          
└─sdb4                          1        part 1382079942    39726268928                          
sdc    WDC WD120EDBZ-11B1HA0    1        disk            12000138625024                          
└─sdc4                          1        part 1382079942    39726268928                          
sdd    WDC WD120EDBZ-11B1HA0    1        disk            12000138625024                          
└─sdd4                          1        part 1382079942    39726268928                          
sde    WDC WD40EFZX-68AWUN0     1        disk             4000787030016                          
└─sde4                          1        part 1382079942    39726268928                          
sdf    WDC WD60EFRX-68L0BN1     1        disk             6001175126016                          
└─sdf4                          1        part 1382079942    39726268928                          
sdg    WDC WD60EFRX-68L0BN1     1        disk             6001175126016                          
└─sdg4                          1        part 1382079942    39726268928                          
sr0    QEMU DVD-ROM             1        rom                 1073741312 
admin@nas[~]$ sudo sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        Adapter Selected is a LSI SAS: SAS2008(B2)   

        Controller Number              : 0
        Controller                     : SAS2008(B2)   
        PCI Address                    : 00:00:10:00
        SAS Address                    : 500605b-0-056e-6dd0
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : 07.27.01.01
        FCODE Version                  : N/A
        Board Name                     : SAS9201-8i
        Board Assembly                 : H3-25268-00D
        Board Tracer Number            : SP24531099

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
admin@nas[~]$ sudo sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
admin@nas[~]$ sudo zpool import
no pools available to import

  1. You have a 6x mirror i.e. the size of a single drive? Or do you mean RAIDZ1 5x the size of a single drive? Or something else?

    When I ask for “a detailed description of what you expect the vDev configuration to be for this pool”, I am clearly asking for detail, not an (apparently still incorrect) summary.

    If you don’t have precise details, I suspect that recovery will be impossible.

  2. The lsblk output looks very odd:

    • the partitions are not marked as Solaris /usr & Apple ZFS as I would expect them to be (like the boot pool partition is marked).
    • the partitions don’t have PARTUUIDs - so the UUIDs in the labels won’t match up to a partition label.
    • the partitions do not start at the usual 2048-sector offset but at sector 1,382,079,942, which is hundreds of GiB into each disk (lsblk’s START column is in 512-byte sectors, not bytes).
    • the partitions are NOT the full size of the disks - the disk sizes are correct (4TB, 6TB, 12TB) but the partition sizes are all 39.7GB.

    So it looks to me like the partition tables have been overwritten, the partition sizes, types and UUIDs are wrong, and who knows what the likely consequences for the ZFS on-disk data might be.

    If Proxmox tried to mount the pool too, then it is likely that the metadata blocks are also corrupt and possibly that some data blocks have been overwritten by metadata.
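A quick sanity check of where those partitions actually sit, using the numbers from the lsblk output above (lsblk -b prints SIZE in bytes but START in 512-byte sectors):

```shell
# START from lsblk is in 512-byte sectors; convert to bytes and GiB.
start_sectors=1382079942          # START of sdb4/sdc4/... above
offset_bytes=$((start_sectors * 512))
offset_gib=$((offset_bytes / 1024 / 1024 / 1024))
echo "partitions start ${offset_bytes} bytes (~${offset_gib} GiB) into each disk"
```

So even on the 4TB drives, the mystery 39.7GB partitions begin roughly 660GiB in, which is nowhere near a normal ZFS partition layout.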

At this point, given the multiple types of corruption and the almost complete lack of detail, I think it will likely be impossible to recover the pool and you need to cut your losses and start to recover from the data loss i.e. fix the Proxmox definitions, clean the drives, recreate the pool from scratch and restore from backups.

Put simply, if you failed to blacklist your PCIe card’s driver in Proxmox, your luck has lasted several years longer than might have been expected.

As with most IT, the technology is only as strong as the weakest link. Adding complexity (like virtualising TrueNAS under Proxmox instead of running it natively) adds more links to the chain, and anything you configure incorrectly is a weak link. If it had gone wrong at the start, you might have corrected it; but when it works at the start, you end up with an unsupported configuration whose error-prevention and recovery code has never been exercised during normal operation, and consequently if or when it does go wrong, it is much more likely to go very, very wrong.

This is why the community recommends bare-metal TrueNAS installs (or TrueNAS’s own virtualisation if it is good enough for your needs), and warns that running TrueNAS under Proxmox can be done successfully but needs careful configuration to avoid data loss.

I appreciate the help and warnings here, thank you.

You have a 6x mirror i.e. the size of a single drive? Or do you mean RAIDZ1 5x the size of a single drive? Or something else?

A single data vdev consisting of 3 mirrors of 2x4TB, 2x6TB, and 2x12TB respectively. This server is re-using old drives and I’m in the process of dedicating it as a backup machine.

recreate the pool from scratch and restore from backups.

Can do. I can live with that, as long as I can avoid this in the future and there’s nothing I can do here.

Put simply, if you failed to blacklist your PCIe card in Proxmox, your luck has lasted several years longer than it might have expected to last.

Just to recap my understanding here - during a restart, Proxmox somehow took control of the HBA rather than passing it through to the VM, and something messed with the partition layout? I’m not even aware what that could have been - the pool has never been imported anywhere but this VM.

I understand the need for exclusive access to the disks for zfs, but I’m not aware of anything in Proxmox actively doing anything to a device that isn’t used elsewhere.

This is why the community recommends bare-metal TrueNAS installs

I’m aware, as well as I’m aware of the “appliance” mantra, but that sometimes simply isn’t realistic. It changes the required hardware for a setup like this from “I can run a small homelab on hardware I have and make TrueNAS a part of it” to “I need dedicated machines, space, and probably a new HVAC run”. Following other best practices like mirrors instead of RAIDZ1, having a backup instance, and having cold storage already make this a very expensive hobby.

As I’ve outlined earlier, I’m making this particular server a dedicated backup box, and once I’ve got my new server set up, I can run TrueNAS directly on it (this time without a mixed-capacity vdev). However, my new primary NAS server is going to be virtualized again. I will make sure to blacklist the disks and/or controller on the host when I get there.

I admit this is veering off-topic, but I don’t recall seeing reports like this here a year ago.
I feel as if something changed, and while I can’t point the finger squarely at Proxmox, they are absolutely part of the discussion.

That sounds more like 3 VDEVs consisting of mirrored pairs of 4TB’s, 6TB’s and 12TB’s…

Something like that.

Proxmox has a “feature” to automatically scan for and import ZFS pools. This is on by default.
The problem arises when the VM later also goes to import the pool, as it has been set to do. Multiple hosts/clients importing and using the same pool at the same time is a huge honking NO-NO and corruption is a likely outcome.

No - absolutely not. A single pool containing 3x vDevs each of which is a mirror of 2 drives 2x4TB, 2x6TB, 2x12TB.

If Proxmox finds a ZFS pool when it boots, it can automatically import it. The reason you blacklist the HBA is to prevent Proxmox from using it, finding the disks and importing them.

If the disks get imported on both Proxmox and TrueNAS at the same time, corruption can and almost certainly will occur.

That said, I am not sure that the symptoms you have would result from that.

Using only mirrors regardless of your use case is NOT best practice. (But there is an idiot on Reddit who keeps advising people incorrectly that RAIDZ doesn’t perform well (for any use case) and that they should only use mirrors.)

Simple rules of thumb:

  • Active data, especially zVolumes and iSCSI and databases, all do random I/O and should use synchronous writes on mirrored drives, ideally SSDs, or, if the data is too large, on HDDs with an SSD SLOG.

  • Inactive, rarely accessed data and large sequential files should use asynchronous writes on RAIDZ, typically HDD. These do NOT need an SLOG, which would do nothing for them.

  • Specialised types of vDev other than SLOG (e.g. special allocation aka metadata, dedup, L2ARC) are only needed in extremely large servers.

You mean a single pool consisting of three vdevs, each of them being a mirror, striped together.

There is nothing wrong with a raidz1 or raidz2 as such. It’s just that in your case you’d lose a lot of capacity due to the differently sized disks.
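To put rough numbers on that (disk sizes from this thread, in TB, ignoring ZFS overhead; a RAIDZ vdev is limited by its smallest member):

```shell
# Usable capacity, roughly: three mirrored pairs keep one disk's worth
# per pair, while a 6-wide RAIDZ1 of mixed sizes gives (N-1) x smallest.
mirrors_tb=$((4 + 6 + 12))
raidz1_tb=$(( (6 - 1) * 4 ))
echo "3x mirrors:    ${mirrors_tb} TB usable"
echo "6-wide RAIDZ1: ${raidz1_tb} TB usable"
```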

Edit: too slow

Yes, I suppose that’s what I meant. Apologies.

I’m still hoping somebody has a theory as to the actual root cause here.

From what I’ve gathered online (and Reddit was certainly a source), the consensus for home use seemed to be that RAIDZ1 is going to be slower (whether noticeably or not remains to be seen) and that resilvering especially can take ages, which sounded like a scary prospect in case of a drive failure. Maybe I’ll move to RAIDZ1 if I ever get a third disk for the new server, but the only advantage seemed to be better storage utilization if one had 3+ identical (or close to identical) capacity drives, which I simply don’t.

My comment around best practices was more about sticking to 3-2-1(-ish) backups plus redundant disks on top of that, which is not cheap.

  1. There are two measures of performance - IOPS and throughput.

    IOPS is relevant for frequent multiple small random I/Os - like zVolumes, iSCSI and databases.

    Throughput is most relevant for sequential access to medium to large files.

    But these measures are only really relevant when you are running either measure at high utilisation. For low usage systems, neither is that critical because e.g. network speed can easily be the limiting factor.

  2. Mirrors are best for IOPS and read throughput, pretty good for write throughput BUT expensive. For mirroring, pairs of drives should be the same size.

    RAIDZ is good for read and write throughput but very poor for IOPS, but is much cheaper for the same level of redundancy once you get to 4+ drives. (RAIDZ can actually have better write throughput than mirrors.)

    RAIDZ IOPS is poor - so it should only be used for rarely accessed sequential data. Mirror IOPS is good, so mirrors should be used for active data and random reads (zVols, iSCSI, databases, app files).

  3. Sequential reads benefit from ZFS pre-fetch - often when you make a second and subsequent read request, the data has already been read into memory.

HOWEVER…

None of this is worth anything if you are doing synchronous writes when you don’t need to.

AND UNFORTUNATELY…

Reddit has a higher than average quantity of people giving incorrect or bad advice - including some people who state bluntly that RAIDZ performance is bad and you should never use RAIDZ and always use mirrors regardless of the cost - and that is just wrong and terrible advice.
