FAULTED dRAID3: 13/14 Drives Up but Metadata Corrupted? Cannot Import

I moved drives from my Supermicro internal bays to a NetApp DS4246 disk shelf and added some new drives at the same time. After installing everything, some drives reported 0B. For some of them this turned out to be the SATA 3.3V power-disable issue; others simply didn't seem to like the disk shelf. I shuffled drives around to get as many working as possible, and I stupidly did some of this while the system was powered on, which may have caused problems. After finishing the migration I restarted the machine and my existing pool was offline. I then made multiple import attempts, which may have corrupted the pool metadata. Now 13 of the 14 original drives are healthy and readable, but the pool won't import.

I have older valid uberblocks on 3 drives from before the corruption happened. I’m hoping someone here has experience recovering dRAID pools in this state.

System Details

  • TrueNAS SCALE 24.10 (Electric Eel), OpenZFS 2.2.x
  • Also attempted recovery from Ubuntu Server 24.04 live USB with zfs-kmod-2.3.4
  • Supermicro chassis with internal LSI/mpt3sas HBA + NetApp DS4246 via LSI 9207-8e

Pool Layout

  • Type: dRAID3, 8 data disks per group, 14 children, 2 distributed spares (draid3:8d:14c:2s; see the sketch after this list)
  • SLOG: Samsung 990 Pro 2TB NVMe
  • Data: ~95 TiB used (93.73 TiB at last healthy state), mostly media files plus personal photos and family videos. Most of it is replaceable: the photos are largely also in Google Photos and the family videos are still being digitized, so losing the pool mostly means duplicated work.
  • Age: Pool has been running stable for about a year and a half prior to this incident
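
For reference, since the layout string matters for dRAID recovery discussions, a pool like this would have been created with roughly the following dRAID vdev spec (pool and device names here are placeholders, not my actual create command):

zpool create tank draid3:8d:14c:2s /dev/sd{a..n}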

Current Pool Status

13 of 14 data drives show ONLINE; 1 drive stopped working somewhere in the transfer. Both distributed spares were already INUSE from 2 earlier drive failures, which is part of the reason I wanted to add the new drives.

pool: General HDD Pool
  id: 1067285057000815225
state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
config:
    General HDD Pool                          FAULTED  corrupted data
      draid3:8d:14c:2s-1                      DEGRADED
        8a986844-b426-406c-a3c4-77bcc3d8f2d6  ONLINE
        1fb470ae-1ac3-4272-9743-4799ecd67e81  ONLINE
        72e4c576-337f-459d-842d-3ca7c2f9d6a3  ONLINE
        8e1ce784-3912-4ccc-bc8d-e41aace9f644  ONLINE
        draid3-1-1                            ONLINE
        57ae2ee9-9544-4ead-9144-f82df0df0280  ONLINE
        f07de474-5e79-4b22-9673-783319961e70  ONLINE
        db52d223-9bc3-49ec-9564-84f7f6343a56  ONLINE
        cea85f43-95f5-4415-8e9e-906bc471848d  ONLINE
        ff134e3b-6d20-4427-a3ae-9326cefe2cdb  UNAVAIL
        ce1be0cc-14c9-4db7-bed1-a10f9ecea31b  ONLINE
        d3c46001-701a-42a8-b0be-226ec3019309  ONLINE
        3b3e802c-2713-4562-8fa9-6fe041f1584b  ONLINE
        draid3-1-0                            ONLINE
    logs
      nvme-Samsung_SSD_990_PRO_2TB_S7L9NJ0Y102160H  ONLINE

What Happened (Detailed Timeline)

  1. March 8 afternoon: Began migrating drives from internal Supermicro chassis to NetApp DS4246 disk shelf to free internal bays for SSDs.
  2. March 8 evening: Discovered many drives showing 0B in the disk shelf. The HBA saw the interposers, but the SATA drives behind them were not communicating. Identified this as the SATA 3.3V power-disable issue (pin 3). Applied Kapton tape to pins 1-3, which fixed one drive but not the others.
  3. March 8-9: Shuffled drives between internal bays and shelf multiple times. Some drives only work internally, some work in both. Pool was importable earlier in this process (showed DEGRADED with fewer UNAVAIL drives).
  4. March 10 ~09:07-09:09 UTC: Multiple import attempts were made while CRC-erroring drives were on the bus. These may have written corrupted metadata/uberblocks to drives.
  5. March 10: One zpool import -f -o readonly=on attempt caused a system reboot.
  6. March 10 evening: Booted Ubuntu Server live USB (zfs-kmod-2.3.4) for a recovery attempt. A basic zpool import scan with relaxed kernel parameters hung for 45+ minutes.

Uberblock Situation

Most drives had their uberblocks overwritten during the failed import attempts on March 10. Only 3 drives retain older uberblocks.

Drives with overwritten uberblocks (all from March 10):

sdc1, sdd1, sde1, sdg1, sdq1, sdu1, sdv1, sdab1:
  txg range: 5861020 - 5861051
  timestamps: Tue Mar 10 09:07:20 - 09:09:59 2026

Drives with valid older uberblocks (from before corruption):

sdr1:  oldest txg = 5851160, timestamp = Sun Mar  8 16:28:59 2026
sdy1:  oldest txg = 5851160
sdaa1: oldest txg = 5851160

Label txg (from zdb -l): 5852223

The gap between old and new uberblocks is ~9,860 transaction groups (~14 hours of writes at 5 sec/txg).
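
To pick a -T rewind target I want the newest txg that all three of those drives still hold in their uberblock rings, not just the oldest, so the plan is to list the highest few on each (same approach as the survey loop further down):

for d in sdr1 sdy1 sdaa1; do
    echo "=== $d ==="
    sudo zdb -lu /dev/$d 2>/dev/null | grep "txg = " | grep -v -e checkpoint -e create | sort -t= -k2 -n | tail -3
done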

I have backed up the labels (first and last 512KB) from sdr1, sdy1, and sdaa1 to a separate SSD pool.

Drive Label Status

Partitioned drives (zdb -l /dev/sdX1 readable): sdc, sdd, sde, sdg, sdq, sdr, sdu, sdv, sdy, sdaa, sdab - all show pool name, correct pool GUID, dRAID vdev structure

Whole-disk drives (zdb -l /dev/sdX fails): sda, sdb, sdf, sdi, sdj, sdk, sdm, sdo, sds - zdb -l returns “failed to unpack label” on all 4 labels, BUT zpool import still sees these drives as ONLINE pool members
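
One read-only check I still intend to run on those (tell me if it's pointless): as far as I understand, ZFS normally puts a GPT even on "whole disk" members, so a member drive with no readable partition table at all is odd. This only prints whatever partition table sgdisk can see, without writing anything:

for d in sda sdb sdf sdi sdj sdk sdm sdo sds; do
    echo "=== $d ==="
    sudo sgdisk -p /dev/$d | head -n 12
done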

SLOG (nvme):

sudo zdb -l /dev/nvme1n1:
  name: 'General HDD Pool'
  txg: 5852223
  pool_guid: 1067285057000815225
  is_log: 1
  guid: 6619869925875120406

NVMe device numbering changed across reboots (was nvme1n1, became nvme2n1, then back). SLOG serial is S7L9NJ0Y102160H - verified via nvme list.

zdb Pool Config Output

sudo zdb -e -p /dev/disk/by-id "General HDD Pool":

vdev_children: 3
children[0]:
    type: 'hole'
    id: 0
    guid: 0
children[1]:
    type: 'draid'
    id: 1
    guid: 13538714567898865848
    nparity: 3
    draid_ndata: 8
    draid_nspares: 2
    draid_ngroups: 12
    metaslab_array: 128
    metaslab_shift: 34
    ashift: 12
    asize: 144000925302784
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 3651093111097855860
        whole_disk: 0
        not_present: 1
        DTL: 2857
        create_txg: 4

The hole vdev is probably from when I removed the previous SLOG and replaced it with a new SLOG (the NVMe drive that had previously been the L2ARC). This was done a day or two before the incident.

sudo zdb -e -C "General HDD Pool":
→ "can't open 'General HDD Pool': Input/output error"

Complete List of All Import Attempts

Phase 1: On TrueNAS SCALE

1. zpool status "General HDD Pool" → "cannot open: no such pool"
2. zpool import (scan) → Pool shown as UNAVAIL, insufficient replicas (many drives missing)
3. zpool import "General HDD Pool" → UNAVAIL, insufficient replicas
4. zpool import -f "General HDD Pool" → I/O error
5. zpool import -f -F "General HDD Pool" → I/O error
6. zpool import -f -m "General HDD Pool" → I/O error
7. zpool import -f -F -m "General HDD Pool" → I/O error
8. zpool import -f -o readonly=on "General HDD Pool" → Kernel panic / system reboot
9. zpool import -f -F -n "General HDD Pool" → Empty output (dry run, no error but no output)
10. zpool import -f -F -T 50 "General HDD Pool" → one or more devices is currently unavailable
11. zpool import -f -F -T 200 "General HDD Pool" → one or more devices is currently unavailable
12. zpool import -f -F -T 10000 "General HDD Pool" → one or more devices is currently unavailable
13. zpool import -f -F -T 15000 -d /dev/sdr1 -d /dev/sdy1 -d /dev/sdaa1 "General HDD Pool" → one or more devices is currently unavailable
14. zpool import -f -F -m -T 10000 "General HDD Pool" → one or more devices is currently unavailable
15. zpool import -f -F -m -o readonly=on -c /dev/disk/by-id "General HDD Pool" → failed to read cache file contents: invalid or missing cache file / no such pool available
16. zpool import -f -F -m -d /dev/disk/by-id "General HDD Pool" → I/O error

Phase 2: On TrueNAS SCALE - With Relaxed Kernel Parameters

echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_max_missing_tvds
17. zpool import -f -F -m -o readonly=on -T 5851160 "General HDD Pool" → I/O error
18. zpool import -f -F -m -o readonly=on "General HDD Pool" → Kernel panic / system reboot

Phase 3: On Ubuntu Server 24.04 Live USB - zfs-kmod-2.3.4

Same relaxed kernel parameters set and verified. ZFS version: zfs-2.2.2-0ubuntu9.4, zfs-kmod-2.3.4-1ubuntu2

19. zpool import (basic device scan) → Hung for 45+ minutes, had to be killed
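
An idea for the next live-USB attempt (not yet run): point the scan at a directory containing links to only the known-readable member partitions, so a single misbehaving device can't stall the whole scan. Device names below are from the TrueNAS boot and will likely differ under the live environment:

mkdir -p /tmp/pooldevs
for d in sdc1 sdd1 sde1 sdg1 sdq1 sdr1 sdu1 sdv1 sdy1 sdaa1 sdab1; do
    ln -sf /dev/$d /tmp/pooldevs/
done
sudo zpool import -d /tmp/pooldevs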

Complete Diagnostic Commands Run

# Drive enumeration (run many times as drives were shuffled)
sudo lsblk -d -o NAME,SIZE,SERIAL | sort -k2 -h
sudo lsscsi
sudo sas2ircu 0 display | grep -B2 -A15 "Device is a Hard"
sudo fdisk -l /dev/sda

# NVMe device mapping (changed across reboots)
sudo nvme list
ls -la /dev/disk/by-id/ | grep nvme

# SCSI bus rescans (after each drive change)
echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan
echo "- - -" | sudo tee /sys/class/scsi_host/host1/scan
echo "- - -" | sudo tee /sys/class/scsi_host/host11/scan

# Error checking (run many times)
sudo dmesg | grep -iE "error|CRC|FAILED"
sudo dmesg | grep -iE '(error|i/o|fail|reset|offline)'
sudo dmesg -C              # clear, then check again after the next import attempt

# SMART / drive health
sudo smartctl -a /dev/nvme1n1
sudo hdparm -I /dev/sdt

# Drive identification
sudo blkid /dev/sdy

# Label inspection - partitioned drives
sudo zdb -l /dev/sdc1         → readable, shows pool metadata
sudo zdb -l /dev/sdr1         → readable
sudo zdb -l /dev/nvme1n1      → readable, SLOG label with is_log=1

# Label inspection - whole-disk drives
sudo zdb -l /dev/sda           → "failed to unpack label" (x4)
sudo zdb -l /dev/sdb           → "failed to unpack label" (x4)

# Uberblock inspection
sudo zdb -lu /dev/sda          → empty (whole-disk, no partition)
sudo zdb -lu /dev/sdb          → empty
sudo zdb -lu /dev/sdc1         → uberblocks all from Mar 10 (5861020-5861051)
sudo zdb -lu /dev/sdr1         → older uberblocks from Mar 8 (txg 5851160)

# Searching for UNAVAIL drive GUID across all connected devices
sudo zdb -l /dev/sd{a..z} 2>&1 | grep -B5 "ff134e3b"       → not found
sudo zdb -l /dev/sda{a..e} 2>&1 | grep -B5 "ff134e3b"      → not found
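
# Idea, not yet run: the member names in the zpool import config look like
# partition UUIDs (SCALE uses by-partuuid paths for pool members), so this may
# be a more direct way to see whether the UNAVAIL drive is on the bus at all
ls -l /dev/disk/by-partuuid/ | grep -i ff134e3b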

# Uberblock survey across ALL pool member drives
for d in sda sdb sdc sdd sde sdf sdg sdi sdj sdk sdm sdo sdq sdr sds sdu sdv sdy sdaa sdab; do
    part="${d}1"
    if [ -b "/dev/$part" ]; then
        echo "=== $d (partitioned) ==="
        sudo zdb -lu /dev/$part 2>&1 | grep "txg = " | grep -v checkpoint | sort -t= -k2 -n | head -3
    else
        echo "=== $d (whole disk) ==="
        sudo zdb -lu /dev/$d 2>&1 | grep "txg = " | grep -v checkpoint | sort -t= -k2 -n | head -3
    fi
done
# Results:
#   sdc1, sdd1, sde1, sdg1, sdq1, sdu1, sdv1, sdab1: txg 5861020+ (Mar 10, corrupted)
#   sdr1, sdy1, sdaa1: txg 5851160 (Mar 8, valid)
#   sda, sdb, sdf, sdi, sdj, sdo, sds: empty (whole-disk, no readable uberblocks)
#   sdk, sdm: txg 0

# Pool metadata inspection
sudo zdb -e -p /dev/disk/by-id "General HDD Pool"    → printed config structure (see above)
sudo zdb -e -C "General HDD Pool"                     → "can't open: Input/output error"

# Label backup of 3 golden drives
for d in sdr1 sdy1 sdaa1; do
    sudo dd if=/dev/$d of=/mnt/SSD-Pool/${d}-label-start.bin bs=512K count=1
    SIZE=$(sudo blockdev --getsize64 /dev/$d)
    sudo dd if=/dev/$d of=/mnt/SSD-Pool/${d}-label-end.bin bs=512K count=1 skip=$(( (SIZE / 524288) - 1 ))
done
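
# Planned sanity check (not yet run): zdb -l can read labels from a plain file,
# so this should confirm the saved blobs actually contain intact labels
sudo zdb -l /mnt/SSD-Pool/sdr1-label-start.bin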

Key Observations

  1. Without -T rewind: fails with “I/O error / corrupted data” - the current (corrupted) uberblocks at txg ~5861020 are unusable
  2. With -T rewind to older txg: fails with “one or more devices is currently unavailable” - ZFS finds valid older uberblocks at txg ~5851160 but refuses to do a degraded rewind import, possibly a dRAID-specific limitation
  3. Readonly import causes kernel panics on both TrueNAS (OpenZFS 2.2.x) and Ubuntu live USB (zfs-kmod-2.3.4) - the MOS corruption crashes the ZFS kernel module
  4. Relaxed kernel parameters (zfs_recover=1, disabled verification, zfs_max_missing_tvds=1) did not change the outcome
  5. The bus is now completely clean - no CRC errors in dmesg after isolating the bad backplane port
  6. Whole-disk drives (no partition table) show “failed to unpack label” via zdb -l but are still recognized as ONLINE by zpool import - unclear if this is normal for dRAID or indicates label damage
  7. Two separate kernel panics occurred during readonly import attempts on two different ZFS versions, suggesting the corruption is in the MOS itself, not just uberblock selection
  8. Basic zpool import scan hangs for 45+ minutes on Ubuntu live USB - possibly a drive causing SCSI timeouts that aren’t visible in dmesg as CRC errors

What I Have NOT Done

  • Destroyed the pool
  • Run zpool labelclear on any drive
  • Written to any pool member drive since the corruption
  • Run zpool clear on the pool (it’s not imported)
  • Attempted dd of labels between drives

I have backed up labels (first + last 512KB) from the 3 drives with old uberblocks to a separate SSD pool.

Resolution?

Ideally I would like to import the pool in degraded mode (13/14 drives) using the older uberblocks from txg ~5851160 (March 8). The ~10,000-txg gap represents roughly 14 hours of writes (at ~5 seconds per txg), but nothing that matters to me was written during that window. If only some of the data is recoverable, that's fine too, and worst case I can handle the data loss. I'm not quite sure why I chose dRAID in the first place; in hindsight it probably wasn't the best choice.
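
For concreteness, this is roughly the command I'm hoping someone can sanity-check before I run it: read-only, no dataset mounts (-N), imported by the numeric pool id to sidestep the space in the pool name, and rewound to a txg the three golden drives hold. 5851160 is just the oldest txg they show; the real -T target should probably be the newest txg all three share:

sudo zpool import -f -F -N -o readonly=on -T 5851160 1067285057000815225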

Indeed… 14c:8d:2s does not even divide nicely.

You have already attempted a lot—possibly aggravating the damage. Unless @HoneyBadger has an idea to recover, I’m afraid the pool is lost and I’m not sure that Klennet ZFS Recovery or UFS Explorer RAID Recovery even support dRAID (although trying would be free).

I don’t think those two recovery software were even supporting Raid-Z Expansion. If I recall a forum post, Klennet was failing in trying to recover a pool that had used Raid-Z Expansion.