I moved drives from my Supermicro's internal bays to a NetApp DS4246 disk shelf, and added some new drives at the same time. After installing everything, some drives reported 0B. For some of them this turned out to be the SATA 3.3V power-disable issue; others simply didn't seem to like the disk shelf at all. I shuffled drives around to get as many working as possible, and stupidly did some of this while the system was powered on, which may have caused problems. After finishing the migration I restarted the machine and found my existing pool offline. Multiple import attempts were then made, which may have corrupted the pool metadata. Now 13 of the 14 original drives are healthy and readable, but the pool won't import.
I have older valid uberblocks on 3 drives from before the corruption happened. I’m hoping someone here has experience recovering dRAID pools in this state.
System Details
- TrueNAS SCALE 24.10 (Electric Eel), OpenZFS 2.2.x
- Also attempted recovery from Ubuntu Server 24.04 live USB with zfs-kmod-2.3.4
- Supermicro chassis with internal LSI/mpt3sas HBA + NetApp DS4246 via LSI 9207-8e
Pool Layout
- Type: dRAID3, 8 data disks per group, 14 children, 2 distributed spares
- SLOG: Samsung 990 Pro 2TB NVMe
- Data: ~95 TiB used (93.73 TiB at last healthy state), mostly media files plus personal photos and family videos. Most of it is recoverable elsewhere: the photos are largely also in Google Photos, and the family videos are actively being digitized, so losing them would mostly mean duplicated work.
- Age: Pool has been running stable for about a year and a half prior to this incident
Current Pool Status
13 of 14 data drives show ONLINE; 1 drive stopped working somewhere in the transfer. Both distributed spares were already INUSE from 2 earlier drive failures, which is part of the reason I wanted to add the new drives.
pool: General HDD Pool
id: 1067285057000815225
state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
The pool may be active on another system, but can be imported using
the '-f' flag.
config:
	General HDD Pool                              FAULTED  corrupted data
	  draid3:8d:14c:2s-1                          DEGRADED
	    8a986844-b426-406c-a3c4-77bcc3d8f2d6      ONLINE
	    1fb470ae-1ac3-4272-9743-4799ecd67e81      ONLINE
	    72e4c576-337f-459d-842d-3ca7c2f9d6a3      ONLINE
	    8e1ce784-3912-4ccc-bc8d-e41aace9f644      ONLINE
	    draid3-1-1                                ONLINE
	    57ae2ee9-9544-4ead-9144-f82df0df0280      ONLINE
	    f07de474-5e79-4b22-9673-783319961e70      ONLINE
	    db52d223-9bc3-49ec-9564-84f7f6343a56      ONLINE
	    cea85f43-95f5-4415-8e9e-906bc471848d      ONLINE
	    ff134e3b-6d20-4427-a3ae-9326cefe2cdb      UNAVAIL
	    ce1be0cc-14c9-4db7-bed1-a10f9ecea31b      ONLINE
	    d3c46001-701a-42a8-b0be-226ec3019309      ONLINE
	    3b3e802c-2713-4562-8fa9-6fe041f1584b      ONLINE
	    draid3-1-0                                ONLINE
	  logs
	    nvme-Samsung_SSD_990_PRO_2TB_S7L9NJ0Y102160H  ONLINE
What Happened (Detailed Timeline)
- March 8 afternoon: Began migrating drives from internal Supermicro chassis to NetApp DS4246 disk shelf to free internal bays for SSDs.
- March 8 evening: Discovered many drives showing 0B in the disk shelf. The HBA saw the interposers, but the SATA drives behind them weren't communicating. Identified this as the SATA 3.3V power-disable issue (pin 3). Applied Kapton tape over pins 1-3, which fixed one drive but not the others.
- March 8-9: Shuffled drives between internal bays and shelf multiple times. Some drives only work internally, some work in both. Pool was importable earlier in this process (showed DEGRADED with fewer UNAVAIL drives).
- March 10 ~09:07-09:09 UTC: Multiple import attempts were made while CRC-erroring drives were on the bus. These may have written corrupted metadata/uberblocks to drives.
- March 10: One `zpool import -f -o readonly=on` attempt caused a system reboot.
- March 10 evening: Booted an Ubuntu Server live USB (zfs-kmod-2.3.4) for a recovery attempt. A basic `zpool import` scan with relaxed kernel parameters hung for 45+ minutes.
Uberblock Situation
Most drives had their uberblocks overwritten during the failed import attempts on March 10. Only 3 drives retain older uberblocks.
Drives with overwritten uberblocks (all from March 10):
sdc1, sdd1, sde1, sdg1, sdq1, sdu1, sdv1, sdab1:
txg range: 5861020 - 5861051
timestamps: Tue Mar 10 09:07:20 - 09:09:59 2026
Drives with valid older uberblocks (from before corruption):
sdr1: oldest txg = 5851160, timestamp = Sun Mar 8 16:28:59 2026
sdy1: oldest txg = 5851160
sdaa1: oldest txg = 5851160
Label txg (from zdb -l): 5852223
The gap between old and new uberblocks is ~9,860 transaction groups (~14 hours of writes at 5 sec/txg).
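For reference, the arithmetic behind that estimate, using the txg values above (nothing here is new data, just the subtraction spelled out):

```shell
# Transaction-group gap between the last good uberblocks (Mar 8)
# and the overwritten ones (Mar 10), per the zdb -lu survey.
old_txg=5851160        # oldest valid txg on sdr1/sdy1/sdaa1
new_txg=5861020        # first corrupted txg on the other drives
gap=$(( new_txg - old_txg ))
seconds=$(( gap * 5 ))                                   # ~5 s per txg
hours=$(awk -v s="$seconds" 'BEGIN { printf "%.1f", s/3600 }')
echo "gap=${gap} txgs, ~${seconds}s (~${hours} h) of writes"
```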
I have backed up the labels (first and last 512KB) from sdr1, sdy1, and sdaa1 to a separate SSD pool.
Drive Label Status
Partitioned drives (zdb -l /dev/sdX1 readable): sdc, sdd, sde, sdg, sdq, sdr, sdu, sdv, sdy, sdaa, sdab - all show pool name, correct pool GUID, dRAID vdev structure
Whole-disk drives (zdb -l /dev/sdX fails): sda, sdb, sdf, sdi, sdj, sdk, sdm, sdo, sds - zdb -l returns “failed to unpack label” on all 4 labels, BUT zpool import still sees these drives as ONLINE pool members
SLOG (nvme):
sudo zdb -l /dev/nvme1n1:
name: 'General HDD Pool'
txg: 5852223
pool_guid: 1067285057000815225
is_log: 1
guid: 6619869925875120406
NVMe device numbering changed across reboots (was nvme1n1, became nvme2n1, then back). SLOG serial is S7L9NJ0Y102160H - verified via nvme list.
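Because the kernel name drifts, anything scripted should go through /dev/disk/by-id, whose symlinks embed the serial. A minimal sketch of that lookup (the helper name and the directory argument are mine, for illustration):

```shell
# Resolve a /dev/disk/by-id symlink to whatever nvmeXnY node it
# currently points at, so scripts never hard-code the kernel name.
# $1: directory of by-id symlinks, $2: device serial to look for.
resolve_by_serial() {
    local dir="$1" serial="$2" link
    for link in "$dir"/nvme-*"$serial"*; do
        [ -e "$link" ] || continue
        readlink -f "$link"     # print the resolved device node
        return 0
    done
    return 1                    # no symlink carries that serial
}

# Usage on the real system (SLOG serial from `nvme list` above):
#   resolve_by_serial /dev/disk/by-id S7L9NJ0Y102160H
```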
zdb Pool Config Output
sudo zdb -e -p /dev/disk/by-id "General HDD Pool":
vdev_children: 3
children[0]:
    type: 'hole'
    id: 0
    guid: 0
children[1]:
    type: 'draid'
    id: 1
    guid: 13538714567898865848
    nparity: 3
    draid_ndata: 8
    draid_nspares: 2
    draid_ngroups: 12
    metaslab_array: 128
    metaslab_shift: 34
    ashift: 12
    asize: 144000925302784
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 3651093111097855860
        whole_disk: 0
        not_present: 1
        DTL: 2857
        create_txg: 4
The hole is probably from when I removed the previous SLOG and replaced it with a new SLOG that had previously been the L2ARC. That was done a day or two before the incident.
sudo zdb -e -C "General HDD Pool":
→ "can't open 'General HDD Pool': Input/output error"
Complete List of All Import Attempts
Phase 1: On TrueNAS SCALE
| # | Command | Result |
|---|---|---|
| 1 | zpool status "General HDD Pool" | “cannot open: no such pool” |
| 2 | zpool import (scan) | Pool shown as UNAVAIL, insufficient replicas (many drives missing) |
| 3 | zpool import "General HDD Pool" | UNAVAIL, insufficient replicas |
| 4 | zpool import -f "General HDD Pool" | I/O error |
| 5 | zpool import -f -F "General HDD Pool" | I/O error |
| 6 | zpool import -f -m "General HDD Pool" | I/O error |
| 7 | zpool import -f -F -m "General HDD Pool" | I/O error |
| 8 | zpool import -f -o readonly=on "General HDD Pool" | Kernel panic / system reboot |
| 9 | zpool import -f -F -n "General HDD Pool" | Empty output (dry run, no error but no output) |
| 10 | zpool import -f -F -T 50 "General HDD Pool" | one or more devices is currently unavailable |
| 11 | zpool import -f -F -T 200 "General HDD Pool" | one or more devices is currently unavailable |
| 12 | zpool import -f -F -T 10000 "General HDD Pool" | one or more devices is currently unavailable |
| 13 | zpool import -f -F -T 15000 -d /dev/sdr1 -d /dev/sdy1 -d /dev/sdaa1 "General HDD Pool" | one or more devices is currently unavailable |
| 14 | zpool import -f -F -m -T 10000 "General HDD Pool" | one or more devices is currently unavailable |
| 15 | zpool import -f -F -m -o readonly=on -c /dev/disk/by-id "General HDD Pool" | failed to read cache file contents: invalid or missing cache file / no such pool available |
| 16 | zpool import -f -F -m -d /dev/disk/by-id "General HDD Pool" | I/O error |
Phase 3: On TrueNAS SCALE - With Relaxed Kernel Parameters
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_max_missing_tvds
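For repeatability across reboots, the four writes above can be wrapped in a helper that reads each parameter back to confirm it took. A sketch (the function names are mine; the directory argument exists only so it isn't hard-wired to /sys):

```shell
# Write a ZFS module parameter, then read it back to confirm the
# kernel accepted it. $1: parameter dir, $2: name, $3: value.
set_zfs_param() {
    local dir="$1" name="$2" val="$3"
    echo "$val" > "$dir/$name" || return 1
    [ "$(cat "$dir/$name")" = "$val" ]    # verify the write took
}

# Apply the same four recovery parameters as above.
# $1 (optional): parameter directory, defaults to the real sysfs path.
apply_recovery_params() {
    local dir="${1:-/sys/module/zfs/parameters}"
    set_zfs_param "$dir" zfs_recover 1 &&
    set_zfs_param "$dir" spa_load_verify_metadata 0 &&
    set_zfs_param "$dir" spa_load_verify_data 0 &&
    set_zfs_param "$dir" zfs_max_missing_tvds 1
}
```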
| # | Command | Result |
|---|---|---|
| 17 | zpool import -f -F -m -o readonly=on -T 5851160 "General HDD Pool" | I/O error |
| 18 | zpool import -f -F -m -o readonly=on "General HDD Pool" | Kernel panic / system reboot |
Phase 4: On Ubuntu Server 24.04 Live USB - zfs-kmod-2.3.4
Same relaxed kernel parameters set and verified. ZFS version: zfs-2.2.2-0ubuntu9.4, zfs-kmod-2.3.4-1ubuntu2
| # | Command | Result |
|---|---|---|
| 19 | zpool import (basic device scan) | Hung for 45+ minutes, had to be killed |
Complete Diagnostic Commands Run
# Drive enumeration (run many times as drives were shuffled)
sudo lsblk -d -o NAME,SIZE,SERIAL | sort -k2 -h
sudo lsscsi
sudo sas2ircu 0 display | grep -B2 -A15 "Device is a Hard"
sudo fdisk -l /dev/sda
# NVMe device mapping (changed across reboots)
sudo nvme list
ls -la /dev/disk/by-id/ | grep nvme
# SCSI bus rescans (after each drive change)
echo "- - -" | sudo tee /sys/class/scsi_host/host0/scan
echo "- - -" | sudo tee /sys/class/scsi_host/host1/scan
echo "- - -" | sudo tee /sys/class/scsi_host/host11/scan
# Error checking (run many times)
sudo dmesg | grep -iE "error|CRC|FAILED"
sudo dmesg | grep -iE '(error|i/o|fail|reset|offline)'
sudo dmesg -C   # clear, then check after import attempt
# SMART / drive health
sudo smartctl -a /dev/nvme1n1
sudo hdparm -I /dev/sdt
# Drive identification
sudo blkid /dev/sdy
# Label inspection - partitioned drives
sudo zdb -l /dev/sdc1 → readable, shows pool metadata
sudo zdb -l /dev/sdr1 → readable
sudo zdb -l /dev/nvme1n1 → readable, SLOG label with is_log=1
# Label inspection - whole-disk drives
sudo zdb -l /dev/sda → "failed to unpack label" (x4)
sudo zdb -l /dev/sdb → "failed to unpack label" (x4)
# Uberblock inspection
sudo zdb -lu /dev/sda → empty (whole-disk, no partition)
sudo zdb -lu /dev/sdb → empty
sudo zdb -lu /dev/sdc1 → uberblocks all from Mar 10 (5861020-5861051)
sudo zdb -lu /dev/sdr1 → older uberblocks from Mar 8 (txg 5851160)
# Searching for UNAVAIL drive GUID across all connected devices
for d in /dev/sd{a..z} /dev/sda{a..e}; do   # zdb -l takes one device at a time
  sudo zdb -l "$d" 2>&1 | grep -B5 "ff134e3b"
done   # → not found on any device
# Uberblock survey across ALL pool member drives
for d in sda sdb sdc sdd sde sdf sdg sdi sdj sdk sdm sdo sdq sdr sds sdu sdv sdy sdaa sdab; do
part="${d}1"
if [ -b "/dev/$part" ]; then
echo "=== $d (partitioned) ==="
sudo zdb -lu /dev/$part 2>&1 | grep "txg = " | grep -v checkpoint | sort -t= -k2 -n | head -3
else
echo "=== $d (whole disk) ==="
sudo zdb -lu /dev/$d 2>&1 | grep "txg = " | grep -v checkpoint | sort -t= -k2 -n | head -3
fi
done
# Results:
# sdc1, sdd1, sde1, sdg1, sdq1, sdu1, sdv1, sdab1: txg 5861020+ (Mar 10, corrupted)
# sdr1, sdy1, sdaa1: txg 5851160 (Mar 8, valid)
# sda, sdb, sdf, sdi, sdj, sdo, sds: empty (whole-disk, no readable uberblocks)
# sdk, sdm: txg 0
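The classification step of that survey can be factored into one small filter that prints the oldest non-checkpoint txg from `zdb -lu` output (the helper name is mine; same grep/sort logic as the loop above):

```shell
# Read zdb -lu output on stdin and print the oldest "txg = N"
# value, skipping checkpoint_txg lines (mirrors the survey loop).
oldest_txg() {
    grep 'txg = ' | grep -v checkpoint \
        | awk -F'= ' '{ if (min == "" || $2+0 < min) min = $2+0 } END { print min }'
}

# Usage: sudo zdb -lu /dev/sdr1 | oldest_txg
# (on this system that prints 5851160, per the survey results above)
```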
# Pool metadata inspection
sudo zdb -e -p /dev/disk/by-id "General HDD Pool" → printed config structure (see above)
sudo zdb -e -C "General HDD Pool" → "can't open: Input/output error"
# Label backup of 3 golden drives
for d in sdr1 sdy1 sdaa1; do
sudo dd if=/dev/$d of=/mnt/SSD-Pool/${d}-label-start.bin bs=512K count=1
SIZE=$(sudo blockdev --getsize64 /dev/$d)
sudo dd if=/dev/$d of=/mnt/SSD-Pool/${d}-label-end.bin bs=512K count=1 skip=$(( (SIZE / 524288) - 1 ))
done
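Before relying on those images I also want them checksummed, so any later copy can be verified against a manifest. A sketch (the function is mine; the restore direction is shown only as a comment, since nothing should be written to pool members yet):

```shell
# Checksum the saved label images into a manifest, then verify it,
# so bit-rot in the backups themselves would be caught later.
# $1: directory holding the *-label-*.bin files from the dd loop.
verify_label_backups() {
    local dir="$1"
    ( cd "$dir" || return 1
      sha256sum ./*-label-*.bin > label-backups.sha256 &&
      sha256sum -c --quiet label-backups.sha256 )
}

# Usage: verify_label_backups /mnt/SSD-Pool
# Restoring would be the inverse dd (NOT to be run while diagnosing):
#   dd if=sdr1-label-start.bin of=/dev/sdr1 bs=512K count=1 conv=notrunc
```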
Key Observations
- Without `-T` rewind: fails with “I/O error / corrupted data”; the current (corrupted) uberblocks at txg ~5861020 are unusable
- With `-T` rewind to an older txg: fails with “one or more devices is currently unavailable”; ZFS finds the valid older uberblocks at txg ~5851160 but refuses to do a degraded rewind import, possibly a dRAID-specific limitation
- Read-only import causes kernel panics on both TrueNAS (OpenZFS 2.2.x) and the Ubuntu live USB (zfs-kmod-2.3.4); the MOS corruption appears to crash the ZFS kernel module
- Relaxed kernel parameters (zfs_recover=1, disabled verification, zfs_max_missing_tvds=1) did not change the outcome
- The bus is now completely clean: no CRC errors in dmesg after isolating the bad backplane port
- Whole-disk drives (no partition table) show “failed to unpack label” via `zdb -l` but are still recognized as ONLINE by `zpool import`; unclear if this is normal for dRAID or indicates label damage
- Two separate kernel panics occurred during read-only import attempts on two different ZFS versions, suggesting the corruption is in the MOS itself, not just uberblock selection
- A basic `zpool import` scan hangs for 45+ minutes on the Ubuntu live USB, possibly due to a drive causing SCSI timeouts that aren’t visible in dmesg as CRC errors
What I Have NOT Done
- Destroyed the pool
- Run `zpool labelclear` on any drive
- Written to any pool member drive since the corruption
- Run `zpool clear` on the pool (it’s not imported)
- Attempted `dd` of labels between drives

I have backed up labels (first + last 512KB) from the 3 drives with old uberblocks to a separate SSD pool.
Resolution?
Ideally I would like to import the pool in degraded mode (13/14 drives) using the older uberblocks from txg ~5851160 (March 8). The ~10,000-txg gap represents roughly 14 hours of writes (at ~5 seconds per txg), but nothing that matters to me was written during that window. If only some of the data is recoverable, that is fine too, and worst case I can handle the data loss. I’m not quite sure why I chose dRAID in the first place; in hindsight it probably wasn’t the best choice.