Yes. I have ECC RAM installed.
The drives were running hot (about 60°C). I have additional cooling in place now because of that.
I am NEARLY there after the work yesterday.
I can bring the pool back online temporarily, but it won’t resilver because the dead disk Z4F123Z0 keeps disconnecting and reconnecting, which puts TrueNAS into an endless loop.
I should be able to recover the pool with only the 9 stable disks, but if I pull the damaged disk it doesn’t rebuild.
Any ideas on how I can get that disk out and let it rebuild with the 9 stable disks?
My thoughts:
To rule out a connectivity issue (or maybe an “overloaded HBA”), you could try connecting the drive differently, for example via a SATA port directly on the motherboard (if that’s possible in this case), and then see if the resilver works.
Otherwise, I believe I’ve read somewhere that in cases where you don’t want to stress a failing HDD any further, you can first try to create an image of it with a tool like ddrescue (i.e. clone the disk to another one).
If that works, and ZFS then accepts the cloned drive as the missing member, the resilvering should hopefully complete successfully.
Of course, if and how well this works will depend on what ddrescue can actually recover (i.e. what data can still be salvaged).
From my perspective, creating an image of a disk that is about to fail certainly can’t hurt.
I should note, however, that I haven’t done this myself, so caution and further research are definitely recommended.
This would be my suggestion. Clone the failing drive using ddrescue - which can be set with a higher and longer tolerance for timeouts and failures, and to “continue after failure” - and then hopefully use the cloned disk as the resilver member, since it will carry the same partition table and GUID.
Hopefully, with the clone sitting on a physically sound disk, the resilver could/would complete.
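As a rough sketch of what that could look like (sdX is the failing source and sdY the same-size-or-larger target; both are placeholders, so triple-check the device names before running anything, because the target gets overwritten):

# pass 1: copy everything that reads cleanly, skip scraping the bad areas for now
ddrescue -f -n /dev/sdX /dev/sdY rescue.map
# pass 2: go back and retry the problem areas a few times, reusing the same map file
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map

The map file is what lets ddrescue resume later and keep track of which sectors are still unread.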
This is pretty bad and could be the source of the bad data. I have a number of the same Constellation ES.3 drives as you (mine are mostly 2TB, but still the NM0034) and they start to get upset at 40°C.
Cool. I am going to order a desktop HDD caddy so I can connect it to my Windows PC or another Ubuntu box, and then I can run the drive clone.
I’m not sure whether you actually need a proper caddy, or if you could just hook it up “MacGyver-style”… (that would probably be my approach.)
But try to keep the drive cool and handle it gently…
The ddrescue tool mentioned runs on Linux. I’m not sure whether other cloning software would work (you brought up your Windows PC); I’m a bit skeptical about that…
As @HoneyBadger already pointed out, ddrescue has special options to really try everything to read the drive.
I now have 10x new (to me) 4TB drives.
But I don’t think I can ddrescue the failed drive Z4F13LT5, because lsblk is showing it with a capacity of 0B. In case you are wondering, this is not the original “dead” drive, but it is the one that is now giving me jip.
sdh 3.6T Z4F0RQ890000R633SJFD ST4000NM0034
└─sdh1 3.6T
sdi 3.6T Z4F0RZ560000R633RF5C ST4000NM0034
└─sdi1 3.6T
sdj 3.6T Z4F134Z00000R650BPWJ ST4000NM0034
└─sdj1 3.6T
sdk 3.6T Z4F0S02M0000R633RCYT ST4000NM0034
└─sdk1 3.6T
sdl 3.6T Z4F0YRAG0000R642C1R7 ST4000NM0034
└─sdl1 3.6T
sdm 0B Z4F13LT50000R650BLK3 ST4000NM0034
sdn 3.6T Z4F0NL9N0000R628MC1V ST4000NM0034
└─sdn1 3.6T
sdo 3.6T Z4F0JX2B0000R632ZXB6 ST4000NM0034
└─sdo1 3.6T
sdp 3.6T Z4F13N960000C6489W7H ST4000NM0034
└─sdp1 3.6T
sdq 3.6T Z4F12ZBH0000R5225DPM ST4000NM0034
└─sdq1 3.6T
sdr 3.6T S8500NLF0000J611RGY2 ST4000NM0031
sds 3.6T S8500VJC0000J617ZY8N ST4000NM0031
So, I think my only option is to try and import the pool with the 9 working disks.
I am not sure if this helps:
root@NAS01[/dev]# zdb -l /dev/sd?
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3
That’s never good. Does it respond to smartctl -a /dev/sdm?
You’ll need to query the partition itself, so zdb -l /dev/sdn1 for example.
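Something like this might save some typing (the device range is just based on your lsblk output above, so adjust it to your layout; it loops over the data partitions and prints the start of each label):

# dump the ZFS label from each pool member's first partition
for p in /dev/sd[h-q]1; do echo "== $p =="; zdb -l "$p" | head -n 25; done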
root@NAS01[/dev]# smartctl -a /dev/sdm
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST4000NM0034
Revision: E005
Compliance: SPC-4
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c50085a594a7
Serial number: Z4F13LT50000R650BLK3
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Dec 4 18:52:01 2025 GMT
device is NOT READY (e.g. spun down, busy)
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
If it’s truly not spinning, you’re out of luck here. If it’s blocked you could try removing/reinserting it to see if it wakes up.
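Before pulling it, you could also try poking it from the shell first. The permissive flag is what smartctl itself suggests; sg_start comes from sg3_utils, which I’m assuming is available on your box:

# let smartctl carry on past the failed mandatory command, as its own message suggests
smartctl -a -T permissive /dev/sdm
# try sending an explicit SCSI START UNIT to spin the drive up, then re-check
sg_start --start /dev/sdm
smartctl -a /dev/sdm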
Is it not possible to restore the pool with one failed drive? I thought it was meant to be able to survive losing a single drive? I have 9/10 operational.
One failed drive, yes - but you appear to have data integrity errors across all of your drives as shown in the zpool status of post #34:
If these are errors in the data files then those are lost, but if the errors extend to important filesystem metadata then this could be what’s making the pool unimportable.
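If you do manage to get it imported (even read-only), zpool status -v will list the objects with permanent errors; plain file paths mean lost file data, while entries like <metadata>:<0x...> point at the more serious metadata damage. ("tank" below is just a placeholder for your pool name.)

# list the permanent errors in detail after an import
zpool status -v tank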
Ahh. I see. crap.
Thank you for your help. It’s very much appreciated.
I am going to clone all 10 of the original disks to my new disks as I think the data is still there. Then I can play around with some recovery tools.
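My rough plan once the clones are done (purely a sketch, with “tank” as a stand-in for my pool name) is to point zpool import at the cloned set and keep everything read-only:

# see what pools ZFS can find among the cloned disks
zpool import -d /dev/disk/by-id
# then attempt a read-only import so nothing gets written back to the clones
zpool import -d /dev/disk/by-id -o readonly=on -f tank

That way the originals stay untouched and the clones only ever see read-only access until I’m confident.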