Yes. I have ECC RAM installed.
The drives were running hot (about 60°C). I have additional cooling in place now because of that.
I am NEARLY there after the work yesterday.
I can bring the pool back online temporarily, but it won’t resilver because the dead disk Z4F123Z0 keeps disconnecting and reconnecting, which puts TrueNAS into an endless loop.
I should be able to recover the pool with only the 9 stable disks, but if I pull the damaged disk it doesn’t rebuild.
Any ideas on how I can get that disk out and let it rebuild with the 9 stable disks?
My thoughts:
To rule out a connectivity issue (or maybe an “overloaded HBA”), you could try connecting the drive differently, for example via a SATA port directly on the motherboard (if that’s possible in this case), and then see if the resilver works.
Otherwise, I believe I’ve read somewhere that in cases where you don’t want to stress a failing HDD any further, you can first try to create an image of it with a tool like ddrescue (i.e. clone the disk to another one).
If that works, and ZFS then accepts the cloned drive as the missing member, the resilvering should hopefully complete successfully.
Of course, if and how well this works will depend on what ddrescue can actually recover (i.e. what data can still be salvaged).
From my perspective, creating an image of a disk that is about to fail certainly can’t hurt.
I should note, however, that I haven’t done this myself, so caution and further research are definitely recommended.
This would be my suggestion. Clone the failing drive using ddrescue - which can be set with a higher and longer tolerance for timeouts and failures, and to “continue after failure” - and then hopefully use the cloned disk as the resilver member, since it will carry the same partition table and GUID.
Hopefully, with the clone sitting on a physically sound disk, the resilver could/would complete.
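As a rough sketch of what that could look like (sdX is the failing source and sdY the same-size-or-larger target; both are placeholders, so triple-check the device names before running anything, because the target gets overwritten):

# pass 1: copy everything that reads cleanly, skip scraping the bad areas for now
ddrescue -f -n /dev/sdX /dev/sdY rescue.map
# pass 2: go back and retry the problem areas a few times, reusing the same map file
ddrescue -f -r3 /dev/sdX /dev/sdY rescue.map

The map file is what lets ddrescue resume later and keep track of which sectors are still unread.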
This is pretty bad and could be the source of the bad data. I have a number of the same Constellation ES.3 drives as you (mine are mostly 2TB, but still the NM0034) and they start to get upset at 40°C.
Cool. I am going to order a desktop HDD caddy so I can connect it to my Windows PC or another Ubuntu box, and then I can run the drive clone.
I’m not sure whether you actually need a proper caddy, or if you could just hook it up “MacGyver-style”… (that would probably be my approach.)
But try to keep the drive cool and handle it gently…
The ddrescue tool mentioned runs on Linux. I’m not sure whether other cloning software would work (you brought up your Windows PC); I’m a bit skeptical about that…
As @HoneyBadger already pointed out, ddrescue has special options to really try everything to read the drive.
I now have 10x new (to me) 4TB drives.
But I don’t think I can ddrescue the failed drive Z4F13LT5, because lsblk is showing it with a capacity of 0B. In case you are wondering, this is not the original “dead” drive, but it is the one that is now giving me jip.
sdh 3.6T Z4F0RQ890000R633SJFD ST4000NM0034
└─sdh1 3.6T
sdi 3.6T Z4F0RZ560000R633RF5C ST4000NM0034
└─sdi1 3.6T
sdj 3.6T Z4F134Z00000R650BPWJ ST4000NM0034
└─sdj1 3.6T
sdk 3.6T Z4F0S02M0000R633RCYT ST4000NM0034
└─sdk1 3.6T
sdl 3.6T Z4F0YRAG0000R642C1R7 ST4000NM0034
└─sdl1 3.6T
sdm 0B Z4F13LT50000R650BLK3 ST4000NM0034
sdn 3.6T Z4F0NL9N0000R628MC1V ST4000NM0034
└─sdn1 3.6T
sdo 3.6T Z4F0JX2B0000R632ZXB6 ST4000NM0034
└─sdo1 3.6T
sdp 3.6T Z4F13N960000C6489W7H ST4000NM0034
└─sdp1 3.6T
sdq 3.6T Z4F12ZBH0000R5225DPM ST4000NM0034
└─sdq1 3.6T
sdr 3.6T S8500NLF0000J611RGY2 ST4000NM0031
sds 3.6T S8500VJC0000J617ZY8N ST4000NM0031
So, I think my only option is to try and import the pool with the 9 working disks.
I am not sure if this helps:
root@NAS01[/dev]# zdb -l /dev/sd?
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2
failed to unpack label 3
That’s never good. Does it respond to smartctl -a /dev/sdm?
You’ll need to query the partition itself, so zdb -l /dev/sdn1 for example.
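Something like this might save some typing (the device range is just based on your lsblk output above, so adjust it to your layout; it loops over the data partitions and prints the start of each label):

# dump the ZFS label from each pool member's first partition
for p in /dev/sd[h-q]1; do echo "== $p =="; zdb -l "$p" | head -n 25; done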
root@NAS01[/dev]# smartctl -a /dev/sdm
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST4000NM0034
Revision: E005
Compliance: SPC-4
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c50085a594a7
Serial number: Z4F13LT50000R650BLK3
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Thu Dec 4 18:52:01 2025 GMT
device is NOT READY (e.g. spun down, busy)
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
If it’s truly not spinning, you’re out of luck here. If it’s blocked you could try removing/reinserting it to see if it wakes up.
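Before pulling it, you could also try poking it from the shell first. The permissive flag is what smartctl itself suggests; sg_start comes from sg3_utils, which I’m assuming is available on your box:

# let smartctl carry on past the failed mandatory command, as its own message suggests
smartctl -a -T permissive /dev/sdm
# try sending an explicit SCSI START UNIT to spin the drive up, then re-check
sg_start --start /dev/sdm
smartctl -a /dev/sdm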
Is it not possible to restore the pool with one failed drive? I thought it was meant to be able to survive losing a single drive? I have 9/10 operational.
One failed drive, yes - but you appear to have data integrity errors across all of your drives as shown in the zpool status of post #34:
If these are errors in the data files then those are lost, but if the errors extend to important filesystem metadata then this could be what’s making the pool unimportable.
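If you do manage to get it imported (even read-only), zpool status -v will list the objects with permanent errors; plain file paths mean lost file data, while entries like <metadata>:<0x...> point at the more serious metadata damage. ("tank" below is just a placeholder for your pool name.)

# list the permanent errors in detail after an import
zpool status -v tank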
Ahh. I see. crap.
Thank you for your help. It’s very much appreciated.
I am going to clone all 10 of the original disks to my new disks as I think the data is still there. Then I can play around with some recovery tools.
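My rough plan once the clones are done (purely a sketch, with “tank” as a stand-in for my pool name) is to point zpool import at the cloned set and keep everything read-only:

# see what pools ZFS can find among the cloned disks
zpool import -d /dev/disk/by-id
# then attempt a read-only import so nothing gets written back to the clones
zpool import -d /dev/disk/by-id -o readonly=on -f tank

That way the originals stay untouched and the clones only ever see read-only access until I’m confident.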