Recovering a failed VDEV to make it available to the Storage Pool

Quick background: apparently the last admin "retired in place" and has since departed. When I was hired, I reviewed the TrueNAS server and found that 7 drives had gone offline since 2023. Unfortunately, one of the 5 RAIDZ2 VDEVs had lost 3 drives, so that VDEV was lost.

System: TrueNAS-12.0-U6.1
Hardware: Supermicro, 64 GB RAM, 36× 12 TB drives, Xeon E3-1240 CPU.

I’ve since replaced 7 of the failed drives (and another failed while I was working to get the original 7 replaced), including the three failed drives in RAIDZ2 #4. The other RAIDZ2 VDEVs have recovered and are no longer in a DEGRADED state.

After the third drive was replaced in RAIDZ2 #4, the other drives in that VDEV changed to DEGRADED, and I see now that one of the replaced drives has also changed to DEGRADED. There are no read or write errors, but all of them show checksum errors.

Any thoughts on getting the drives out of the DEGRADED state and bringing the VDEV back to the storage pool?

Welcome to the forums!

Checksum errors are frequently attributable to cable issues. The standard first action is to inspect the drive data cables and connectors, possibly swap cables between good drives and bad drives to see whether the errors follow the cables, etc. Replace cables as necessary.
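Once the cabling has been ruled out or fixed, and after any running resilver finishes, the usual follow-up is to reset the error counters and verify with a scrub. A minimal sketch, assuming a pool named tank:

# zpool clear tank
# zpool scrub tank
# zpool status tank

If the checksum counts stay at zero after the scrub, the cables were likely the culprit; if they climb again, keep looking at the controller or backplane.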

To help, we’ll need more information on the pool layout, drives, and controllers/HBAs.
For a start, please post the outputs of these commands, formatted using the </> button:
zpool status
camcontrol devlist
sas2flash -list
sas3flash -list
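SMART data for one or two of the affected drives would also help rule out the disks themselves, e.g. (device name here is just an example, substitute your own):

# smartctl -a /dev/da5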


I guess the main thing here is that nothing changed other than my replacing the bad drives, and then all the rest of the drives showed up as DEGRADED. I could see it if it were a single drive that kicked out an error, but all the remaining drives, and only in that VDEV?

The top "too many errors" drive is the one I’m currently replacing. The new drive is being wiped now.

The three (resilvering) ones are drives I’ve successfully replaced.

raidz2-3 was the one with three failed drives (the two now-ONLINE ones and the second drive from the bottom). When the third drive was replaced, the rest of the drives listed there changed to DEGRADED.

# zpool status
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Thu Aug 21 03:45:07 2025
config:

	NAME        STATE     READ WRITE CKSUM
	boot-pool   ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada0p2  ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Aug 23 18:02:59 2025
	236T scanned at 1.50G/s, 234T issued at 1.49G/s, 284T total
	13.7T resilvered, 82.40% done, 09:34:02 to go
config:

	NAME                                              STATE     READ WRITE CKSUM
	tank                                              DEGRADED     0     0     0
	  raidz2-0                                        ONLINE       0     0     0
	    gptid/c39b69a7-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c3bb9d40-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c3c7d861-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c4042224-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c4092f25-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c47f45b9-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/0a52cb3d-7dd5-11f0-a01e-ac1f6b7d100c    ONLINE       0     0     0
	  raidz2-1                                        DEGRADED     0     0     0
	    gptid/c46f76f6-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c4a8c25e-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c48e7767-1950-11eb-9178-ac1f6b7d100c    FAULTED    392     1     0  too many errors
	    gptid/c52c9126-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c55f4696-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/86b9eae4-7ec7-11f0-a01e-ac1f6b7d100c    ONLINE       0     0     0  (resilvering)
	    gptid/c63e5bf2-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	  raidz2-2                                        ONLINE       0     0     0
	    gptid/c600a2c3-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0   223
	    gptid/c66a6429-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/45e43bdd-7feb-11f0-a01e-ac1f6b7d100c    ONLINE       0     0     0  (resilvering)
	    gptid/c7e9b7e1-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c8cc889e-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c8dc0f22-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/c9557f5a-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	  raidz2-3                                        DEGRADED     0     0     0
	    gptid/143c8ff3-7d09-11f0-a01e-ac1f6b7d100c    ONLINE       0     0    54
	    gptid/c9c9d5be-1950-11eb-9178-ac1f6b7d100c    DEGRADED     0     0   242  too many errors
	    gptid/ca026686-1950-11eb-9178-ac1f6b7d100c    DEGRADED     0     0   242  too many errors
	    gptid/485f5d83-7d09-11f0-a01e-ac1f6b7d100c    ONLINE       0     0    72
	    gptid/cabd4f5b-1950-11eb-9178-ac1f6b7d100c    DEGRADED     0     0   243  too many errors
	    gptid/11f34142-76c0-11f0-a01e-ac1f6b7d100c    DEGRADED     0     0   198  too many errors
	    gptid/cadc9345-1950-11eb-9178-ac1f6b7d100c    DEGRADED     0     0   243  too many errors
	  raidz2-4                                        ONLINE       0     0     0
	    gptid/cb13fa4b-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/cb5f4f68-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/cb9083c8-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    spare-3                                       ONLINE       0     0    56
	      gptid/fb40d09c-81bb-11f0-a01e-ac1f6b7d100c  ONLINE       0     0     0  (resilvering)
	      gptid/cc1f595f-1950-11eb-9178-ac1f6b7d100c  ONLINE       0     0     0
	    gptid/cbdc73c1-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/cbe211e7-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	    gptid/cc24aa76-1950-11eb-9178-ac1f6b7d100c    ONLINE       0     0     0
	spares
	  gptid/cc1f595f-1950-11eb-9178-ac1f6b7d100c      INUSE     currently in use

errors: 8 data errors, use '-v' for a list
# camcontrol devlist
<ATA HGST HUH721212AL T3D0>        at scbus1 target 4 lun 0 (pass0,da0)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 5 lun 0 (pass1,da1)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 6 lun 0 (pass2,da2)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 7 lun 0 (pass3,da3)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 8 lun 0 (pass4,da4)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 9 lun 0 (da5,pass5)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 10 lun 0 (pass6,da6)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 11 lun 0 (da7,pass7)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 12 lun 0 (pass8,da8)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 13 lun 0 (pass9,da9)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 14 lun 0 (pass10,da10)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 15 lun 0 (pass11,da11)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 16 lun 0 (pass12,da12)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 17 lun 0 (da13,pass13)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 18 lun 0 (pass14,da14)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 19 lun 0 (pass15,da15)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 20 lun 0 (pass16,da16)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 21 lun 0 (pass17,da17)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 22 lun 0 (da18,pass18)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 23 lun 0 (pass19,da19)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 24 lun 0 (pass20,da20)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 25 lun 0 (pass21,da21)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 26 lun 0 (da23,pass23)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 27 lun 0 (da22,pass22)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 28 lun 0 (pass24,da24)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 29 lun 0 (da25,pass25)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 30 lun 0 (pass26,da26)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 31 lun 0 (pass27,da27)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 32 lun 0 (pass28,da28)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 33 lun 0 (pass29,da29)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 34 lun 0 (pass30,da30)
<SEAGATE ST12000NM007H FE05>       at scbus1 target 35 lun 0 (da31,pass31)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 36 lun 0 (pass32,da32)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 37 lun 0 (pass33,da33)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 38 lun 0 (pass34,da34)
<ATA HGST HUH721212AL T3D0>        at scbus1 target 39 lun 0 (pass35,da35)
<ADAPTEC AEC-82885T B053>          at scbus3 target 0 lun 0 (pass36,ses0)
<ADAPTEC Virtual SGPIO 1>          at scbus3 target 1 lun 0 (pass37,ses1)
<Samsung SSD 860 PRO 512GB RVM01B6Q>  at scbus6 target 0 lun 0 (pass38,ada0)
<Samsung SSD 860 PRO 512GB RVM01B6Q>  at scbus7 target 0 lun 0 (pass39,ada1)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus12 target 0 lun 0 (pass40,ses2)
# sas2flash --list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01) 
Copyright (c) 2008-2013 LSI Corporation. All rights reserved 

	No LSI SAS adapters found! Limited Command Set Available!
	ERROR:  Invalid command --list

	Exiting Program.
# sas3flash --list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

	No Avago SAS adapters found! Limited Command Set Available!
	ERROR:  Invalid command --list

	Exiting Program.

No 9200 or 9300 SAS HBA… How are the drives attached?

Besides raidz2-3 and its many checksum failures (cables? controller?), raidz2-1 has one drive with read and write errors, which could be failing as well.
Do you have a backup?
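It would also be worth seeing exactly which files the 8 data errors hit (pool name taken from your output):

# zpool status -v tank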

Well, I typically use ‘arcconf’ to check the controller, if that helps.

# arcconf
Controllers found: 1

  | UCLI |  Microsemi Adaptec uniform command line interface
  | UCLI |  Version 3.01 (B23531)
  | UCLI |  (C) Microsemi Corporation 2003-2019
  | UCLI |  All Rights Reserved

As noted in my reply, I’m in the process of replacing that disk, which as of this morning has successfully completed.

And nope, no backups. I’m working on getting a solution in place and now have a mandate to come up with a hybrid cloud solution.
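As a starting point for that, ZFS replication is one common building block. A minimal sketch, with placeholder snapshot, host, and target-pool names:

# zfs snapshot -r tank@offsite-2025-08-24
# zfs send -R tank@offsite-2025-08-24 | ssh backup-host zfs recv -F backuppool/tank

Incremental sends (zfs send -i) would keep the offsite copy current after the initial full transfer.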