New drive failed during resilver. Second new drive not selectable as replacement

Hi all,

I have a zpool of 8 drives in RAIDZ2. The system has been running for a number of years now with no major issues that I haven’t been able to overcome.

I’ve been replacing some drives in my pool to eventually increase the size of the pool. I’ve been replacing one drive at a time so I always have a buffer if another existing drive fails during resilver.

On the latest replacement, the new drive came up as /dev/sdk. After I started the replacement in the UI and a couple of hours had passed, the drive began showing faults during the resilver and eventually went “FAULTED” in the UI, although the resilver process continued. As I wasn’t sure what to do, I let the resilver complete before doing anything. Once it completed, the drive was still showing as faulted. I rebooted TrueNAS, and on boot the new drive was now showing “UNAVAILABLE”. So in the “REPLACING” section for the outgoing drive and the new drive, I now had one showing “REMOVED” and the other showing “UNAVAILABLE”.

Luckily, I purchased two new drives so my plan was just to replace the replacement, and then start a warranty claim with Seagate on this new failed one.

In the UI, I removed the one showing as “UNAVAILABLE”.

I put the second new drive in its place, but it won’t show up in the UI as a selectable replacement for the originally removed drive.

If I run lsblk, I can see this second new drive is also showing up as /dev/sdk, but with a size of 0 bytes. The LED on the drive bay continues to do the “new drive setup” flash. The drive /dev/sdk does not show up at all if I run fdisk -l.

I’m not sure how to troubleshoot this further. I would really appreciate some assistance. Both of these drives are Seagate Exos SAS 16TB.

My pool continues to run albeit in a degraded state with 1 disk missing.

Any assistance would be greatly appreciated.

Thanks,
D

zpool status pool1
  pool: pool1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 656G in 20:45:25 with 0 errors on Sun Oct 20 04:00:17 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	pool1                                     DEGRADED     0     0     0
	 raidz2-0                                DEGRADED     0     0     0
	   7dae4acc-853b-48bf-93fd-ec3f213c4c77  ONLINE       0     0     0
	   5efaf86e-70d3-4373-8a28-1d5fb9666407  ONLINE       0     0     0
	   200b4586-6681-11eb-aa3b-13062a609593  ONLINE       0     0     0
	   1fb8609d-6681-11eb-aa3b-13062a609593  ONLINE       0     0     0
	   20321da4-6681-11eb-aa3b-13062a609593  REMOVED      0     0     0
	   0d89d522-d06e-49bb-a0ec-0acd912698df  ONLINE       0     0     0
	   fa486d40-4554-4a53-8ed0-cfcc39c05c20  ONLINE       0     0     0
	   521efaab-5f86-4058-9496-de5bfe116510  ONLINE       0     0     0
	logs	
	 mirror-1                                ONLINE       0     0     0
	   d3814206-6740-11eb-8512-ebe22aa91927  ONLINE       0     0     0
	   e1e036ba-6740-11eb-8512-ebe22aa91927  ONLINE       0     0     0
	cache
	 sdb2                                    ONLINE       0     0     0
	 sdc2                                    ONLINE       0     0     0

errors: No known data errors

I tried rebooting TrueNAS with this second replacement drive disconnected.

I plugged it back in and monitored dmesg. This time it showed up as /dev/sdo, but it is still showing a capacity of 0B.

This is what I captured in dmesg when plugging it in:

[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: handle(0x17) sas_address(0x5000c500d917b71d) port_type(0x1)
[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Direct-Access     SEAGATE  ST16000NM002G    E003 PQ: 0 ANSI: 7
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: SSP: handle(0x0017), sas_addr(0x5000c500d917b71d), phy(5), device_name(0x5000c500d917b71c)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure logical id (0x500304802107f73f), slot(5)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure level(0x0000), connector name(     )
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Power-on or device reset occurred
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: Attached scsi generic sg15 type 0
[Sun Oct 20 13:12:22 2024] [98]: scst: Attached to scsi4, channel 0, id 12, lun 0, type 0
[Sun Oct 20 13:12:22 2024]  end_device-4:0:8: add: handle(0x0017), sas_addr(0x5000c500d917b71d)
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: [sdo] Spinning up disk...
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0 512-byte logical blocks: (0 B/0 B)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0-byte physical blocks
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Test WP failed, assume Write Enabled
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Asking for cache data failed
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Assuming drive cache: write through
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Attached SCSI disk
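From what I can tell, the repeated Read Capacity failures with sense key “Not Ready” are why the kernel registers it as a 0-byte disk: the drive spins up but never reports a capacity. To make the relevant lines easier to spot on each hot-plug attempt, I’ve been filtering dmesg like this (just a quick sketch; the pattern and device name are what apply to my log above):

```shell
# Pull out the lines showing the drive stuck in "becoming ready";
# replace sdo with whatever name the kernel assigns on the next hot-plug.
dmesg | grep -E 'sdo.*(Read Capacity|Not Ready|becoming ready)'
```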

lsblk:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0    20G  0 disk
├─sda1   8:1    0     1M  0 part
├─sda2   8:2    0   512M  0 part
└─sda3   8:3    0  19.5G  0 part
sdb      8:16   0 931.5G  0 disk
├─sdb1   8:17   0   400G  0 part
├─sdb2   8:18   0   400G  0 part
├─sdb3   8:19   0    16G  0 part
└─sdb4   8:20   0    16G  0 part
sdc      8:32   0 931.5G  0 disk
├─sdc1   8:33   0   400G  0 part
├─sdc2   8:34   0   400G  0 part
├─sdc3   8:35   0    16G  0 part
└─sdc4   8:36   0    16G  0 part
sdd      8:48   0  14.6T  0 disk
├─sdd1   8:49   0     2G  0 part
└─sdd2   8:50   0  14.6T  0 part
sde      8:64   0  14.6T  0 disk
├─sde1   8:65   0     2G  0 part
└─sde2   8:66   0  14.6T  0 part
sdf      8:80   0  14.6T  0 disk
├─sdf1   8:81   0     2G  0 part
└─sdf2   8:82   0  14.6T  0 part
sdg      8:96   0  14.6T  0 disk
├─sdg1   8:97   0     2G  0 part
└─sdg2   8:98   0  14.6T  0 part
sdh      8:112  0  14.6T  0 disk
├─sdh1   8:113  0     2G  0 part
└─sdh2   8:114  0  14.6T  0 part
sdi      8:128  0  14.6T  0 disk
└─sdi1   8:129  0  14.6T  0 part
sdj      8:144  0   5.5T  0 disk
├─sdj1   8:145  0     2G  0 part
└─sdj2   8:146  0   5.5T  0 part
sdk      8:160  0  14.6T  0 disk
├─sdk1   8:161  0     2G  0 part
└─sdk2   8:162  0  14.6T  0 part
sdl      8:176  0  14.6T  0 disk
└─sdl1   8:177  0  14.6T  0 part
sdm      8:192  0  14.6T  0 disk
├─sdm1   8:193  0     2G  0 part
└─sdm2   8:194  0  14.6T  0 part
sdn      8:208  0   5.5T  0 disk
├─sdn1   8:209  0     2G  0 part
└─sdn2   8:210  0   5.5T  0 part
sdo      8:224  0     0B  0 disk

I also tried to run fsck on the drive:

fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
fsck.ext2: Invalid argument while trying to open /dev/sdo

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>
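In hindsight, fsck was never going to work here: the kernel thinks the disk has 0 sectors, so no filesystem tool can open it. A quicker way to check the kernel’s view of the size is sysfs (a minimal sketch; the helper function and its optional second argument are just my own convenience for trying it out without real hardware):

```shell
# Report a disk's size as the kernel sees it, in 512-byte sectors.
# A result of 0 means the drive never answered READ CAPACITY, so any
# filesystem tool (fsck included) will fail to open the device.
disk_sectors() {
    # the optional second argument overrides the sysfs root
    cat "${2:-/sys}/block/$1/size"
}
# disk_sectors sdo   -> prints 0 on my system right now
```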

I wanted to upload some screenshots of what I’m seeing in the UI, but this new platform doesn’t seem to allow PNG files to be attached to the post.

I feel like this could be something I have done wrong after that initial replacement disk failed during resilver.

Can anyone advise?

I have already started the RMA on the initial one that failed. But for two brand-new drives to effectively be DOA seems like extremely bad luck.