I have a zpool of 8 drives in RAIDZ2. The system has been running for a number of years now with no major issues that I haven’t been able to overcome.
I’ve been replacing the drives in my pool one at a time to eventually increase its size, so I always have a buffer if another existing drive fails during a resilver.
On the latest replacement, the new drive came up as /dev/sdk. After doing the replacement in the UI, a couple of hours into the resilver it started to show faults and eventually went “FAULTED” in the UI, although the resilver process continued. As I wasn’t sure what to do, I let the resilver complete before doing anything. Once it finished, the drive was still showing as faulted. I rebooted TrueNAS, and on boot the new drive was now showing “UNAVAILABLE”. So in the “REPLACING” section for the outgoing drive and the new drive, I now had one showing “REMOVED” and the other showing “UNAVAILABLE”.
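For reference, the resilver and per-device state can also be checked from the shell with something like the following (the pool name “tank” is just a placeholder for my actual pool):
# show pool health, per-vdev device states and resilver progress
zpool status -v tank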
Luckily, I had purchased two new drives, so my plan was simply to replace the replacement and then start a warranty claim with Seagate for the failed new one.
In the UI, I removed the one showing as “UNAVAILABLE”.
I put the second new drive in its place, but it won’t show up in the UI to select as a new replacement for the originally removed drive.
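As far as I understand it, the CLI equivalent of what the UI does would be roughly the following, but since the new disk isn’t presenting a usable device (see below) there is nothing sensible to point it at (pool name and device paths are placeholders):
# attach the new disk as the replacement for the old/removed member, using a stable by-id path
zpool replace tank <old-device-or-guid> /dev/disk/by-id/<new-drive-id>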
If I do lsblk, I can see this second new drive is also showing up as /dev/sdk again, but it is showing 0 bytes. The LED on the drive bay continues to do the “new drive setup flash”. The drive /dev/sdk does not show up at all if I run fdisk -l.
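To make sure I’m looking at the right physical disk, the device name can be matched against the drive label by serial number/WWN with something like:
# list whole disks only, with model, serial and WWN
lsblk -d -o NAME,SIZE,MODEL,SERIAL,WWN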
I’m not sure how to troubleshoot this further. I would really appreciate some assistance. Both of these drives are Seagate Exos SAS 16TB.
My pool continues to run albeit in a degraded state with 1 disk missing.
I tried rebooting TrueNAS with this second replacement drive disconnected.
I then plugged it back in and monitored dmesg. This time it showed up as /dev/sdo, but it is still showing a capacity of 0 B.
This is what I captured in dmesg when plugging it in:
[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: handle(0x17) sas_address(0x5000c500d917b71d) port_type(0x1)
[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Direct-Access SEAGATE ST16000NM002G E003 PQ: 0 ANSI: 7
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: SSP: handle(0x0017), sas_addr(0x5000c500d917b71d), phy(5), device_name(0x5000c500d917b71c)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure logical id (0x500304802107f73f), slot(5)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure level(0x0000), connector name( )
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Power-on or device reset occurred
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: Attached scsi generic sg15 type 0
[Sun Oct 20 13:12:22 2024] [98]: scst: Attached to scsi4, channel 0, id 12, lun 0, type 0
[Sun Oct 20 13:12:22 2024] end_device-4:0:8: add: handle(0x0017), sas_addr(0x5000c500d917b71d)
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: [sdo] Spinning up disk...
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0 512-byte logical blocks: (0 B/0 B)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0-byte physical blocks
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Test WP failed, assume Write Enabled
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Asking for cache data failed
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Assuming drive cache: write through
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Attached SCSI disk
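Given the “Logical unit is in process of becoming ready” and failed Read Capacity messages above, my next thought is to query the drive directly with sg3_utils; I believe something like this would show whether it ever actually reports ready, and what capacity it claims (assuming the device node is still /dev/sdo):
# issue a TEST UNIT READY and print any sense data returned
sg_turs -vv /dev/sdo
# ask the drive for its capacity via READ CAPACITY(16)
sg_readcap --long /dev/sdo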
lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1M 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 19.5G 0 part
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 400G 0 part
├─sdb2 8:18 0 400G 0 part
├─sdb3 8:19 0 16G 0 part
└─sdb4 8:20 0 16G 0 part
sdc 8:32 0 931.5G 0 disk
├─sdc1 8:33 0 400G 0 part
├─sdc2 8:34 0 400G 0 part
├─sdc3 8:35 0 16G 0 part
└─sdc4 8:36 0 16G 0 part
sdd 8:48 0 14.6T 0 disk
├─sdd1 8:49 0 2G 0 part
└─sdd2 8:50 0 14.6T 0 part
sde 8:64 0 14.6T 0 disk
├─sde1 8:65 0 2G 0 part
└─sde2 8:66 0 14.6T 0 part
sdf 8:80 0 14.6T 0 disk
├─sdf1 8:81 0 2G 0 part
└─sdf2 8:82 0 14.6T 0 part
sdg 8:96 0 14.6T 0 disk
├─sdg1 8:97 0 2G 0 part
└─sdg2 8:98 0 14.6T 0 part
sdh 8:112 0 14.6T 0 disk
├─sdh1 8:113 0 2G 0 part
└─sdh2 8:114 0 14.6T 0 part
sdi 8:128 0 14.6T 0 disk
└─sdi1 8:129 0 14.6T 0 part
sdj 8:144 0 5.5T 0 disk
├─sdj1 8:145 0 2G 0 part
└─sdj2 8:146 0 5.5T 0 part
sdk 8:160 0 14.6T 0 disk
├─sdk1 8:161 0 2G 0 part
└─sdk2 8:162 0 14.6T 0 part
sdl 8:176 0 14.6T 0 disk
└─sdl1 8:177 0 14.6T 0 part
sdm 8:192 0 14.6T 0 disk
├─sdm1 8:193 0 2G 0 part
└─sdm2 8:194 0 14.6T 0 part
sdn 8:208 0 5.5T 0 disk
├─sdn1 8:209 0 2G 0 part
└─sdn2 8:210 0 5.5T 0 part
sdo 8:224 0 0B 0 disk
I also tried to run fsck on the drive:
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
fsck.ext2: Invalid argument while trying to open /dev/sdo
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem. If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>
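The only other check I can think of is the drive’s own SMART/SCSI health data; I believe something like this (smartmontools, assuming the device node is still /dev/sdo) would show whether the drive itself reports any errors or grown defects:
# full SMART/SCSI health, error counters and self-test information
smartctl -x /dev/sdo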
I’m getting the same issue when introducing a new 7200 RPM 1 TB drive into my JBOD, which has 16 600 GB 15K drives in the other bays. I’m trying to populate the rest of the slots, and whilst I know it might impact performance, the server HBA is only 6 Gb/s, so it should be within the realm of possibility…