Hi all,
I have a zpool of 8 drives in RAIDZ2. The system has been running for a number of years now with no issues I haven’t been able to overcome.
I’ve been replacing the drives in my pool one at a time to eventually increase its size, so I always have a buffer if another existing drive fails during a resilver.
On the latest replacement, the drive came up as /dev/sdk. After I did the replacement in the UI and a couple of hours had passed, it started to show faults during the resilver, and it eventually went “FAULTED” in the UI. The resilver process continued, though. As I wasn’t sure what to do, I let the resilver complete before doing anything. Once it completed, the drive was still showing FAULTED. I rebooted TrueNAS, and upon boot-up the new drive was now showing “UNAVAILABLE”. So in the “REPLACING” section for the outgoing drive and the new drive, I now had one showing “REMOVED” and the other showing “UNAVAILABLE”.
Luckily, I had purchased two new drives, so my plan was simply to replace the replacement and then start a warranty claim with Seagate on the newly failed one.
In the UI, I removed the one showing as “UNAVAILABLE”.
I put the second new drive in its place, but it won’t show up in the UI as a selectable replacement for the originally removed drive.
If I run lsblk, I can see this second new drive is also showing up as /dev/sdk again, but it is showing 0 bytes. The LED on the drive bay continues to do the “new drive setup” flash. The drive /dev/sdk does not show up at all if I run fdisk -l.
I’m not sure how to troubleshoot this further. Both of these drives are Seagate Exos SAS 16TB.
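The only further check I can think of is to probe the drive below the partition layer. Would something like this be a sensible next step? (I’m assuming sg3_utils and smartmontools are available; I haven’t run these yet.)

# Ask the drive to identify itself, bypassing the partition layer
smartctl -i /dev/sdk

# Issue READ CAPACITY(16) directly to the drive
sg_readcap -l /dev/sdk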
My pool continues to run, albeit in a degraded state with one disk missing.
Any assistance would be greatly appreciated.
Thanks,
D
zpool status pool1

  pool: pool1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 656G in 20:45:25 with 0 errors on Sun Oct 20 04:00:17 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        pool1                                     DEGRADED     0     0     0
          raidz2-0                                DEGRADED     0     0     0
            7dae4acc-853b-48bf-93fd-ec3f213c4c77  ONLINE       0     0     0
            5efaf86e-70d3-4373-8a28-1d5fb9666407  ONLINE       0     0     0
            200b4586-6681-11eb-aa3b-13062a609593  ONLINE       0     0     0
            1fb8609d-6681-11eb-aa3b-13062a609593  ONLINE       0     0     0
            20321da4-6681-11eb-aa3b-13062a609593  REMOVED      0     0     0
            0d89d522-d06e-49bb-a0ec-0acd912698df  ONLINE       0     0     0
            fa486d40-4554-4a53-8ed0-cfcc39c05c20  ONLINE       0     0     0
            521efaab-5f86-4058-9496-de5bfe116510  ONLINE       0     0     0
        logs
          mirror-1                                ONLINE       0     0     0
            d3814206-6740-11eb-8512-ebe22aa91927  ONLINE       0     0     0
            e1e036ba-6740-11eb-8512-ebe22aa91927  ONLINE       0     0     0
        cache
          sdb2                                    ONLINE       0     0     0
          sdc2                                    ONLINE       0     0     0

errors: No known data errors
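For what it’s worth, my understanding is that the UI replacement boils down to something like the following on the CLI, using the identifier of the REMOVED member above (the by-partuuid path is a placeholder I made up, since TrueNAS normally partitions the new disk itself):

# <new-data-partition-uuid> is a placeholder for the new disk's data partition
zpool replace pool1 20321da4-6681-11eb-aa3b-13062a609593 /dev/disk/by-partuuid/<new-data-partition-uuid>

Obviously that can’t work while the new disk reports 0 bytes, but I mention it in case the UI is just hiding a CLI error.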
I tried rebooting TrueNAS with this second replacement drive disconnected. I then plugged it back in and monitored dmesg. This time it showed up as /dev/sdo, but it is still showing a capacity of 0B.
This is what I captured in dmesg when plugging it in:
[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: handle(0x17) sas_address(0x5000c500d917b71d) port_type(0x1)
[Sun Oct 20 13:12:22 2024] mpt3sas_cm1: log_info(0x31110e05): originator(PL), code(0x11), sub_code(0x0e05)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Direct-Access SEAGATE ST16000NM002G E003 PQ: 0 ANSI: 7
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: SSP: handle(0x0017), sas_addr(0x5000c500d917b71d), phy(5), device_name(0x5000c500d917b71c)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure logical id (0x500304802107f73f), slot(5)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: enclosure level(0x0000), connector name( )
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: qdepth(254), tagged(1), scsi_level(8), cmd_que(1)
[Sun Oct 20 13:12:22 2024] scsi 4:0:12:0: Power-on or device reset occurred
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: Attached scsi generic sg15 type 0
[Sun Oct 20 13:12:22 2024] [98]: scst: Attached to scsi4, channel 0, id 12, lun 0, type 0
[Sun Oct 20 13:12:22 2024] end_device-4:0:8: add: handle(0x0017), sas_addr(0x5000c500d917b71d)
[Sun Oct 20 13:12:22 2024] sd 4:0:12:0: [sdo] Spinning up disk...
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(16) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Read Capacity(10) failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Sense Key : Not Ready [current] [descriptor]
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Add. Sense: Logical unit is in process of becoming ready
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0 512-byte logical blocks: (0 B/0 B)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] 0-byte physical blocks
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Test WP failed, assume Write Enabled
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Asking for cache data failed
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Assuming drive cache: write through
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Preferred minimum I/O size 4096 bytes not a multiple of physical block size (0 bytes)
[Sun Oct 20 13:14:03 2024] sd 4:0:12:0: [sdo] Attached SCSI disk
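Given the “Logical unit is in process of becoming ready” sense data above, would it be worth poking the drive with sg3_utils before anything else? This is just a guess on my part:

# Repeat TEST UNIT READY to see if the drive ever comes ready
sg_turs -n 10 /dev/sdo

# Explicitly send START UNIT in case it's waiting to be spun up
sg_start --start /dev/sdo

# Then retry READ CAPACITY(16)
sg_readcap -l /dev/sdo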
lsblk:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 20G 0 disk
├─sda1 8:1 0 1M 0 part
├─sda2 8:2 0 512M 0 part
└─sda3 8:3 0 19.5G 0 part
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 400G 0 part
├─sdb2 8:18 0 400G 0 part
├─sdb3 8:19 0 16G 0 part
└─sdb4 8:20 0 16G 0 part
sdc 8:32 0 931.5G 0 disk
├─sdc1 8:33 0 400G 0 part
├─sdc2 8:34 0 400G 0 part
├─sdc3 8:35 0 16G 0 part
└─sdc4 8:36 0 16G 0 part
sdd 8:48 0 14.6T 0 disk
├─sdd1 8:49 0 2G 0 part
└─sdd2 8:50 0 14.6T 0 part
sde 8:64 0 14.6T 0 disk
├─sde1 8:65 0 2G 0 part
└─sde2 8:66 0 14.6T 0 part
sdf 8:80 0 14.6T 0 disk
├─sdf1 8:81 0 2G 0 part
└─sdf2 8:82 0 14.6T 0 part
sdg 8:96 0 14.6T 0 disk
├─sdg1 8:97 0 2G 0 part
└─sdg2 8:98 0 14.6T 0 part
sdh 8:112 0 14.6T 0 disk
├─sdh1 8:113 0 2G 0 part
└─sdh2 8:114 0 14.6T 0 part
sdi 8:128 0 14.6T 0 disk
└─sdi1 8:129 0 14.6T 0 part
sdj 8:144 0 5.5T 0 disk
├─sdj1 8:145 0 2G 0 part
└─sdj2 8:146 0 5.5T 0 part
sdk 8:160 0 14.6T 0 disk
├─sdk1 8:161 0 2G 0 part
└─sdk2 8:162 0 14.6T 0 part
sdl 8:176 0 14.6T 0 disk
└─sdl1 8:177 0 14.6T 0 part
sdm 8:192 0 14.6T 0 disk
├─sdm1 8:193 0 2G 0 part
└─sdm2 8:194 0 14.6T 0 part
sdn 8:208 0 5.5T 0 disk
├─sdn1 8:209 0 2G 0 part
└─sdn2 8:210 0 5.5T 0 part
sdo 8:224 0 0B 0 disk
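One thing I’ve read elsewhere is that a SAS drive reporting 0 capacity can be stuck in an incomplete format or sanitize. If I understand sg_format correctly, invoking it with no action options only reports the current format status and doesn’t touch the disk:

# Report-only invocation: prints the block size/count and flags a
# corrupt format; it does NOT start a format without --format
sg_format /dev/sdo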
I also tried to run fsck on the drive:
fsck from util-linux 2.38.1
e2fsck 1.47.0 (5-Feb-2023)
fsck.ext2: Invalid argument while trying to open /dev/sdo
The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem. If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
or
e2fsck -b 32768 <device>
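In hindsight, fsck.ext2 was never going to find anything on a raw pool member, and with the kernel reporting 0 blocks there’s nothing for it to open anyway. Would a raw read at least confirm whether the drive responds to I/O at all? For example:

# Try to read the first MiB; on a 0-capacity device this should fail immediately
dd if=/dev/sdo of=/dev/null bs=1M count=1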
I wanted to upload some screenshots of what I’m seeing in the UI, but this new platform doesn’t seem to allow PNG files to be attached to posts.
I feel like this could be something I did wrong after the initial replacement disk failed during the resilver.
Can anyone advise?
I have already started the RMA on the drive that failed during the resilver. But for two brand-new drives to effectively be DOA seems like extremely bad luck.