Pool degraded, ghost of old disk follows me

Hi all,
I had the following problem:
I have a configuration with 6 disks (RAIDZ2, 4+2), and the other day disk2 and disk3 failed in quick succession.
I replaced disk2 and resilvered; everything was OK.
I replaced disk3 and resilvered; everything was OK.
Then I started a scrub, and at around 30% I got the degraded-pool warning… but the serial number reported is that of the old disk2, in the position of the new disk3…
If I reboot, the pool is OK (not degraded).
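
For context, each replacement was equivalent to roughly this CLI sequence (a sketch; the gptid and da names are illustrative, not my actual ones):

zpool status -v NAS32TB                      # note the gptid of the disk being replaced
zpool replace NAS32TB gptid/<old-disk> da3   # resilver onto the new drive
zpool status NAS32TB                         # wait until the resilver completes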

I also ran an extended SMART test on both replacement disks, and they came back fine…
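
(The long test was something like this; the da numbers are illustrative:)

smartctl -t long /dev/da3    # start the extended self-test (takes hours on an 8TB drive)
smartctl -a /dev/da3         # check the self-test log and attributes once it finishes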

What can I check?

Thank you

First, the output of zpool status -v, formatted as code, please.
Then more details about the hardware (especially the controllers) and your TrueNAS version.


root@truenas[~]# zpool status -v
  pool: NAS32TB
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Fri Aug 16 16:58:27 2024
        14.2T scanned at 956M/s, 12.5T issued at 844M/s, 28.4T total
        20.8M repaired, 44.09% done, 05:28:26 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS32TB                                         DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/2056e36d-5af1-11ed-ac4c-0cc47ad9cf64  ONLINE       0     0     0
            gptid/2060cb7e-5af1-11ed-ac4c-0cc47ad9cf64  ONLINE       0     0     0
            gptid/9f1b3bdd-58e0-11ef-b131-0cc47ad9cf64  ONLINE       0     0     0
            gptid/52f7af41-5944-11ef-b131-0cc47ad9cf64  FAULTED    727     0     0  too many errors
            gptid/20398d95-5af1-11ed-ac4c-0cc47ad9cf64  ONLINE       0     0     0
            gptid/208ab6cd-5af1-11ed-ac4c-0cc47ad9cf64  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:29 with 0 errors on Tue Aug 13 03:45:29 2024
config:

Motherboard: Supermicro X10SL7-F
CPU: i3-4170
RAM: 16GB ECC

The hard disks were 6x WD80EFZZ; the two new ones are Seagate Exos.


Good hardware (provided that the onboard SAS controller is flashed to the appropriate latest firmware), sane pool layout (I was a bit worried by the “4+2”).

Not sure where the “disk2” and “disk3” monikers come from. The order in which drives are listed, as well as drive numbers (da#/ada#) or letters (sdX), may change across reboots.

If the drive with GPTID 52f7af41-5944-11ef-b131-0cc47ad9cf64 is new (burnt in?) and passes long SMART tests, you may want to check or replace the data cable. But the base hypothesis is that this drive is failing (this can happen even with new drives) and needs to be replaced.
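
To be sure which physical drive that gptid is, you can map it to a device node and read the serial off it, e.g. on CORE/FreeBSD (the da number is illustrative):

glabel status | grep 52f7af41            # shows which daX partition carries this gptid
smartctl -i /dev/da3 | grep -i serial    # read the serial number from that device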

This is going to sound condescending, and I don’t mean it to: did you physically validate the serial numbers on the disks to make sure you didn’t put the old one back in by mistake?


Also, that would explain the twin failures in succession.

I’ve personally removed the wrong disk because of the “was blah blah” message referring to the “wrong” disk (it’s historical and doesn’t necessarily represent the current state).

Thus it is necessary to verify the serials when pulling the disk.
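
A quick sketch for dumping the device-to-serial mapping before pulling anything (FreeBSD-style device names assumed; adjust the glob for your system):

for d in /dev/da[0-9]; do
  printf '%s: ' "$d"
  smartctl -i "$d" | grep -i 'serial number'
done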


Impossible. I labeled the outside of each disk with its serial number.
I also replaced the WDs with Seagates, and they look different anyway.

OK, this explains why I’m shown the serial number of the old disk… is there a way to update the serial numbers?

Update: now disk2 also shows up as degraded again… wtf?
Now I’ve tried replacing disk3 and changing the SATA cable and the port on the mobo…
Could it be a power supply problem?
Both drives are on the same power cable.

Yes. Having ruled out the obvious, it’s time to consider the (not so) implausible.
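
One way to test that hypothesis (illustrative commands for CORE/FreeBSD; a sketch, not a recipe): look for bus resets or timeouts clustered around the fault times. Errors that follow a shared power cable rather than a single drive point at the PSU or a splitter.

zpool events -v | less                   # timestamps of the ZFS error events
dmesg | grep -iE 'cam|reset|timeout'     # CAM-layer resets hint at cabling/power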