Degraded pool in a bit of a weird state

Hi all,

So my main pool has gotten into a bit of a weird degraded state. I’ll give the story in point form, then I’ll provide the zpool status:

  • End of last year I moved my MainNAS server to a new case, along with two new (second-hand) HBAs from Art of Server.
  • At the end of January one of the scrubs faulted one of the disks in the pool
  • I then initiated an HDD replacement in a spare bay, using an existing 6TB disk I had.
  • During the resilvering process, the whole pool went offline because an entire vdev dropped offline.
  • I then shut down the server; we were less than two weeks away from a family holiday and I just didn’t have the time to deal with it.
  • After the holiday, when I had a little time, I pulled all the HDDs out of the server, put them aside, and populated the server with old, smaller HDDs that had nothing on them, for testing purposes. I had diagnosed that an HBA had overheated during that failed resilvering process in late January.
  • I have rectified the HBAs’ overheating by putting an additional fan over their heat sinks, and confirmed they’re working fine with the test HDDs by running badblocks, filling the test pool with random data, and running multiple scrubs (rough commands after this list). Note this was on a fresh install of the latest TrueNAS Core that isn’t using my original config; I didn’t want jails and VMs running in the mix.
  • Last night I put the original HDDs back into the server and imported the pool into the fresh copy of TrueNAS Core.
  • The pool imported successfully but went into a resilvering process for a number of HDDs.
  • 16 hours later the resilvering process had completed and the pool is available again, but in a DEGRADED state.
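
For reference, the burn-in on the test disks was roughly along these lines. daX and TestPool are placeholders rather than my actual device/pool names, and badblocks in write mode is destructive, so it only went on disks with nothing on them:

badblocks -ws -b 4096 /dev/daX
dd if=/dev/urandom of=/mnt/TestPool/fill.bin bs=1M
zpool scrub TestPool
zpool status TestPool

The dd just keeps writing random data until the test pool fills up, and the scrubs then verify everything reads back clean.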

Here is the output of the zpool status:

root@mainnas[~]# zpool status Volume
  pool: Volume
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 252K in 00:00:02 with 0 errors on Tue May 21 09:12:22 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        Volume                                            DEGRADED     0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/f81a58ad-530a-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/5923e69a-5361-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/ef4c370b-53b6-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/f05fd677-22a5-11ed-a6f6-002590f0cf92    ONLINE       0     0     0
            gptid/d7d639dd-5267-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/79beb246-52bb-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
          raidz2-1                                        DEGRADED     0     0     0
            gptid/c30cdc67-55b2-11e7-8f32-002590f0cf92    ONLINE       0     0     0
            gptid/0809c059-56cd-11e7-8a5c-002590f0cf92    ONLINE       0     0     0
            replacing-2                                   DEGRADED     0     0    30
              gptid/71e5b94d-5667-11e7-a11e-002590f0cf92  ONLINE       0     0     0
              gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92  UNAVAIL      3   529     0  cannot open
            gptid/003cde3e-47ff-11e7-8632-002590f0cf92    DEGRADED     0     0     0  too many errors
            gptid/da4fde72-572b-11e7-88f6-002590f0cf92    ONLINE       0     0     2
            gptid/d0eb580c-5803-11e7-adb2-002590f0cf92    ONLINE       0     0     0

errors: No known data errors

I suspect the HDD that is being replaced is actually OK and the UNAVAIL replacement no longer needs to happen. I’m not sure why the replacement disk was marked UNAVAIL; when I ran gpart I couldn’t find it in the list.
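
For what it’s worth, I believe something like the following can confirm on TrueNAS Core (FreeBSD) whether that gptid still maps to a device the OS can see; the grep pattern is just the start of the gptid from the status output above:

glabel status | grep 3babc0ab
camcontrol devlist

If glabel returns nothing for it, the partition simply isn’t being presented to the OS at all, which would line up with the UNAVAIL / cannot open state.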

I’m thinking I should just remove the UNAVAIL disk, and the HDD it was going to replace should be OK, as I suspect it was only originally faulted back in January because of the HBA overheating.

So I consulted ChatGPT on how to remove the UNAVAIL disk, and it came up with the following. Because I don’t 100% trust AI (past experience of repeatedly trying to get a correct answer out of it and never quite getting there), I thought I’d ask here, with some real intelligent people, whether it got this right.
It suggested the following commands to remove the UNAVAIL disk, and it seems to think the disk it was going to replace will then go back ONLINE, leaving only one disk still in a degraded state.

zpool offline Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool detach Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool clear Volume
zpool status Volume

The next step after I complete this one is to deal with the disk:

gptid/003cde3e-47ff-11e7-8632-002590f0cf92    DEGRADED     0     0     0  too many errors

which I suspect is once again a case of the HDD itself being fine, with the errors just being a result of the HBA overheating back in January… but that will be the next step after this one…
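
The rough sequence I have in mind for that disk, assuming it really is healthy; daX is just a placeholder for whichever device that gptid maps to:

smartctl -a /dev/daX
zpool clear Volume gptid/003cde3e-47ff-11e7-8632-002590f0cf92
zpool scrub Volume
zpool status Volume

i.e. check SMART first, clear the error counters on just that disk, then scrub and see whether the errors come back.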

Any advice would be appreciated.

I’d pull that disk that’s clearly failed and let the replacement run through via the GUI.

If you think the second one is fine, you can just run zpool clear Volume to clear the errors and see if they reoccur. Focus on the failed drive first.

Here is an update: I ended up following the commands as outlined in my first post, and it worked…

I then performed a scrub, which was successful.
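
For anyone finding this later, that was just the standard scrub kicked off from the shell (it can also be started from the GUI), with zpool status showing the progress on the scan: line while it runs:

zpool scrub Volume
zpool status Volume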

I was originally thinking that I might have lost the entire pool back when the problems started to happen, but nope, no data loss.

It’s still amazing how resilient ZFS is.