Degraded pool in a bit of a weird state

Hi all,

So my main pool has gotten into a bit of a weird degraded state. I’ll give the story in point form, then I’ll provide the zpool status:

  • At the end of last year I moved my MainNAS server to a new case, along with 2 new-to-me second-hand HBAs from Art of Server.
  • At the end of January one of the scrubs faulted one of the disks in the pool
  • I then initiated an HDD replacement in a spare bay with an existing 6 TB disk I had.
  • During the resilvering process, the whole pool went offline because an entire vdev went offline.
  • I then shut down the server, as we were less than 2 weeks away from going on a family holiday and I just didn’t have the time to deal with it.
  • After the holiday, when I had a little bit of time, I pulled all the HDDs out of the server, put them aside, and populated the server with old, smaller HDDs that had nothing on them for testing purposes. I had diagnosed that an HBA had overheated during that failed resilvering process in late January.
  • I have rectified the HBAs’ overheating by putting an additional fan over their heat sinks and confirmed they’re working fine with the test HDDs I have, running both badblocks and filling up the test pool with random data and running multiple scrubs (example commands after this list). Note this was on a fresh, latest copy of TrueNAS Core that isn’t using my original config; I didn’t want jails and VMs running in the mix.
  • Last night I put the original HDDs back into the server and imported the pool into the fresh copy of TrueNAS Core.
  • The pool imported successfully but went into a resilvering process for a number of the HDDs.
  • 16 hours later the resilvering process completed and the pool is available again, but in a DEGRADED state.
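
For anyone curious, the badblocks/scrub testing on the throwaway disks was roughly along these lines (example invocations only; the da0 device and the Test pool name are placeholders, and badblocks in write mode is destructive, so only run it on disks with nothing on them):

badblocks -wsv /dev/da0    # destructive write-mode pass over the whole disk, with progress
smartctl -t long /dev/da0  # follow up with a long SMART self-test
zpool scrub Test           # then scrub the test pool a few times after filling it with data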

Here is the output of the zpool status:

root@mainnas[~]# zpool status Volume
  pool: Volume
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 252K in 00:00:02 with 0 errors on Tue May 21 09:12:22 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        Volume                                            DEGRADED     0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/f81a58ad-530a-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/5923e69a-5361-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/ef4c370b-53b6-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/f05fd677-22a5-11ed-a6f6-002590f0cf92    ONLINE       0     0     0
            gptid/d7d639dd-5267-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/79beb246-52bb-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
          raidz2-1                                        DEGRADED     0     0     0
            gptid/c30cdc67-55b2-11e7-8f32-002590f0cf92    ONLINE       0     0     0
            gptid/0809c059-56cd-11e7-8a5c-002590f0cf92    ONLINE       0     0     0
            replacing-2                                   DEGRADED     0     0    30
              gptid/71e5b94d-5667-11e7-a11e-002590f0cf92  ONLINE       0     0     0
              gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92  UNAVAIL      3   529     0  cannot open
            gptid/003cde3e-47ff-11e7-8632-002590f0cf92    DEGRADED     0     0     0  too many errors
            gptid/da4fde72-572b-11e7-88f6-002590f0cf92    ONLINE       0     0     2
            gptid/d0eb580c-5803-11e7-adb2-002590f0cf92    ONLINE       0     0     0

errors: No known data errors

I suspect the HDD that is being replaced is OK and the UNAVAIL replacement doesn’t need to occur anymore. I’m unsure why the replacement disk was marked UNAVAIL; when I ran gpart I couldn’t find it in the list.

I’m thinking I should just remove the UNAVAIL disk, and the HDD it was going to replace should be OK, as I suspect it was originally faulted back in January due to the HBA overheating.

So I consulted ChatGPT on how I could remove the UNAVAIL disk. Because I don’t 100% trust AI, based on past experience of constantly trying to get a correct answer from it and never quite getting one, I thought I would ask here with some genuinely intelligent people whether the AI got it correct.
It came up with the following commands to remove the UNAVAIL disk, and it seems to think that the disk it was going to replace will then go back ONLINE, leaving only one disk still in a degraded state.

zpool offline Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool detach Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool clear Volume
zpool status Volume
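
Before running the detach I’ll probably also double-check what that gptid maps to, something like the following (assuming the usual FreeBSD tools on TrueNAS Core):

glabel status | grep 3babc0ab    # see whether the UNAVAIL gptid still maps to any da/ada device
zpool status -v Volume           # one more look at the replacing-2 vdev before detaching anything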

The next step after I complete this one is to deal with the disk:

gptid/003cde3e-47ff-11e7-8632-002590f0cf92    DEGRADED     0     0     0  too many errors

which I suspect is, once again, a case of the HDD being fine and the errors just being a result of the HBA overheating back in January… but that will be the next step after this one…
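
If that disk really is fine, my understanding is that the per-device form of zpool clear plus a scrub should show fairly quickly whether the errors come back (just a sketch of what I intend to try, not something I’ve run yet):

zpool clear Volume gptid/003cde3e-47ff-11e7-8632-002590f0cf92   # reset the error counters on just that disk
zpool scrub Volume                                              # re-read everything on the pool
zpool status Volume                                             # watch whether READ/WRITE/CKSUM errors reappear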

Any advice would be appreciated.

I’d pull the disk that’s clearly failed and let the replacement run through via the GUI.

If you think the second one is fine you can just run zpool clear Volume to clear the errors and see if they reoccur. Focus on the failed drive first.

Here is an update: I ended up following the commands as outlined in my first post, and it worked…

I then performed a scrub, which was successful.

I was originally thinking that I might have lost the entire pool back when the problems started to happen, but nope, no data loss.

It’s still amazing how resilient ZFS is.

Similar sounding issue. I had sde fail, pulled the disk, and inserted its replacement into the same slot. While the replacement was resilvering, the replacement disk failed a SMART test and was marked offline:

NAME STATE READ WRITE CKSUM
zpool0 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
sdm2 ONLINE 0 0 0
sdn2 ONLINE 0 0 0
sdc2 ONLINE 0 0 0
c19ba163-c404-49ce-a425-48c80564fa6a ONLINE 0 0 0
sdj2 ONLINE 0 0 0
sdk2 ONLINE 0 0 0
sdi2 ONLINE 0 0 0
replacing-7 DEGRADED 0 0 0
7779937366523433665 FAULTED 0 0 0 was /dev/sde2
c7189f34-57e0-4f48-b4c9-6d1c6faa3560 REMOVED 0 0 0
61ba8a0c-ce93-4873-b4c9-90f73b4c4c02 ONLINE 0 0 0
sdf2 ONLINE 0 0 0
sdd2 ONLINE 0 0 0
sde2 ONLINE 0 0 0

So I am in a bit of a situation. I will back all this data up, try the same procedure as the original poster, and report back.
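
For the backup step I’m planning something along these lines (a rough sketch; the backup pool name is a placeholder for wherever the data ends up):

zfs snapshot -r zpool0@pre-detach                               # recursive snapshot of every dataset first
zfs send -R zpool0@pre-detach | zfs receive -F backup/zpool0    # replicate the lot to another pool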

My adapters are a pair of LSI 9211-8i cards in IT mode, on firmware 20.00.something (hazy memory).
The disks are 11 second-hand HGST 6 TB drives.

Running on TrueNAS SCALE 24.04.1.1, migrated from a long-running Core install.

To be honest, with the commands I ran I was using ChatGPT to diagnose the problem, but I also understood what it was explaining to me. Still, one can never fully trust today’s AI; it has a tendency to get things wrong sometimes.

I formatted your zpool status, as it was a little hard to read with it all left aligned:

NAME                                              STATE     READ WRITE CKSUM
zpool0                                            DEGRADED     0     0     0
  raidz2-0                                        DEGRADED     0     0     0
    sdm2                                          ONLINE       0     0     0
    sdn2                                          ONLINE       0     0     0
    sdc2                                          ONLINE       0     0     0
    c19ba163-c404-49ce-a425-48c80564fa6a          ONLINE       0     0     0
    sdj2                                          ONLINE       0     0     0
    sdk2                                          ONLINE       0     0     0
    sdi2                                          ONLINE       0     0     0
    replacing-7                                   DEGRADED     0     0     0
      7779937366523433665                         FAULTED      0     0     0 was /dev/sde2
      c7189f34-57e0-4f48-b4c9-6d1c6faa3560        REMOVED      0     0     0
    61ba8a0c-ce93-4873-b4c9-90f73b4c4c02          ONLINE       0     0     0
    sdf2                                          ONLINE       0     0     0
    sdd2                                          ONLINE       0     0     0
    sde2                                          ONLINE       0     0     0

So was c7189f34-57e0-4f48-b4c9-6d1c6faa3560 the original failing sde HDD that you had removed, and 7779937366523433665 the new one you put in its place that faulted due to a SMART failure?

If that is the case, then I would assume that 7779937366523433665 should be removed, and then from the UI you should be able to start all over again with an HDD that hasn’t got issues…
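
If that reading is right, the command-line version would presumably mirror what I did on my pool, something like the below (a sketch only; definitely confirm the IDs against your own zpool status before running anything):

zpool detach zpool0 7779937366523433665    # drop the faulted original out of the replacing-7 vdev
zpool status zpool0                        # check what the replacing vdev looks like afterwards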