Hi all,
So my main pool has gotten into a bit of a weird degraded state. I’ll give the story in point form, then I’ll provide the zpool status:
- End of last year I moved my MainNAS server to a new case along with 2 new (second-hand) HBAs from Art of Server.
- At the end of January one of the scrubs faulted one of the disks in the pool
- I then initiated an HDD replacement in a spare bay with an existing 6TB disk I had.
- During the resilvering process, the whole pool went offline because an entire vdev went offline.
- I then shut down the server, as we were less than 2 weeks away from going on a family holiday and I just didn't have the time to deal with it.
- After the holiday, when I had a little bit of time, I pulled all the HDDs out of the server, put them aside, and populated the server with old smaller HDDs that had nothing on them for testing purposes. I had diagnosed that an HBA had overheated during that failed resilvering process in late January.
- I have rectified the HBAs' overheating by putting an additional fan over their heat sinks and confirmed they're working fine with the test HDDs, running both badblocks and filling the test pool with random data, then running multiple scrubs. Note this was on a fresh copy of the latest TrueNAS Core that isn't using my original config; I didn't want jails and VMs running in the mix.
- Last night I put the original HDDs back into the server and imported the pool into the fresh copy of TrueNAS Core.
- The pool imported successfully but went into a resilvering process for a number of HDDs.
- 16 hours later the resilvering process completed and the pool is available again, but in a DEGRADED state.
Here is the output of the zpool status:
root@mainnas[~]# zpool status Volume
  pool: Volume
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 252K in 00:00:02 with 0 errors on Tue May 21 09:12:22 2024
config:

        NAME                                              STATE     READ WRITE CKSUM
        Volume                                            DEGRADED     0     0     0
          raidz2-0                                        ONLINE       0     0     0
            gptid/f81a58ad-530a-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/5923e69a-5361-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/ef4c370b-53b6-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/f05fd677-22a5-11ed-a6f6-002590f0cf92    ONLINE       0     0     0
            gptid/d7d639dd-5267-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
            gptid/79beb246-52bb-11eb-a2fa-002590f0cf92    ONLINE       0     0     0
          raidz2-1                                        DEGRADED     0     0     0
            gptid/c30cdc67-55b2-11e7-8f32-002590f0cf92    ONLINE       0     0     0
            gptid/0809c059-56cd-11e7-8a5c-002590f0cf92    ONLINE       0     0     0
            replacing-2                                   DEGRADED     0     0    30
              gptid/71e5b94d-5667-11e7-a11e-002590f0cf92  ONLINE       0     0     0
              gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92  UNAVAIL      3   529     0  cannot open
            gptid/003cde3e-47ff-11e7-8632-002590f0cf92    DEGRADED     0     0     0  too many errors
            gptid/da4fde72-572b-11e7-88f6-002590f0cf92    ONLINE       0     0     2
            gptid/d0eb580c-5803-11e7-adb2-002590f0cf92    ONLINE       0     0     0

errors: No known data errors
I suspect the HDD that is being replaced is OK and the UNAVAIL replacement doesn't need to occur anymore. I'm unsure why the replacement disk was marked UNAVAIL; when I ran gpart I couldn't find it in the list.
I'm thinking I should just remove the UNAVAIL disk, and the HDD it was going to replace should be fine, as I suspect it was only faulted back in January due to the HBA overheating.
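For what it's worth, to double-check which physical disk that gptid maps to I was planning to run something like this (just a rough sketch; the da5 device name below is only a placeholder, not necessarily the right device on my system):

# list gptid labels and the da* provider backing each one
glabel status | grep 3babc0ab

# if a provider shows up, check that disk's SMART health
smartctl -a /dev/da5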
So I consulted ChatGPT on how to remove the UNAVAIL disk, and it came up with the commands below. Because I don't 100% trust AI (past experience of constantly trying to get a correct answer from it and never quite getting one), I thought I would ask here with some real intelligent people whether it got this right.
It seems to think that the disk it was going to replace will then go back ONLINE, leaving only one disk still in a degraded state.
zpool offline Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool detach Volume gptid/3babc0ab-b6bb-11ee-9379-002590f0cf92
zpool clear Volume
zpool status Volume
The next step after I complete this one is to deal with the disk:
gptid/003cde3e-47ff-11e7-8632-002590f0cf92 DEGRADED 0 0 0 too many errors
which I suspect is once again a case of the HDD being fine and the errors just being a result of the HBA overheating back in January… but that will be the next step after this one…
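If the detach works out, my rough plan for that disk would be something along these lines (just a sketch, and only if its SMART data comes back clean):

# clear the error counters on just that one disk
zpool clear Volume gptid/003cde3e-47ff-11e7-8632-002590f0cf92

# then run a full scrub and watch whether the errors come back
zpool scrub Volume
zpool status Volume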
Any advice would be appreciated!