Degraded Pool question

I'm running TrueNAS 25.04.2.5.

During a scheduled scrub today, one of the pools became degraded.
On the Storage Dashboard / ZFS Health it shows Pool Status: Degraded, Total ZFS Errors: 0.

If I click on Manage Devices and select the drive that degraded, the ZFS Errors column shows 87 write errors. This is one of 7 drives in that pool.

Does this mean the ZFS data is OK and it's just the drive that is throwing errors?

It's now doing a resilver that is estimated at 1 day 20 hours, which seems like the entire drive, not just the damaged part. Of course SMART says everything is fine, so that's no help. Ditto for Scrutiny.

Any thoughts/guidance on what I'm seeing?

The drive is about 5 years old but has another 9 months or so of warranty, so I want to keep it running until it actually fails (if this isn't it).

Thanks,

Mark

[edit] I forgot to state that I can't run zpool status in the shell because I'm not getting a prompt; it stalls after the last login line (probably because it's very busy? but CPU usage is under 10%, so…)

[edit2] I SSH'd in, and when I try to log in, after I enter the password it says "End of keyboard-interactive prompts from server". A reboot is in order, but I don't want to do that until the resilver is complete. The shares are working fine.

Yes.

Did you replace the faulty drive or has a hot-spare kicked in?

This means ZFS has logged quite a few write errors against that device. It does not mean the drive has failed; however, it does not mean the drive did not fail either.

The resilvering time is an estimate. The longer it runs, the more accurate the estimate is. It might only be 10 hours for example.

The only way to tell if a drive is failing is to read the SMART data, then possibly run a SMART long test on the drive, and then examine the results.
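For example, something along these lines from the shell once you can get one (the device name /dev/sda is just a placeholder; substitute your actual disk):

    smartctl -a /dev/sda           # dump the current SMART attributes and error log
    smartctl -t long /dev/sda      # kick off a long (extended) self-test
    smartctl -l selftest /dev/sda  # check the self-test results once it has finished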

In my signature are Drive Troubleshooting Flowcharts. Take a look at those; they should give you a reasonable guide to figure it out. If you need further help, ask.

I did not replace it yet; SMART says it's OK, but that's not dispositive.

This pool is backup-only, so not a big risk (and there is another backup elsewhere), so I can wait for the resilver to finish and see what happens.

I'm curious: if you haven't replaced the drive, what is it resilvering?

The drive (and pool) says degraded, not failed, so it's resilvering back onto the degraded drive.

I didn't do anything; I just noticed the error, and at that point it was already 5% resilvered with a couple of days to go (since it is still in the middle of a scrub…)

Ah, I see. So the scrub caused the drive to error, but not badly enough to be thrown out of the pool completely, so now you have a scrub and a resilver happening simultaneously, hence the long wait.
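If you want the resilver to get priority, you could pause the scrub once you can get a shell; roughly something like this (the pool name is just a placeholder):

    zpool scrub -p tank   # pause the running scrub; running 'zpool scrub tank' again later resumes it

That's just a sketch, though, and it may not buy you much since both are already in flight.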

That's what it seems like.

I also have a broken shell/SSH; I tried toggling SSH on/off but with no effect.

I have to wait a couple of days for this all to sort out, then reboot, and it should recover then.

The good news: the data on that pool is intact. The bad news: I had RAIDZ3 on all my pools for years but never had an issue, so I changed them to RAIDZ2 when the opportunity arrived. Now I'm wondering if that was an error… I guess in the future I will change the main data pool to RAIDZ3 and leave the backups at RAIDZ2. I don't have room to mirror everything or I would do that…

I think for 7 drives a Z2 is about right. Nothing wrong with a Z3, but so long as you monitor your pool and ideally have spares to hand, Z2 should be fine. Obviously, having a backup of your important data is a separate conversation.

The scrub finished, then the resilvering hung, saying it was now 4+ days.

Rebooted to see if that would help…

It hung, so I had to reboot again.

It came up (yay), but now it shows the drive has 1 ZFS checksum error.

Scrutiny shows zero issues (except a command timeout count of 100, but it shows that on all my drives).

I can now get into the shell, and zpool status shows the drive is being resilvered and it is at 0.00% done; the ZFS Health page shows Resilvering at 0.00% as well, even though it shows Healthy on the Storage Dashboard. And it isn't budging off 0.00%.
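For reference, I'm just re-running zpool status from the shell to watch it; something like this (the pool name is a placeholder, and this assumes watch is available):

    zpool status -v tank            # -v also lists any files with permanent errors
    watch -n 60 zpool status tank   # or refresh the status every 60 seconds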

This seems odd at best.

I'm thinking if nothing happens in the next hour or so, another reboot is in order; something got out of sync somewhere (a bug, but no way to reproduce, so…)

[edit] After some time, resilvering started, claiming 45 minutes. I can wait that long. I'm just not sure about this drive. I'll run another long SMART test when the resilvering is done, but I'm not sure that will show what's going on here.

[edit2] Now we are getting somewhere :slight_smile: zpool status shows one device has an unrecoverable error and says 'Determine if the device needs to be replaced'. Maybe it's just easier to replace it now than wait for another episode.
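For reference, if I do swap it I'll use the Replace action in the TrueNAS UI, but the underlying operation is roughly this (pool and device names are placeholders):

    zpool replace tank /dev/disk/by-id/old-disk /dev/disk/by-id/new-disk   # resilver onto the new disk
    zpool clear tank /dev/disk/by-id/old-disk                              # or, if keeping the drive, reset its error counters instead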