So, let me explain something here… the failure being reported is a ZFS failure, not a hard drive failure.
While a hard drive failure can cause a ZFS failure, that is often not the case; it can also be caused by a system crash, a sudden reboot, or a power issue.
Two things:
Run zpool clear tank to clear the error at hand.
Run a SMART Long test on the correct drive. Once that completes, you can post the results, or you could use what I provided earlier to diagnose whether you have a problem.
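Roughly what those two steps look like from a shell, as a sketch (the device /dev/ada3 is a placeholder; substitute your own from zpool status or smartctl --scan):

    # clear ZFS's error counters for the pool
    zpool clear tank

    # start a SMART extended (Long) self-test on the suspect drive
    smartctl -t long /dev/ada3

    # once the estimated completion time has passed, pull the full report
    smartctl -a /dev/ada3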
If the drive is faulty, replace it. If the drive looks good, keep an eye on it. If the same type of failure occurs again on this specific drive, it is time to troubleshoot and run another Long test.
My strong advice: set up a daily Short test for all your drives; it is a simple test that takes two minutes at most. Also set up a weekly Long test for all your drives. With 12TB drives (just over 18 hours to test one drive), I would recommend one drive a day, for example Mon, Tue, Wed, Thu, Fri. I would run the Short tests at, say, 2200 (10pm) and then run a Long test at 2215 (10:15pm). This is an easy setup and just an example; change the days/times to whatever works for you.
How would that impact your system? With RAIDZ2, it should not impact your system at all. SMART tests are always the lowest priority, so any data requests are bumped to the top and fulfilled; when the drive is idle, the testing resumes. Spreading the Long tests out reduces heat buildup as well, especially if you have the drives next to each other with very little gap for air.
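If your NAS software has a built-in SMART test scheduler (TrueNAS does), set this up there. Purely as a sketch of the schedule described above, root cron entries along these lines would do the same thing; the device names and the smartctl path are assumptions for your system:

    # daily Short test on every drive at 22:00
    0 22 * * * /usr/local/sbin/smartctl -t short /dev/ada0
    0 22 * * * /usr/local/sbin/smartctl -t short /dev/ada1
    0 22 * * * /usr/local/sbin/smartctl -t short /dev/ada2
    0 22 * * * /usr/local/sbin/smartctl -t short /dev/ada3
    0 22 * * * /usr/local/sbin/smartctl -t short /dev/ada4

    # weekly Long test, one drive per night at 22:15 (Mon=1 ... Fri=5)
    15 22 * * 1 /usr/local/sbin/smartctl -t long /dev/ada0
    15 22 * * 2 /usr/local/sbin/smartctl -t long /dev/ada1
    15 22 * * 3 /usr/local/sbin/smartctl -t long /dev/ada2
    15 22 * * 4 /usr/local/sbin/smartctl -t long /dev/ada3
    15 22 * * 5 /usr/local/sbin/smartctl -t long /dev/ada4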
Interesting that you say this is ZFS and not the physical drive; how would that then be localised to just the one drive?
The original log was a short run… on the wrong drive… which I picked up when I ran the long check last night… and used Beyond Compare this morning to view the files side by side… sorry…
I currently have 3 pools, so I stagger short tests during the week; I don't normally run the long tests.
I already have a replacement drive on the way; I'm not sure how much stress I want to put on this drive… if it might be the cause of the problems, as the errors are only being reported on it…
My worry at the moment is, yeah, RAIDZ2 is saving my @ss… as long as this drive is OK and there isn't another drive with problems, since all 5 replacement drives came from the same supplier.
ZFS is the format in which the data is written. There was some sort of data corruption, possibly caused by a system crash (it happens). Whether it is localized to a single drive depends on where the system was in the write process; I've seen 1, 2, or more drives affected by ZFS corruption. The key indicators are the zpool results and whether you have any drive failure indicators. So far you have not shown a single drive failure indicator.
‘SCRUB Failure != SMART Failure’ (!= not equal)
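A quick way to look at both sides of that, as a sketch (the device name is a placeholder):

    # ZFS's view: pool state, per-drive read/write/checksum counters, last scrub result
    zpool status -v tank

    # the drive's own view: overall SMART health and the self-test log
    smartctl -H /dev/ada3
    smartctl -l selftest /dev/ada3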
That is a mistake in my opinion. You should be running Long tests as well. The Short test just verifies a few locations on the drive and that the drive electronics are working; the Long test reads every sector on the drive, meaning every place you could put data. However, I will not force the issue; only you know how you want to run your system and how far in advance of a hard drive failure you would like to be warned.
I don’t understand. You typed the command I provided above and it kicked off a resilver?
Hope the Long test you run comes back with no errors. You now have the tools to decipher whether a drive is throwing errors or not.
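For what it's worth, the classic drive-failure indicators on an ATA drive can be pulled out with something like this (the device is a placeholder; non-zero raw values on these attributes are the warning signs):

    smartctl -A /dev/ada3 | egrep -i "Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect"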
Understood, will put a standard schedule in place. I think my plan to collapse the 2-disk pool into one wider pool will also make this easier.
Going to grow the 5x RAIDZ2 to 10 drives… and remove the bunker disk pool.
Will then see how long a Long test takes and basically move from drive to drive, cycling through the set…
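For reference, widening an existing RAIDZ2 vdev relies on the RAIDZ expansion feature (OpenZFS 2.3 or newer, so a recent TrueNAS release), and disks are attached one at a time. A rough sketch, with the vdev name and device as assumptions (use the vdev name zpool status actually shows):

    # confirm the feature is available on the pool
    zpool get feature@raidz_expansion tank

    # attach one new disk to the existing raidz2 vdev; wait for the expansion to finish before the next
    zpool attach tank raidz2-0 /dev/ada5

    # watch expansion progress
    zpool status tank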
yes…
The drive itself was showing as totally faulted.
Will kick off a Long test on the drive again tonight and advise.
A Long test for the 12TB drive will take 1085 minutes (about 18 hours 5 minutes), longer if the drive is doing something else such as routine NAS work. If you can, kick off the Long test as soon as the resilver has completed.
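As a sketch of that sequence (the device name is a placeholder):

    # wait until the resilver reports it has completed with 0 errors
    zpool status tank

    # then start the extended test; smartctl prints an estimated completion time
    smartctl -t long /dev/ada3

    # check the result after the estimated time has passed
    smartctl -l selftest /dev/ada3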
Yes, that drive is toast. So this was the first actual drive failure notification you received.
It is very odd to fail a Long test with so few power-on hours, but it proves infant mortality still exists and is the reason everyone should perform some sort of burn-in test.
This wasn't a refurbished drive, was it? I doubt it; I checked the warranty and it looks good.
If you haven't already, run a Long test on your other drives. Remember, this is the lowest priority, so any transactions you need the NAS to do will happen without delay. If you have the drives packed tightly together where they could build up heat, don't test adjacent drives at the same time, to limit the heat buildup; with proper cooling, that will not be an issue.
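To keep track of where each drive stands as you work through the Long tests, a rough sketch of a quick sweep (the device list is an assumption; smartctl --scan will show what you actually have):

    # list the drives smartmontools can see
    smartctl --scan

    # quick health + self-test-history pass over each drive
    for d in /dev/ada0 /dev/ada1 /dev/ada2 /dev/ada3 /dev/ada4; do
        echo "== $d =="
        smartctl -H -l selftest "$d"
    done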
I'd argue that there is no point in doing so, as the SMART failure is already clear and immediate RMA territory. All badblocks will do is delay this by approximately a week while you wait for it to finish.
I'm saying that it should be RMA'd ASAP. I wouldn't personally bother marking anything at all, because it is a failed product that is under warranty and should either be returned per the merchant agreement for a like-for-like replacement, or you can open an RMA with Seagate to get a replacement, as the drive is showing in warranty until 2026.
Either way you should get a working drive instead of a degraded/paperweight drive.