Actually @WiteWulf has a different kind of problem. Multi-Report should work fine for drives whose counter rolls over at the 64K value; in other words, there is no difference from how it currently works for you. That said, I should make the tracking of which drives have been tested work better.
Right then…first off: apologies to @joeschmuck, my comments were not in any way meant to demean the quality of your documentation, just my interpretation of them. Your documentation is robust and comprehensive, far better than a lot of FOSS!
I had a load more SMART long tests fail over the weekend, and some seem to have gotten “stuck” running for many hours. I also accrued 24 checksum errors on the disk that had previously flagged a read error.
This morning I shut down and fully powered down the system, pulling the mains leads from the server and disk shelf. I reseated all disks and the SAS cable between the disk shelf and the HBA. The server booted up with no errors or warning lamps.
(As an aside: I thought I might as well run a memtest on the server while I had it down for maintenance, but was surprised to see this not included as an option on the grub menu. For an OS so reliant on ZFS, and memory integrity, I would have thought this was almost a necessity.)
Pool1 is now showing no topology or ZFS Health issues in the TrueNAS UI, and zpool status reflects that also. I’m running a scrub now on that pool just to be sure:
I don’t want to run a SMART test on one of the disks that’s part of the pool being scrubbed, so I’ve issued one on a spare disk that’s in the same disk shelf. The scrub should finish in about 3.5hrs, and the SMART test should finish in around an hour. I’ll update with the results later.
In Multi-Report, if you change the setting SCRUB_Minutes_Remaining=60 to 1 minute, it will not run a SMART Long test on a pool performing a SCRUB; however, it will run a SMART Short test if a Long test was scheduled. The 0 value does not work properly: with it, no SMART tests are run at all. I still need to fix that issue.
Of course, the default setting will not allow any SMART tests during a RESILVER on a pool.
As for the documentation, I personally can see problems. Maybe I’m too critical of myself? But I’d like to make it better. My plan is to make the next version's GUI configuration more obvious so the docs do not have to be extremely detailed.
I am glad cycling power “seems to” have worked. I hope all your testing passes. You just had so many identical problems across many drives that it is very odd.
Yeah, so all done and I’m happy it’s working properly now.
The long SMART tests on the two unused disks in the disk shelf passed, the scrub completed with no errors, and long tests on two disks that previously failed repeatedly have now also passed.
This computer obviously got very upset about something. A warm reboot didn’t sort it, but a full cold boot of all systems (including the disk shelf) seems to have done the trick. What precipitated the problems? I honestly can’t say, but I’m still very suspicious of those power on hours counters all being at 65536 when this first started.
While I would say that this would be the first time a POH counter has caused such an error, I would not rule it out. It could have been a combination of the NETAPP and drives. I know nothing about the NETAPP hardware.
But I am glad to hear it all appears to be running well again. I hope it stays that way.
Do you know whether the actual rollover of some disks at 65K is the cause of the issue, or whether it is because POH is suddenly zero, or whether some disks simply can’t handle larger numbers once they pass the 65K mark?
I have some HGST disks with a lot more hours than 65K and no problems. As best I can tell, the counter never rolled over.
As I understand it, the manufacturer did not account for the longevity of the drive and the counter is only 16 bits wide so it rolls over.
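To make the rollover concrete, here is a minimal sketch of what a 16-bit hours field would report as the true hours climb past the limit. The function name `reported_hours` is made up for illustration, and the assumption that the field wraps cleanly modulo 65536 follows from the 16-bit width described above rather than from any drive firmware documentation.

```python
# Assumption: the power-on-hours field is an unsigned 16-bit counter
# that wraps modulo 2**16 once it overflows.
COUNTER_BITS = 16
COUNTER_MODULUS = 1 << COUNTER_BITS  # 65536, the "64K" value


def reported_hours(true_hours: int) -> int:
    """Return the value a 16-bit power-on-hours field would report."""
    return true_hours % COUNTER_MODULUS


if __name__ == "__main__":
    # Just below the limit, exactly at it, and 293 hours past it.
    for true_hours in (65_535, 65_536, 65_829):
        print(true_hours, "->", reported_hours(true_hours))
```

The largest value such a field can hold is 65535; one hour later it reads 0 again.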
We see this in the POH counter and in the SMART Self-Test POH counter; those are the ones I have seen. A drive may have one of these issues, both, or none.
In Multi-Report, I try to adjust for this, as it can cause issues when tracking the testing. Imagine POH rolls over to 293 hours, but the SMART last test hours is 65500. That becomes a Warning issue for the Test Age. However, when I see that the SMART POH is greater than the POH, I add 64K to the POH and do the math. This only lasts as long as there is a discrepancy like this. If the SMART POH does roll over, then things are normal again as far as the script is concerned. If it does not roll over, the 64K continues to be added to the POH for the math.
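The adjustment described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the actual Multi-Report code (which is a shell script); the function and parameter names are hypothetical, and it assumes 64K means exactly 65536 hours and that at most one rollover separates the two counters.

```python
# Assumption: "64K" is exactly 65536 hours (one wrap of a 16-bit counter).
ROLLOVER = 65_536


def self_test_age_hours(poh: int, last_test_poh: int) -> int:
    """Hours since the last SMART self-test, allowing for a wrapped POH counter.

    poh:            power-on hours as currently reported by the drive
    last_test_poh:  power-on hours recorded with the last SMART self-test
    """
    # A self-test cannot have happened in the future, so a larger test
    # timestamp is taken to mean the POH counter has rolled over.
    if last_test_poh > poh:
        poh += ROLLOVER
    return poh - last_test_poh


if __name__ == "__main__":
    print(self_test_age_hours(293, 65_500))     # POH rolled over, test hours did not
    print(self_test_age_hours(400, 293))        # both rolled over, no correction needed
    print(self_test_age_hours(50_000, 49_900))  # no rollover at all
```

With the numbers from the example, a POH of 293 and a last test at 65500 gives a test age of 329 hours instead of a large negative number.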