Suspicious Disk Self-Test Log Errors

What !

Actually, @WiteWulf has a different kind of problem. Multi-Report should work fine for the drives which roll over the 64K value; in other words, there is no difference from how it currently works for you. That said, I should get the tracking of which drives have been tested working better.


Joe, your documentation is not bad. Actually, it is a lot better than some I deal with.


Morning all!

Right then…first off: apologies to @joeschmuck, my comments were not in any way meant to demean the quality of your documentation, just my interpretation of it. Your documentation is robust and comprehensive, far better than a lot of FOSS!

I had a load more SMART long tests fail over the weekend, and some seem to have gotten “stuck” running for many hours. I also accrued 24 checksum errors on the disk that had previously flagged a read error.

This morning I shut down and fully powered down the system, pulling the mains leads from the server and disk shelf. I reseated all disks and the SAS cable between the disk shelf and the HBA. The server booted up with no errors or warning lamps.

(As an aside: I thought I might as well run a memtest on the server while I had it down for maintenance, but was surprised to see this not included as an option on the grub menu. For an OS so reliant on ZFS, and memory integrity, I would have thought this was almost a necessity.)

Pool1 is now showing no topology or ZFS Health issues in the TrueNAS UI, and zpool status reflects that also. I’m running a scrub now on that pool just to be sure:

root@eurybia[/mnt/Pool1/homes/garyp]# zpool status Pool1  
  pool: Pool1
 state: ONLINE
  scan: scrub in progress since Mon Jan 12 09:56:20 2026
	11.1T / 38.0T scanned at 10.6G/s, 2.67T / 38.0T issued at 2.55G/s
	0B repaired, 7.03% done, 03:56:09 to go
config:

	NAME                                      STATE     READ WRITE CKSUM
	Pool1                                     ONLINE       0     0     0
	  raidz2-0                                ONLINE       0     0     0
	    ca05d5aa-54a6-40db-b4a3-4cc5dbc54077  ONLINE       0     0     0
	    6b398a80-7701-489a-89c7-32ef46778a63  ONLINE       0     0     0
	    1d77d204-4e2e-4a96-93c7-6d5eb890a22d  ONLINE       0     0     0
	    1f63f681-444e-4d1e-b1c0-3cd5868118b9  ONLINE       0     0     0
	    a077f95b-e2d7-49df-b4f5-9b8213a68bd9  ONLINE       0     0     0
	    f8f2fc2c-7030-463c-80f9-aec818154b21  ONLINE       0     0     0
	    5d735252-cf4b-4804-b7a6-441f9503070a  ONLINE       0     0     0
	    c86a5bfa-16c9-4384-9147-91e8f28a0c12  ONLINE       0     0     0
	    0fcac123-7533-4777-ac20-3c27f8fc242c  ONLINE       0     0     0
	    d97f4c42-0663-454b-81bc-79310c433431  ONLINE       0     0     0
	    3b0c4262-5606-4156-b41c-03d504a7fe09  ONLINE       0     0     0
	  raidz2-1                                ONLINE       0     0     0
	    ec7d3874-3c13-43fe-ae6a-0e56ace5cdcb  ONLINE       0     0     0
	    ec91f07e-a21a-43cd-8283-adc2bcc04c0a  ONLINE       0     0     0
	    afc7b971-e3e2-4f99-b40f-a059648d61c9  ONLINE       0     0     0
	    660934ad-af7e-4f43-911f-7181a5a3ad58  ONLINE       0     0     0
	    0b2bff21-acdb-446a-8036-c5cd1c5b5d2a  ONLINE       0     0     0
	    91312505-40cb-45d6-9068-d73aea270852  ONLINE       0     0     0
	    067d7be6-d336-42de-8b29-b8a129c882fd  ONLINE       0     0     0
	    94d0d685-0ffa-4828-a6c2-2b725044db59  ONLINE       0     0     0
	    fec62dd4-f536-4ade-97d1-11f334684e6a  ONLINE       0     0     0
	    633fd2d6-2e43-4098-8b3d-8c2297a9d656  ONLINE       0     0     0
	    295074ff-0941-409a-84fe-2607a800e24e  ONLINE       0     0     0
	  raidz2-2                                ONLINE       0     0     0
	    da683169-268b-42a3-89f1-4250dbdc910f  ONLINE       0     0     0
	    2df44b77-2c11-41cc-802c-6e21f524f82e  ONLINE       0     0     0
	    8273cbcd-97c7-4452-8c39-74eb032b7874  ONLINE       0     0     0
	    eb678a85-09c4-4e6f-a106-9cae17781890  ONLINE       0     0     0
	    65d90bdf-f7a2-4701-9d99-e1012fc0a9b6  ONLINE       0     0     0
	    5db54936-ac13-4343-8c0a-df702858f42c  ONLINE       0     0     0
	    a3ee9eab-f49b-484d-a78e-b46119f8caf1  ONLINE       0     0     0
	    0c5a32c6-5e5c-4345-9516-9d02f2b0c791  ONLINE       0     0     0
	    94c36df2-afd3-4aa6-96f4-01b8c1e890ce  ONLINE       0     0     0
	    d4429931-8911-4784-947a-887c28f1d23e  ONLINE       0     0     0
	    046bd705-222a-4201-9368-52dd00390cfb  ONLINE       0     0     0

errors: No known data errors

I don’t want to run a SMART test on one of the disks that’s part of the pool being scrubbed, so I’ve issued one on a spare disk that’s in the same disk shelf. The scrub should finish in about 3.5hrs, and the SMART test should finish in around an hour. I’ll update with the results later.

In Multi-Report, if you change SCRUB_Minutes_Remaining=60 to 1 minute, it will not run a SMART Long test on a pool performing a SCRUB; however, it will run a SMART Short test if a Long test was scheduled. The 0 value does not work properly: with it, no SMART tests will be run at all. I still need to fix that issue.

Of course, the default setting will not allow any SMART tests during a RESILVER on a pool.

As for the documentation, I personally can see problems. Maybe I’m too critical of myself? But I’d like to make it better. My plan is to make the next version's GUI configuration more obvious so the docs do not have to be extremely detailed.

I am glad cycling power “seems to” have worked. I hope all your testing passes. You just had too many identical problems across many drives, and that is very odd.


Yeah, so all done and I’m happy it’s working properly now.

The long SMART tests on the two unused disks in the disk shelf passed, the scrub completed with no errors, and long tests on two disks that previously failed repeatedly have now also passed.

This computer obviously got very upset about something, a warm reboot didn’t sort it, but a full cold boot of all systems (including the disk shelf) seems to have done the trick. What precipitated the problems? I honestly can’t say, but I’m still very suspicious of those power on hours counters all being at 65536 when this first started.


While I would say that this would be the first time a POH counter has caused such an error, I would not rule it out. It could have been a combination of the NETAPP and drives. I know nothing about the NETAPP hardware.

But I am glad to hear it all appears to be running well again. I hope it stays that way.


Thanks for all the help and advice, guys, much appreciated as always.

@joeschmuck

Do you know whether the issue is caused by the actual rollover process on some disks at 65K, by POH suddenly being zero, or by some disks simply being unable to handle larger numbers once they pass the 65K mark?

I have some HGST disks with a lot more hours than 65K and no problems. Best I can tell, the counter never rolled over.

@PhilD13

As I understand it, the manufacturer did not account for the longevity of the drive and the counter is only 16 bits wide so it rolls over.
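
To illustrate the rollover described above, here is a minimal Python sketch, assuming a plain 16-bit unsigned counter. The function name `reported_hours` is hypothetical; it is not taken from any drive firmware or from Multi-Report.

```python
# Assumption: the drive stores hours in a 16-bit unsigned field,
# so it can only hold values 0..65535 and wraps back to 0 at 65536.
COUNTER_BITS = 16
ROLLOVER = 1 << COUNTER_BITS  # 65536, the "64K" value


def reported_hours(true_hours: int) -> int:
    """Return what a 16-bit counter would report for a given true hour count."""
    return true_hours % ROLLOVER


if __name__ == "__main__":
    for hours in (65535, 65536, 65829):
        print(hours, "->", reported_hours(hours))
```

A drive with 65829 true hours would report 293 under this assumption, which is why a sudden drop in the reported value is a hint that a wrap occurred rather than that the drive is new.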

We see this in the POH counter and in the SMART Self-Test POH counter; those are the ones I have seen. A drive may have one of these issues, both, or none.

In Multi-Report, I try to adjust for this, as it can cause issues when tracking the testing. Imagine POH rolls over to 293 hours, but the SMART last-test hours value is 65500. That becomes a Warning issue for the Test Age. However, when I see that the SMART POH is greater than the POH, I add 64K to the POH and do the math. This only lasts as long as there is a discrepancy like this. If the SMART POH does roll over, then things are normal again as far as the script is concerned. If it does not roll over, then the 64K continues to be added to the POH for the math part of things.
