Quickly, the relevant setup:
TrueNAS SCALE 25.04.2.3
Several pools made of only 1 drive each.
The pool with errors has one 3.5" disk of 18 TB.
On the pool I have just a folder filled with files of about 80 GB each.
The reason for this setup is that I don’t really care about losing data; I only want to know if one file is corrupted and, in that case, replace it. In addition, all the files are read-only: once created they will only be read by applications.
This setup (with about 20 pools with the same structure) has been working for several months. Every 3 months each pool is scrubbed, and so far no errors were detected.
Yesterday one pool got suspended because of errors. I tried to run a scrub but I’m not able to complete it: after a while another error occurs and I’m forced to clear the pool again and resume the scrub.
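The cycle I keep repeating is roughly this (POOL_NAME is a placeholder for the actual pool):

sudo zpool clear POOL_NAME       # clear the error/suspended state
sudo zpool scrub POOL_NAME       # start or resume the scrub
sudo zpool status -v POOL_NAME   # watch until it errors out again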
When I run sudo zpool status -v POOL_NAME
I get:
errors: Permanent errors have been detected in the following files:
<metadata>:<0x0>
<metadata>:<0x1>
<metadata>:<0x34>
<metadata>:<0x68>
<metadata>:<0x182>
<metadata>:<0x85>
<metadata>:<0x187>
<metadata>:<0x88>
<metadata>:<0x18b>
<metadata>:<0x18f>
Followed by the list of all my files (208).
This looks strange to me: how could all the files be corrupted at once?!
Nothing noticeable happened to my system during these days…
In addition… when the files were generated I saved a simple CSV with the SHA-256 of every file, and recomputing it for a few files on that pool, the SHA-256 matches… I suspect it will match for all files.
Does that mean the metadata is corrupted? For all files?
I was expecting that maybe one file could be corrupted after some time… but not all at once!
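For completeness, the hash re-check I’m doing is basically this (assuming the saved CSV is in “hash  filename” format; otherwise it needs a small conversion first):

# recompute the SHA-256 of a single file and compare it by eye
sha256sum /mnt/POOL_NAME/myfolder/somefile.bin   # path is just an example

# or verify every file listed in the saved checksum file at once
sha256sum -c checksums.csv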
Is there some debugging I could do to really understand what is going on?
I have no problem formatting the drive and recreating all the files, but since I believe the files are still good I would like to really understand what is broken…
I would scrub the pools more often, like monthly. My own non-redundant media pool used to give me failed sectors, until I scrubbed that pool every 2 weeks. It has now gone years without failed sectors.
As for the metadata errors: this should not happen. ZFS is pretty serious about protecting data, even on non-redundant single-disk pools. As long as you have the defaults in your Datasets for:
DATASET redundant_metadata all default
then there should have been at least 2 copies of ALL Metadata, and for the more critical Metadata, 3 copies. Having more than 1 go bad in the space of 3 months does not sound right.
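You can verify the property with something like this (replace POOL/DATASET with your actual dataset):

zfs get redundant_metadata POOL/DATASET    # one dataset
zfs get -r redundant_metadata POOL         # or the whole pool, recursively

The VALUE column should show “all” with SOURCE “default”, matching the line quoted above.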
But more regular scrubs and detection would have allowed ZFS to find any bad copy of Metadata and, if another copy is good, automatically repair the bad copy (during the scrub, without user intervention, just a note in the pool status error listing).
In my own case I had 2 possible explanations for the failed sectors I was seeing:
1. Device sector infant mortality rate. Sectors that were going to go bad early on did go bad, and ZFS detected them. I manually restored the files and all was good. I even appear to have gotten a failed sector in redundant Metadata, because it auto-resilvered without the usual permanent error.
2. My original scrubbing was manual, as and when I remembered. So it could have been months, even 3 or more, between scrubs, and errors could have piled up. ZFS scrubbing causes both the mSATA SSD and the 2.5" HDD of the Media Pool stripe to detect sectors going bad, but the error detection & correction attached to the disk sectors allowed the sectors to be fully recovered and spared out.
As for which is right, I am leaning to number 2. But, checking those 2 devices now with SMART output, I can’t see many reallocated sectors.
My only “proof” is that this server and its devices are probably 10 or 11 years old, and for practical purposes have been on 24/7/365. So I would have thought they would have started failing more sectors, yet they have not…
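For reference, the SMART attributes I look at are roughly these (device names will differ on your system):

sudo smartctl -A /dev/sdX        # attribute table only
# the usual suspects for failing sectors:
#   5  Reallocated_Sector_Ct
# 197  Current_Pending_Sector
# 198  Offline_Uncorrectable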
Power? Not that, for sure: I fixed those problems months ago by adding a second power supply. Now I get enough power on the 5V line as well.
Cabling? I could check it… maybe I accidentally nudged the enclosure where the SFF-8643 to 4x SATA cables are connected…
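Before reseating anything I guess I could at least look at the link error counters on that drive, something like:

sudo smartctl -A /dev/sdX | grep -i crc    # attribute 199 (UDMA_CRC_Error_Count) usually points at cabling
sudo dmesg | grep -iE 'ata|sas|reset'      # kernel log: link resets / transport errors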
Controller? A bit harder to check; the other drives connected to it work fine… I also have additional fans to keep it cool. How would you check it, out of curiosity?
I guess I have all defaults… when creating the pool I just went with all defaults… How do I check if I get the line you shared?
I will also try your suggestion to decrease the scrub interval… maybe to 1 month, since each disk (and thus each pool) is between 18 TB and 22 TB, and it takes 1 full day per drive to do a scrub…
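With about 20 pools at roughly 1 day of scrub each, I would probably stagger them rather than start them all together. TrueNAS has scrub tasks in the GUI for this, but conceptually it would be something like (pool names are placeholders):

# one pool per day of the month, at 02:00
0 2 1 * *    zpool scrub pool01
0 2 2 * *    zpool scrub pool02
# …and so on for the remaining pools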
Anyway, thanks all for the feedback… In the meantime I can try swapping the drive into an empty slot, which will be connected with a different connector; not sure if it goes to a different controller, but I can check (I have 2 LSI SAS 9300-16i boards and each board should have 2 controllers).
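To see which board/port the disk currently sits on, I think the by-path names should be enough, e.g.:

ls -l /dev/disk/by-path/    # the pci-…/sas-… part of each symlink shows which HBA a disk hangs off
lspci | grep -i sas         # lists the HBAs themselves with their PCI addresses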
I will run a scrub and see what the result is…
I guess a way to tell if the problem is on the drive would also be to connect it to a desktop PC (SATA port) with Linux and run a scrub there (I have ZFS installed on it). If no errors are found, I can assume the problem is not on the drive, right?
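On the desktop that would be roughly:

sudo zpool import                  # list pools visible on the attached disk
sudo zpool import -f POOL_NAME     # -f because the pool was last used by the TrueNAS box
sudo zpool scrub POOL_NAME
sudo zpool status -v POOL_NAME     # check progress and errors when done
sudo zpool export POOL_NAME        # export before moving the disk back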