Using the latest Core, I ran a scrub on a new pool (z1 format, 12TB HDs) and got the following error, which, despite searching online, I have not figured out how to address. Excerpt from zpool status -v:
errors: Permanent errors have been detected in the following files:
Is this the only error from zpool status -v?
It seems odd that you would have this error and no other errors listed, unless you already cleared them using the zpool clear poolname command.
I fear you may have metadata corruption, but let someone who knows more than I do about this topic chime in. If it is metadata corruption, then the path forward is not a popular answer. Do you have a backup of your data? If not, grab what you can now. You may be destroying the pool and rebuilding. AGAIN, wait for confirmation just in case there is a different solution.
A listing of your hardware may help locate possible causes of the problem. Maybe an overheated HBA? And please provide the entire output of any commands; partial output forces us to make assumptions, and who wants an assumption as an answer?
This is a newly created pool of 3 12TB HDs. As is my practice, a day or two after creating a new pool & copying data to it, I run a scrub. The initial scrub found a corrupted file, which I then deleted from the pool & copied over from an external source & ran zpool clear. A new scrub then produced just the error above. I’ll try letting it sit for a bit, then start a new scrub. This is a consumer MB with an Intel i3-4130, 16GB non-ECC RAM, booting from a 60GB SATA SSD. All SATA ports are via the MB.
From what I read, I thought metadata corruption a likely cause but have not discovered how to confirm it.
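One rough way to check is to look at what the entries under the error header in zpool status -v actually point at. Below is a minimal Python sketch, assuming that ordinary entries are file paths and that entries ending in an object number such as `<metadata>:<0x0>` are objects ZFS could not map back to a path. The helper name `classify_errors` and the sample text are made up for illustration; the sample is not output from this pool.

```python
import re

# Assumed format: an entry ending in ":<0x...>" names an object by number
# because ZFS could not resolve it to a file path, e.g. "<metadata>:<0x0>".
OBJECT_REF = re.compile(r":<0x[0-9a-fA-F]+>$")

HEADER = "errors: Permanent errors have been detected in the following files:"


def classify_errors(status_text):
    """Split the permanent-error list into file paths and object references."""
    result = {"files": [], "objects": []}
    in_list = False
    for line in status_text.splitlines():
        entry = line.strip()
        if entry == HEADER:
            in_list = True  # everything after the header is an error entry
            continue
        if not in_list or not entry:
            continue
        if OBJECT_REF.search(entry):
            result["objects"].append(entry)
        else:
            result["files"].append(entry)
    return result


# Hypothetical sample, NOT real output from the pool in this thread.
SAMPLE = """\
errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/example.bin
        <metadata>:<0x0>
"""

if __name__ == "__main__":
    print(classify_errors(SAMPLE))
```

If only plain file paths show up, deleting and restoring those files is usually enough; entries in the "objects" list are the ones that hint at metadata damage and deserve a closer look.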
@jlpellet
As I understand it, typically a metadata corruption is caused by a hardware failure. I will not say that is always true, because I don’t know.
I highly recommend you run MemTest86+ on your system for several days to rule out a possible memory issue. One complete pass is never enough; go for at least 5 complete passes. That is my criterion.
If that passes, run a CPU stress test for maybe 6 hours or longer. You want to heat up the motherboard and CPU and see whether the system remains stable or intermittent problems appear, which would likely indicate a poor solder joint. I know of people who have run a CPU stress test for 1 month to validate the hardware. This is likely more common in a corporate environment where data is critical.
The fact that you are having several corruption errors so quickly makes me think you have a hardware issue.
What are the HDD model numbers? Are they SMR drives?
Thanks. These HDs are something of an experiment for me. They are certified server pulls with private labels & SNs, advertised as CMR. The OEM SMART data has been wiped.
My intent is to do the following:
After letting the drives run a few days, I’ll run another scrub. If it passes, just stop.
If it throws errors, I’ll wipe the pool & move the drives to a different SATA controller with different cables, let it burn in & scrub again.
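The pass/fail check in that plan could be sketched as a small Python helper that reads the scan: line of zpool status. This is only a sketch, assuming the line has the common form "scrub repaired ... with N errors on ..."; the exact wording can differ between OpenZFS versions, and the function names `scrub_result` and `scrub_passed` are made up for illustration.

```python
import re

# Assumed wording of a finished scrub, e.g.
# "scan: scrub repaired 0B in 07:41:12 with 0 errors on Sun Mar  3 ..."
SCAN_DONE = re.compile(r"scan:\s+scrub repaired (\S+) in .+? with (\d+) errors")


def scrub_result(status_text):
    """Return (repaired, error_count) for a finished scrub, or None."""
    match = SCAN_DONE.search(status_text)
    if match is None:
        # Scrub still running, never run, or a format this sketch does not know.
        return None
    return match.group(1), int(match.group(2))


def scrub_passed(status_text):
    """True only when a finished scrub reports zero errors."""
    result = scrub_result(status_text)
    return result is not None and result[1] == 0
```

Feeding it the full text of zpool status for the pool gives a simple "stop here" or "move to the other controller" answer, though the errors: section at the bottom is still worth reading by eye.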
Thanks again for the insight.
John
I don’t think the physical drives are at fault. It could be any other piece of hardware though. It may be much faster to boot up MemTest86+ (free) and let that run for a few days. I think you will make better use of your time. If it passes, you have ruled one thing out as not the problem. Again, at least 5 complete passes; one is not enough. If it takes 5 days, then it takes 5 days. What is your data worth?
I understand you are running an experiment, but others who read this need to understand the importance of ensuring a solid hardware system exists. I know you get it.
Best of luck and I hope that whatever is causing the issues, you find it.
Some rounds of MemTest86(+) are definitely a good idea.
Your error description above points in this direction.
CPU (memory controller), Mainboard or RAM could have issues.
Thanks for the suggestions regarding hardware issues. I will pursue them, but as additional info, this system has been up for over a year without problems, running a 5-disk z2 pool. The only change was replacing the z2 5x8 TB pool with a z1 3x12 TB pool. I do have multiple data copies, both online & offline, so this is just experimenting with a spare system. I do note the pool change used different MB SATA ports & cables, so I would expect that, if a hardware problem is the root cause, it is more likely in the SATA controller & its interaction with the HD firmware. Also, as background, I was able to dig deeper on the HDs & they seem to be Seagate Skyhawk disks, which are CMR per Seagate.
Thanks again for the help.
John
Just an update. As suggested, I exported the pool, rebooted the system, & imported it (by applying the saved config). After letting the system run 24 hours, I ran another scrub, which just finished, with no errors. I’ll continue to monitor the system but see no reason to expect a problem. Thanks again for the suggestions.
John