Zpool scrub error fix?

jlpellet · July 6, 2025, 2:22am

Using the last Core release, I ran a scrub on a new pool (z1 format, 12TB HDs) and got the following error, which I have not figured out how to address despite searching online. Excerpt from zpool status -v:

errors: Permanent errors have been detected in the following files:

    z2/Stores:<0xca25>

Any suggestions appreciated.

Thanks,
John

joeschmuck · July 6, 2025, 12:34pm

Is this the only error from zpool status -v ?
It seems odd that you would have this error and no other errors listed, unless you already cleared them using the zpool clear poolname command.

I fear you may have metadata corruption, but let someone who knows more than I do about this topic chime in. If it is metadata corruption, then the path for you is not a popular answer. Do you have a backup of your data? If not, grab what you can now. You may be destroying the pool and rebuilding. AGAIN, wait for confirmation just in case there is a different solution.

A listing of your hardware may help locate possible causes of the problem. Maybe an overheated HBA? And please provide the entire output of any commands; partial output forces us to make assumptions, and who wants an assumption as an answer?

QonoS · July 6, 2025, 12:35pm

Assuming you have backup:

You could try exporting and re-importing your pool to see if the errors remain.

Otherwise zdb diagnostics might be your friend.
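
For anyone following the zdb suggestion: an entry such as z2/Stores:<0xca25> names a dataset and a hexadecimal object number rather than a file path, which is typically what ZFS prints when it cannot resolve the object to a filename. Below is a minimal Python sketch, assuming that dataset:<0xHEX> layout, which splits the entry and builds a zdb command to inspect the object. The helper names are hypothetical, and the "zdb -dddd dataset object" form is an assumption based on my reading of the zdb man page, so check it on your own system first. The command is only printed, never run.

```python
import re

# Matches entries like "z2/Stores:<0xca25>" (dataset, then a hex object number).
ENTRY_RE = re.compile(r"^(?P<dataset>[^:]+):<0x(?P<obj>[0-9a-fA-F]+)>$")


def parse_error_entry(entry):
    """Return (dataset, object_number), or None if the entry does not match."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        return None
    return match.group("dataset"), int(match.group("obj"), 16)


def zdb_command(entry):
    """Build (but do not run) a zdb command for the object named in the entry."""
    parsed = parse_error_entry(entry)
    if parsed is None:
        raise ValueError(f"not a dataset:<0x...> entry: {entry!r}")
    dataset, obj = parsed
    # Assumption: zdb accepts "-dddd <dataset> <object number>" in this form.
    return ["zdb", "-dddd", dataset, str(obj)]


if __name__ == "__main__":
    print(" ".join(zdb_command("z2/Stores:<0xca25>")))
```

An entry that is an ordinary file path returns None from parse_error_entry, which is the case where you can simply restore that file from backup.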

jlpellet · July 6, 2025, 3:19pm

This is a newly created pool of 3 12TB HDs. As is my practice, a day or two after creating a new pool & copying data to it, I run a scrub. The initial scrub found a corrupted file, which I then deleted from the pool & copied over from an external source & ran zpool clear. A new scrub then produced just the error above. I’ll try letting it sit for a bit, then start a new scrub. This is a consumer MB with an Intel i3-4130, 16GB non-ECC RAM, booting from a 60GB SATA SSD. All SATA ports are via the MB.

From what I read, I thought metadata corruption was a likely cause but have not discovered how to confirm it.

Thanks,
John

joeschmuck · July 6, 2025, 4:24pm

@jlpellet
As I understand it, typically a metadata corruption is caused by a hardware failure. I will not say that is always true, because I don’t know.

I highly recommend you run MemTest86+ on your system for several days to rule out a possible memory issue. One complete pass is never enough; go for at least 5 complete passes. That is my criterion.

If that passes, run a CPU stress test for maybe 6 hours or longer. You want to heat up the motherboard and CPU, see if the system remains stable or some intermittent problems happen, indicating likely a poor solder job. I know of people who have run a CPU stress test for 1 month to validate the hardware. This happens likely more in a corporate environment where data is critical.

The fact that you are having several corruption errors so quickly makes me think you have a hardware issue.

What are the HDD model numbers? Are they SMR drives?

jlpellet · July 6, 2025, 6:02pm

Thanks. These HDs are something of an experiment for me. They are certified server pulls with private labels & SN, advertised as CMR. The OEM SMART data has been wiped.

My intent is to do the following:

After letting the drives run a few days, I’ll run another scrub. If it passes, just stop.
If it throws errors, I’ll wipe the pool & move the drives to a different SATA controller with different cables, let it burn in & scrub again.
Thanks again for the insight.
John

joeschmuck · July 6, 2025, 6:50pm

I don’t think the physical drives are at fault. It could be any other piece of hardware though. It may be much faster to boot up MemTest86+ (free) and let that run for a few days. I think you will make better use of your time. If it passes, you have ruled one thing out as the problem. Again, at least 5 complete passes; one is not enough. If it takes 5 days, then it takes 5 days. What is your data worth?

I understand you are running an experiment, but others who read this need to understand the importance of ensuring a solid hardware system exists. I know you get it.

Best of luck and I hope that whatever is causing the issues, you find it.

QonoS · July 6, 2025, 7:31pm

Some rounds of MemTest86(+) are definitely a good idea.
Your error description above points in this direction.
The CPU (memory controller), mainboard or RAM could have issues.

jlpellet · July 6, 2025, 8:47pm

Thanks for the suggestions regarding hardware issues. I will pursue them, but as additional info, this system has been up for over a year without problems, running a 5-disk z2 pool. The only change was replacing the z2 5x8 TB pool with a z1 3x12 TB pool. I do have multiple data copies, both online & offline, so this is just experimenting with a spare system. I do note the pool change used different MB SATA ports & cables, so I would expect that, if a hardware problem is the root cause, it is more likely in the SATA controller & its interaction with the HD firmware. Also, as background, I was able to dig deeper on the HDs & they seem to be Seagate Skyhawk disks, which are CMR per Seagate.
Thanks again for the help.
John

jlpellet · July 8, 2025, 2:24am

Just an update. As suggested, I exported the pool, rebooted the system, & imported it (by applying the saved config). After letting the system run 24 hours, I ran another scrub, which just finished, with no errors. I’ll continue to monitor the system but see no reason to expect a problem. Thanks again for the suggestions.
John
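
The export, import and re-scrub check described in this update can be sketched in Python. This is a dry-run illustration rather than a tested procedure: the pool name z2 is inferred from the error entry earlier in the thread, the function names are hypothetical, and the commands are only built as lists, not executed. The has_data_errors helper assumes the usual "errors: No known data errors" line that zpool status prints for a clean pool.

```python
def recheck_commands(pool):
    """Return the command sequence for an export/import followed by a fresh scrub."""
    return [
        ["zpool", "export", pool],
        # The poster rebooted between export and import; that step is manual.
        ["zpool", "import", pool],
        ["zpool", "scrub", pool],
        ["zpool", "status", "-v", pool],
    ]


def has_data_errors(status_text):
    """Return False only if the 'errors:' line reports no known data errors."""
    for line in status_text.splitlines():
        if line.strip().startswith("errors:"):
            return "No known data errors" not in line
    # No 'errors:' line found; flag it so the output gets a manual look.
    return True


if __name__ == "__main__":
    for cmd in recheck_commands("z2"):
        print(" ".join(cmd))
```

Feeding the saved text of zpool status -v into has_data_errors after the scrub finishes gives a simple pass/fail that could go into a periodic check.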
