Pool [mypool] state is ONLINE: One or more devices has experienced an error resulting in data corruption in metadata. Applications may be affected

Hello there,

In my pool8drives pool, a 3 x MIRROR | 2 wide layout of 3 TB SATA drives (see my signature for the details), one of the drives failed.

So I replaced it with another SATA drive, but of 4 TB capacity (yes, I checked the Treat Disk Size as Minimum box when I created the pool).

The replacement went okay; all the drives appear online and available, except for the error TrueNAS SCALE is now reporting on all the drives (the error wasn't present before the drive replacement, even with the failed drive).

Pool status

What would be the best fix or approach to this?

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x3d>
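
For reference, the listing above is what the verbose pool status reports; from a shell on the NAS the equivalent would be (using my pool name):

    # Show pool health, per-device error counters, and the list of
    # objects with permanent errors (like the <metadata>:<0x3d> entry above).
    zpool status -v pool8drives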

Thanks!

I don’t think you are going to like the answer, unless there is something new that I’m unaware of, and that could be true.

Typically a metadata error would require you to destroy your pool and rebuild it, then restore your data from a backup.

Wait for someone like @etorix, @HoneyBadger, or @Arwen to comment; anyone else but me.

If you do have a backup, you should be fine; if you don't, I'd try to copy all the data you can before you have to take drastic measures.

Best of luck

Have you performed a pool scrub since replacing the failed drive?

Pool defaults include redundant_metadata=all, so unless it's been corrupted across all three vdevs, it should be able to rebuild.

When set to all, ZFS stores an extra copy of all metadata. If a single on-disk block is corrupt, at worst a single block of user data (which is recordsize bytes long) can be lost.
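
If you prefer the shell to the web UI, something along these lines should do it (pool name assumed from your post):

    # Confirm the metadata redundancy setting (the default is "all").
    zfs get redundant_metadata pool8drives

    # Start a scrub so ZFS re-reads every block and repairs what it can
    # from the redundant copies.
    zpool scrub pool8drives

    # Watch progress and see whether the permanent-error list clears.
    zpool status -v pool8drives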

I don’t think you are going to like the answer,

Probably not, but it’s the same as said here
https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A/

It's not the first drive I've replaced (I've done it on my other TrueNAS box), but in this case I was shocked that a simple drive replacement had such a tragic outcome.
Is it really that common to toast a pool just by replacing a drive?

What if I try to replace the file with the damaged metadata, supposing I'm able to find it?

I do have a bunch of snapshots (on the same NAS) and configuration file backups (elsewhere).
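
If it does come down to restoring an individual damaged file, my understanding is that a known-good copy can be pulled straight out of a snapshot; a rough sketch, with the dataset, snapshot, and file names made up for illustration:

    # Snapshots are exposed read-only under the hidden .zfs directory
    # of each dataset (all names below are hypothetical).
    ls /mnt/pool8drives/mydataset/.zfs/snapshot/

    # Copy the good version from the snapshot back over the damaged file.
    cp -a /mnt/pool8drives/mydataset/.zfs/snapshot/auto-2024-01-01/somefile \
        /mnt/pool8drives/mydataset/somefile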

Nope, just resilvered as part of the drive replacement procedure

Yes, I am surprised that a server grade system, (AMD Epyc with ECC RAM, and LSI HBAs), would have this problem.

No, this is way outside of normal.

However, I would suggest running a days-long memory test. And if you can't take that much downtime, then at least an hours-long one.
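
A bootable memtest86+ stick is the usual tool for that. If you can't take the system offline right away, the userspace memtester utility, if it's present on your install, can at least exercise part of the RAM from the running system; a minimal sketch, with the size picked arbitrarily:

    # Test 8 GiB of RAM for 2 passes; run as root so the memory can be
    # locked. It only covers what it can allocate, so it is no substitute
    # for a full offline memtest86+ run.
    memtester 8G 2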

There are potentially two causes. First, overheating of the HBA controllers, which could easily affect all of the attached disks. Second, a memory error that caused a bit flip in ZFS metadata; when that metadata was then written to two different vdevs, both copies would be corrupt.

If an error clear & scrub does not make the error go away, then perhaps rolling back to an earlier transaction might help. But, then again, there may not be any good ZFS transaction group to roll back to.
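
From the command line that sequence would look roughly like this, with the pool name assumed; note that the rewind import is strictly a last resort and can discard the most recent writes:

    # Clear the logged errors, then scrub and re-check the status.
    zpool clear pool8drives
    zpool scrub pool8drives
    zpool status -v pool8drives

    # Last resort only: export, then attempt a recovery (rewind) import
    # that discards the last few transactions to reach a consistent state.
    zpool export pool8drives
    zpool import -F pool8drives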

Initiate a scrub from the Storage tab in the ZFS Health pane:

It did!

I first ran a clear, and the checksum errors disappeared,

but the metadata error remained.

So I launched an overnight scrub, and all returned to normal

So everything sounds normal now…

Big thanks to Arwen, HoneyBadger, and joeschmuck for your input!

Great, glad everything is good now.

This does tell us something. The likely cause was an HBA / disk controller problem, perhaps overheating. If either HBA seems to have restricted airflow, you might look at adding additional cooling via internal fans.
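
The HBA's own temperature often isn't exposed, but spot-checking the temperatures of the drives behind it at least hints at an airflow problem; for example (device name hypothetical):

    # Report SMART attributes for one of the pool's disks and pick out
    # the temperature line.
    smartctl -a /dev/sda | grep -i temperature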