ZFS Metadata - How it works, plus discussion on corruption

Something has been happening here occasionally, ZFS Metadata corruption.

The tl;dr is that ZFS Metadata corruption should not be occurring unless the user has a non-redundant pool AND has changed the default of redundant_metadata=all. Or has bad hardware. Certainly not at the rate we seem to be seeing here in the TrueNAS forums.


Some history and design notes.

Sun seemed to consider standard Metadata, like directory entries, more important than regular data. This makes sense in that a single bad disk block in the directory tree could take out an entire file. Having 2 or more copies of this standard Metadata allows that bad block to be worked around. Even repaired! On a non-redundant pool!!!

Note that this extra copy of standard Metadata is in addition to any vDev redundancy in the pool. For example, a simple 2 way Mirror would end up with 4 copies of standard Metadata, 2 per Mirror device.

There is a ZFS Dataset property that can reduce the overhead of this Metadata. In general I can’t see many people changing this from the default. See this manual page and the redundant_metadata entry for details:
https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops.7.html#redundant_metadata

As for critical Metadata, by default there are 3 copies, again because loss of critical Metadata could impact much more than a single file.

One last design note on ZFS Metadata. If the pool consists of more than 1 vDev, the extra copies are spread out. So a pool of two 2 way Mirror vDevs would have a standard Metadata copy on each 2 way Mirror. Besides the redundancy effect of spreading the Metadata around, this causes a more balanced usage of the vDevs.
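The copy counts above can be worked through as plain arithmetic. This is only a sketch of the counting described in this post, for Mirror vDevs; the function name is made up for illustration, and RAID-Zx (which uses parity rather than whole copies) is not covered.

```python
def physical_copies(zfs_copies: int, mirror_width: int = 1) -> int:
    """Copies on disk = copies ZFS keeps of the block x devices in each Mirror.

    mirror_width=1 stands for a single, non-redundant disk.
    """
    if zfs_copies < 1 or mirror_width < 1:
        raise ValueError("both counts must be at least 1")
    return zfs_copies * mirror_width


# Default copy counts as described in this post.
STANDARD_METADATA_COPIES = 2
CRITICAL_METADATA_COPIES = 3

for label, width in (("single disk", 1), ("2 way Mirror", 2)):
    standard = physical_copies(STANDARD_METADATA_COPIES, width)
    critical = physical_copies(CRITICAL_METADATA_COPIES, width)
    print(f"{label}: standard={standard}, critical={critical}")
```

So even the single disk case keeps 2 copies of standard Metadata, which is why a bad block there can be repaired on a non-redundant pool.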


So how do we get ZFS Metadata corruption?

It may seem simple: a power outage caused it. Lots of people in recent years seem to have pool import problems immediately after a power loss. But it is wrong to blame the power loss directly:

On other file systems, potentially yes, an OS crash or power loss could corrupt a file system. But, ZFS was specifically designed to avoid this problem. Data is either fully written, or not. No in between.

So, back to how?

The known causes are these:

  • Non-ECC RAM caused a bit flip after the Metadata block was created & check-summed in RAM, but before it was written to storage, (which affects both copies).
  • Some LSI HBAs can seem to write corrupt data when they overheat. (Note that wording, “can seem to”…).
  • Hardware RAID controllers that do elevator seeking & writing, AND a power loss occurs during such an event. (Thus, “it worked for years without problems!!!” But, of course it did, this is a rare event!)
  • Power supplies that are on the edge of reliability.
  • Multi-disk USB enclosures may have hardware RAID controller chips, though generally reduced functionality.
  • Use of USB attached storage which might have firmware or logic bugs that cause problems.
  • It appears, (again note the word “appears”), that SATA Port Multipliers are not well supported software wise. So that perhaps they can cause problems too.
  • And the rare, but possible ZFS bug

I can’t think of any more at present. Add comments if you think of a real potential cause.


A Guess on one cause

My personal belief, without real evidence, is that SOME, (and I truly mean a few, definitely NOT all), ZFS Metadata corruptions could be caused by transient memory errors. Here in the forums we have quite a few people running with Non-ECC RAM, so it is possible. Remember, I said that I don’t have real evidence, just a hunch.

Part of the reason I say this, is that the Enterprise users don’t appear to have the same Metadata corruption problem. Otherwise there would be a lot of complaining on that side.

With TrueNAS being one of the bigger free small business & home NASes that use ZFS, this makes some sense. Some people are building their TrueNASes with consumer hardware that does not have ECC RAM support. Even when system boards do support ECC RAM, this also requires the user to select a CPU with such support AND buy ECC RAM. Then hope the BIOS implements ECC RAM correctly.

Server grade hardware would, I assume, have fully tested ECC RAM.


Personal experience with Metadata redundancy

Something odd happened to me a few years ago. My low power, miniature media server has 2 storage slots, one a mSATA SSD and the other a 2.5" disk bay. I installed a 1TB mSATA SSD and a 2TB HDD in this computer. Because the OS was going to be small compared to the media, I took a 50GB piece from each and made a Mirrored root pool for the OS.

However, with good backups I did not see the need to have redundancy on my media. So I striped the remaining space of the 2 storage devices. Occasionally over the years I lost a video file, which was statistically more likely due to their size. ZFS Scrubs told me which file and I would restore it from backups.

One day I noticed a read error, with a short resilver, (if I remember correctly). But this was not accompanied by a file name. In fact, the pool stated errors: No known data errors. I puzzled over this for a while. Eventually I figured it was likely that the block was in redundant Metadata.

In some ways I wish I had better reporting from ZFS. Maybe ZFS does log these types of errors. But that was a one time event that I did not investigate more thoroughly.


Afterword

Your thoughts?
Any useful info to add?
Suggestions on things I should fix?

5 Likes

I think there must be the link.

I think you meant – after the metadata block was created but before it was check-summed and written to storage.

Another causation vs. correlation issue might be the quality of the PSU, UPS, etc. Folk willing to spring for ECC systems likely have more server-grade equipment as well - with the right cooling for the HBA, for example.

This is likely an unknowable problem short of having backblaze-like epidemiological datasets that feature thousands of systems and some sort of control for figuring out the actual reason: a bit flip, an overheated HBA writing garbage, power loss, etc. :grimacing:

I don’t believe any had UPS systems

The latest case had a flaky UPS. (And a stripe pool with an ASM1166 controller.)

Somehow it would be comforting if all these cases derived purely from “bad” hardware—except for the Linux crowd which would then be told to stick strictly to server-grade components instead of throwing any hardware they have at hand because “Linux has a driver for it”.
But some people had already been using non-ECC RAM, the occasional SMR drive and SATA controllers with FreeNAS/TrueNAS CORE, and I don’t remember that we ever had such issues, or certainly not so often.

And how can ZFS fail to update labels on some drives in a pool for hours, if not for days? :scream: Purely from hardware failure? Without the pool being faulted?

“Eliminate the impossible, and what is left, however improbable, must be the truth.”

3 Likes

Done, added link to OpenZFS zfsprops manual page.

No, I did not. If the bit flip occurred before the checksum, ZFS would not have reported it as an error, because the checksum would say that it is good. Then attempt to use the “bad” data as normal, perhaps causing a crash. Those error conditions probably happened too.
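The difference between the two timings can be shown with a toy model. This is only a sketch: SHA-256 from Python's standard library stands in for whatever checksum the pool actually uses, and the block contents and helper names are invented for illustration.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # Stand-in checksum; ZFS's own checksum algorithms are not modeled here.
    return hashlib.sha256(block).digest()


def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of block with one bit inverted, like a transient RAM error."""
    data = bytearray(block)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


def verify(block: bytes, stored_checksum: bytes) -> bool:
    return checksum(block) == stored_checksum


original = b"directory entry: file -> object 42"

# Case 1: bit flips AFTER the checksum was computed, BEFORE the write.
stored = checksum(original)
in_ram = flip_bit(original, 5)
copies_on_disk = [in_ram, in_ram]  # both copies are written from the same buffer
detected_after = not any(verify(c, stored) for c in copies_on_disk)

# Case 2: bit flips BEFORE the checksum is computed.
damaged = flip_bit(original, 5)
stored_for_damaged = checksum(damaged)
detected_before = not verify(damaged, stored_for_damaged)

print("flip after checksum, no good copy left:", detected_after)
print("flip before checksum, error detected:", detected_before)
```

In the first case every copy fails verification, so there is nothing good to repair from. In the second case the checksum matches the damaged block, so it is accepted as good data.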

2 Likes

Again UPSes are not required for ZFS on-disk consistency;

Of course, UPSes prevent corruption caused by poorly designed hardware, like hardware RAID controllers which use elevator seeks & writes.

Yeah, you’re right. My bad.

Hmm, I wonder if the Linux storage driver is holding back writes of certain data because it sees that it is constantly being updated. So it wants to optimize out those extraneous writes.

Now it might not be the SD driver but the lower level device driver that is not doing the correct thing.

To be clear, I have seen this type of excrement done elsewhere. A perfectly functioning concept gets corrupted into a supposedly better and faster method that does not work properly when the rare case happens.

1 Like

I’ve had weird issues crop up due to a bad PSU. I have no doubt a bad UPS could make the PSU fail also.

For example, years ago one of my PSUs was fried due to incessant power interruptions. A bad UPS might manage that also.

A lot of factors go into component longevity and I was really surprised when my Seasonic 750W, Ti grade PSU started failing and creating all sorts of weird errors even as the maximum system load was 120W, there were zero voltage spikes, a UPS was there to prevent brownouts, and so on.

But it happens. In my case it thankfully did not lead to corruption.

Just some general thoughts/personal opinions.

I’m not a UPS evangelist, but this isn’t quite the whole story.

One point I will say in this regard. From a hardware perspective, regardless of software safety belts, weird things can happen.

One example: if you have a disk flying around at 7200RPM and the read/write head is seeking back and forth a lot…and…suddenly…the power cuts, bad things can happen. Hard drive manufacturers are of course aware of this and have capacitors and things in place to help stop the surge from exploding things…but they can only work so well given the engineering constraints of the 3.5" form factor.

Then there is the fairly common practice of using flash SSDs without PLP (power loss protection). SATA SSDs of all form factors exist without PLP, and most M.2 2280 NVMe drives don’t have it. These drives can exhibit undefined behaviors, especially when they are very busy and then have a sudden cut in power.

Software cannot fix physics, unfortunately. When you build a NAS with consumer hardware, the story goes a lot deeper than “ECC” being an issue.

And just to be clear, I am not in any way saying you should NOT build a NAS with consumer hardware. I am merely saying that in doing so, you are at a higher level of risk for undefined behaviors.

Even the best hardware designed for 24/7 operation can have fatal flaws, take a look at the Atom C2000 bug situation a few years back. The Intel Atom C2000 Series Bug - Why it is so quiet

I’m just saying, in general, these sorts of corruption issues (or at least the ones I have looked at specifically) can very much be explained by some sort of hardware level issue, and some of the time, it coincides with a power failure event.

1 Like

I wonder what kind of errors there were and how you found out that it was the PSU.

I’ll have to look for that in the old forum. Hang on a sec.

Your wish is my command. There is a follow on titled descent into unhappiness as electrical gremlins made ZFS declare my pool dead. So fun.

2 Likes

Failing PSUs can cause issues regardless of the power draw. All the sorts of internal compensation circuits and capacitors and stuff would all be sized relative to the maximum draw of the power supply.

You can, however, find a lot of crappy power supplies out there. When I was like 18 or 19 I had one that caught on fire, blue magic smoke with that electronics smell, and ever since I stopped buying cheap-o ones.

I generally really like Seasonic and Corsair ones because they are generally available and good quality, even on their lower end offerings. (This is not a paid promotion or an official endorsement of those companies or something, I just like their stuff personally)

I miss Jonnyguru. I am a nerd :slight_smile:

2 Likes

There is a PSU tier list, btw. Can’t say anything about its reliability…

1 Like

Yes, there are plenty of these that exist out there still. LTT has one that’s pretty actively maintained, as another example. https://www.lttlabs.com/categories/power-supplies

But the moral of the story is, if you are delivering unstable voltage to a drive (regardless of solid state vs mechanical), or PCI-E cards (HBAs) or CPU/RAM…

Weird things can happen. Again. I am not saying that weird things can’t happen in software, just that the examples I have looked at personally all seem to be hardware related in general.

Insert relevant old thread: Proper Power Supply Sizing Guidance | TrueNAS Community

Meta-Issues To Consider

You’ve probably come to FreeNAS for its awesome data protection and storage resiliency features. You’ve hopefully learned that you want server-grade gear, and that you want ECC memory, and that you want redundancy in your storage system. But another question you should ask yourself, how long do you want your storage system to last? Most users are looking to create a storage platform that won’t be obsolete next year, and in fact usually want it to last as long as it can.

Your power supply ends up being one of the most complicated bits of electrical engineering in your system. You want to pick a power supply that has high quality components, because a failure of a component could mean anything from power loss to voltage sag to high voltage being fed through to your low voltage computer parts. These are very bad things!

Further, because we want the power supply to work as well in five (or even ten) years as it does today, we have to consider that as components age, their ability to perform slowly degrades. This degradation is made worse if a component is stressed out to near (or past) its specification. In the world of electronics, we typically cope with this using a principle known as derating. This simply means that, for example, if you needed a supply that can deliver 300 watts, you get a 400 watt supply. A typical high quality modern power supply delivers a fairly consistent level of efficiency when loaded between 20%-80% of its rated capacity, so you’re not “saving lots of power” by getting a 300 watt supply for a 300 watt load. The rule of thumb in the shop here is that a power supply should never be pushed beyond 80% of its rated capacity.
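The derating rule of thumb in the quoted guide is simple arithmetic, sketched below. The function names and the list of available supply sizes are assumptions made up for the example; only the 80% ceiling and the 300 watt to 400 watt example come from the guide.

```python
import math


def minimum_rating(load_watts: float, max_load_percent: int = 80) -> int:
    """Smallest rating (in watts) that keeps the load at or below the ceiling."""
    return math.ceil(load_watts * 100 / max_load_percent)


def pick_supply(load_watts: float,
                available=(300, 400, 500, 650, 750),  # hypothetical product sizes
                max_load_percent: int = 80) -> int:
    needed = minimum_rating(load_watts, max_load_percent)
    for rating in sorted(available):
        if rating >= needed:
            return rating
    raise ValueError(f"no listed supply reaches {needed} W")


def load_percent(load_watts: float, rating_watts: float) -> float:
    return 100 * load_watts / rating_watts


print(minimum_rating(300))     # watts needed to keep a 300 W load at 80% or less
print(pick_supply(300))        # next size up from the hypothetical list
print(load_percent(120, 750))  # the 120 W on 750 W case mentioned earlier in the thread
```

A 300 watt load needs at least 375 watts of rating under the 80% rule, which lands on the 400 watt supply from the guide's example.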

@Arwen thanks for the posting. Metadata corruption of course has been on my mind as of late due to the uptick in postings about metadata corruption.

I suspect most of the metadata problems are due to not using good hardware, most specifically not using ECC RAM, but we know there has been an ECC case recently as well. Was it bad RAM? I don’t know and didn’t follow that thread.

While I have never had one of these issues, and hope to never have one, it would be nice to know what people can do to mitigate the possible metadata corruption from happening. I’d rather add another metadata record than remove one, if that were a prudent thing to do.

I’m glad there are quite a few of you folks out there that understand ZFS. I have a basic idea but I do not “work” with it very often so my knowledge level is low.

1 Like

My original point stands.

However, a brownout lasting minutes, (which happened to me just a few weeks ago), could cause serious problems for a computer and its power supply. With a UPS, my computer, monitor, switch and router all stayed up and running without apparent problems.

I’ve added Power Supplies as a potential cause of Metadata corruption.


What I don’t get, and it is partly the purpose of this thread, is why we get complete Metadata corruption as often as we do. This means it has to impact both copies, AND any redundancy built into the pool.

Perhaps what we need is a diagnostic script to run against a pool that identifies the Metadata corruption. For example, knowing a user was modifying the following path at the time of the power loss:
/mnt/my_pool/my_dataset/dir/file
Then finding that the Metadata that is corrupt is for:
/mnt/my_pool/my_dataset/dir
makes the problem more understandable.
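The comparison step of such a script could look roughly like this. It is a sketch only: how the corrupt Metadata gets mapped back to a path is the hard part and is not covered here, and the function name and return strings are invented for illustration.

```python
from pathlib import PurePosixPath


def relate(modified: str, corrupt: str) -> str:
    """Describe where the corrupt object sits relative to the path being modified."""
    m = PurePosixPath(modified)
    c = PurePosixPath(corrupt)
    if c == m:
        return "same object"
    if c in m.parents:
        return "parent of the modified path"
    if m in c.parents:
        return "below the modified path"
    return "unrelated"


print(relate("/mnt/my_pool/my_dataset/dir/file",
             "/mnt/my_pool/my_dataset/dir"))
```

A "parent" or "same object" answer would point at the write that was in flight. An "unrelated" answer would suggest the damage came from somewhere else.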

Not that this should happen. But perhaps a bug was introduced into OpenZFS affecting when the 2 copies of the Metadata are written. Writes should happen in this order:

  1. Data, if any
  2. Standard Metadata
  3. Critical Metadata, (like free block list)
  4. Uber blocks

Write barriers on the disk are supposed to guarantee that data is written in this order.
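Why the order matters can be shown with a toy model of a single transaction group. This is a deliberately simplified sketch, not how OpenZFS is implemented: the "disk" is a Python dict, there is one uberblock instead of a ring of them, and all names are made up for illustration.

```python
STEPS = ["data", "standard_metadata", "critical_metadata", "uberblock"]


def fresh_disk() -> dict:
    """A pool whose last complete transaction group is txg 1."""
    disk = {"uberblock": 1}
    for step in STEPS[:-1]:
        disk[(step, 1)] = "ok"
    return disk


def commit(disk: dict, txg: int, order: list, crash_after: int) -> None:
    """Write the blocks of one txg in `order`, losing power after `crash_after` writes."""
    for step in order[:crash_after]:
        if step == "uberblock":
            disk["uberblock"] = txg   # switches the pool over to the new tree
        else:
            disk[(step, txg)] = "ok"  # copy on write: old blocks are left untouched


def import_pool(disk: dict):
    """Return the txg the pool comes up at, or None if the active tree is incomplete."""
    txg = disk["uberblock"]
    complete = all((step, txg) in disk for step in STEPS[:-1])
    return txg if complete else None


def outcomes(order: list) -> list:
    results = []
    for crash_after in range(len(order) + 1):
        disk = fresh_disk()
        commit(disk, 2, order, crash_after)
        results.append(import_pool(disk))
    return results


in_order = outcomes(STEPS)
reordered = outcomes(["uberblock", "data", "standard_metadata", "critical_metadata"])
print("barriers honored:", in_order)
print("uberblock written early:", reordered)
```

With the order honored, every crash point leaves either the old txg or the new one. If something lets the uberblock reach the disk early, a crash in between leaves it pointing at blocks that were never written.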

Perhaps we need to look at the storage devices. Maybe some implement so called “optimizations” for writes that ignore write barriers. This could explain some of the Metadata corruptions at power loss events.

Or perhaps we need to re-look at Linux. I don’t recall this many Metadata corruptions on the FreeBSD version of TrueNAS, (Core). It is possible that we simply have a lot more users because TrueNAS now has a Linux version, which would make Metadata corruption statistically more likely. Or perhaps there is a serious bug in Linux that only rears its head at power loss or OS crashes.

2 Likes

We went from “can’t hardly remember any such case” to having a new case pop up on a weekly basis. I don’t think that the user base increased by a factor that is commensurate with this.

@HoneyBadger suggested it might be linked with single vdev pools, and relative lack of redundancy for pool-wide metadata. In one case with multiple drives having bad SMART reports and multiple label mismatches, he proposed a scenario where drives would drop out one at a time, fall behind, come back online and start to catch up but then another drive had dropped… and so on until so many drives were falling behind that the raidz2 could no longer cope.
But the latest case involves an 8-wide stripe; seven drives are in sync, including one that is possibly failing, and the 8th drive is 5814 txg ahead of the rest. So here no drive could have dropped without the pool being faulted, and the user noticing. No data redundancy, but multiple vdevs to spread pool-critical metadata. There is a SATA controller, apparently without port multiplier, but it controls only four drives, so at least three of the seven which are lagging are NOT on it; it’s not the ASM1166, or at least not only this SATA card. And ZFS would have written data to one single drive for over seven hours without touching any of the other drives in the pool? :roll_eyes:
If it’s an OS driver doing “optimisation” on its own, it is seriously sneaky to hold label updates on 7 drives, 4 distinct locations on each so 28 different locations in total, for more than seven hours without flushing the backlog.
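Spotting this kind of mismatch is easy to automate once the label txg of each drive is known. The sketch below assumes those numbers have already been collected by hand; the drive names and the base txg value are hypothetical, only the 5814 gap and the 7 versus 1 split come from the case above.

```python
from collections import Counter


def summarize_label_txgs(label_txgs: dict) -> dict:
    """Find the txg most drives agree on and how far the others are from it."""
    counts = Counter(label_txgs.values())
    majority_txg, _ = counts.most_common(1)[0]
    outliers = {dev: txg - majority_txg
                for dev, txg in label_txgs.items()
                if txg != majority_txg}
    return {"majority_txg": majority_txg, "outliers": outliers}


BASE_TXG = 1_000_000  # hypothetical value, the real txg numbers are not given here
txgs = {f"disk{n}": BASE_TXG for n in range(1, 8)}
txgs["disk8"] = BASE_TXG + 5814

summary = summarize_label_txgs(txgs)
print(summary["outliers"])

LABELS_PER_DRIVE = 4
lagging_drives = len(txgs) - len(summary["outliers"])
print(lagging_drives * LABELS_PER_DRIVE, "label locations were not updated")
```

A positive number means the drive is ahead of the majority, a negative one means it fell behind.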

2 Likes

Hmm, I think we need to create a procedure that anyone can run, to gather data on a non-importable pool.

Perhaps an add-on tool that is run from the command line, and not integrated into the GUI. (It’s too late for CE 25.10…) Somewhat like Sun Solaris’ Explorer, which was copied by HP, and of which an open source version was later made. This can be leveraged both by iX / TrueNAS staff, and the community to assist users with non-importable pools.

It might take weeks / months to make the tool quite usable. We might even consider it a permanent work in progress. Put it on GitHub, allow people to make Issues against it, even suggested Pull Requests. However, since it will eventually go into CE 25.10.1, iX / TrueNAS should likely approve what it does.
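As a starting point for discussion, a first cut of such a tool might only run read-only commands and collect their output. This is a minimal sketch under assumptions: it uses `zpool import` with no pool name (which lists pools available for import) and `zdb -l` (which dumps the labels of a device); which device paths to pass, what else to collect, and the report format are all left open, and the function names are made up.

```python
import shutil
import subprocess


def build_commands(devices: list) -> list:
    """Read-only commands: list importable pools, then dump each device's labels."""
    commands = [["zpool", "import"]]  # no pool name given, so nothing is imported
    commands += [["zdb", "-l", dev] for dev in devices]
    return commands


def gather(devices: list, run=subprocess.run) -> list:
    """Run each command and return (argv, exit code, combined output) tuples."""
    report = []
    for argv in build_commands(devices):
        if shutil.which(argv[0]) is None:
            report.append((argv, None, f"{argv[0]} not found in PATH"))
            continue
        result = run(argv, capture_output=True, text=True, check=False)
        report.append((argv, result.returncode, result.stdout + result.stderr))
    return report


# Example use, with device paths supplied by the user:
#   for argv, code, output in gather(["/dev/sda1", "/dev/sdb1"]):
#       print("$", " ".join(argv), f"(exit {code})")
#       print(output)
```

Keeping the command list in its own function means people can review exactly what would be run before anything touches their disks.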

So, good idea?
Too complicated?
Not easy enough to do?

While I am willing & able to create GitHub Issues & Pull Requests against such a tool, I don’t think I am up to spearheading such a development project.