ZFS Metadata - How it works, plus discussion on corruption

I put in a feature request along the same lines a while ago and it was not taken up. I concur that better "your pool couldn't be imported" advice than "your pool is dead, rebuild from backup" would be great.

I don’t think Honey Badger even knows what is causing this. He has been on plenty of threads diagnosing, and attempting, to get the pools back.

We run a lot of commands just to try to sort out the problem from the beginning. I don’t see a tool being made to assist unless it is just some sort of script that runs all the diag commands and captures the output into a file to review.

Take a look at @protopia and the standard list of commands used for problems.


Yes, that was what I intended: a single script to gather all the information needed to diagnose a pool import failure.

The way Sun’s Explorer worked is that it ran individual commands and captured the output, sometimes running a command a second time with different options. When all the commands had been run, it would tar up and gzip the output.
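Something along those lines, as a very rough sketch — the command list and file layout here are just placeholders, not a finished tool:

#!/bin/sh
# Rough sketch of a one-shot diagnostics gatherer for pool import
# failures. The command list is illustrative, not exhaustive; a real
# tool would also want zdb label dumps and device enumeration.

OUT="zfs-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"
i=0

run() {
	# Capture stdout and stderr of each command into its own file.
	i=$((i + 1))
	{ echo "== $*"; "$@"; } > "$OUT/$(printf '%02d' "$i").txt" 2>&1
}

run zpool status -v	# state of any pools that did import
run zpool list -v
run zpool import	# pools that are visible but not imported
run zpool events -v	# recent ZFS events, if any
run dmesg		# kernel messages often hold the real clue

# Explorer-style: bundle everything up for posting or for support.
tar -czf "$OUT.tar.gz" "$OUT"
echo "Wrote $OUT.tar.gz"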

Hi!

If you use replication tasks: there was a bug in OpenZFS which can cause metadata corruption. Here is the link on GitHub:
Fix 2 bugs in non-raw send with encryption by gamanakis · Pull Request #17340 · openzfs/zfs · GitHub

I have a RAIDZ3 pool, ECC memory, a UPS… and I still encountered metadata corruption.
Here is the issue in the bugtracker:
[NAS-135455] ZFS metadata corruption with enrypted ZFS pool and send/receive - iXsystems TrueNAS Jira
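
For anyone trying to work out whether their setup matches: the PR above concerns non-raw sends from encrypted datasets, i.e. streams where the blocks are decrypted on the way out. A quick illustration of the difference (pool and dataset names here are made up):

# Non-raw send: blocks are decrypted and sent in the clear, then
# re-encrypted (or stored plain) on the receiving side. This is the
# code path the PR above fixes bugs in.
zfs send -i tank/enc@snap1 tank/enc@snap2 | ssh backup zfs recv pool2/enc

# Raw send (-w): blocks are sent still-encrypted, exactly as on disk,
# and the receiver never needs the keys.
zfs send -w -i tank/enc@snap1 tank/enc@snap2 | ssh backup zfs recv pool2/enc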

If this doesn’t sound like your problem, please ignore this post :grin:.

Reading through this long chain of messages, it’s not clear to me whether the errors are limited to the receiving NAS being sent corrupted metadata, or whether the sending machine also suffers from corrupted metadata.

The good news is that the OpenZFS community / Linux devs are aware of the problem; what is less clear is what progress they have made since April in addressing the issue.

AFAICT, that bug is related to zfs crypto.

Certainly, that is a good datapoint to capture from “cases” in the future.

And although it is in regard to snapshots and replication, IIRC it was affecting the source.


Thank you for posting such a deep dive on ZFS metadata!

Personally, I prefer ECC memory over high-frequency overclocked memory. Memory has to be very solid and stable in a system.

In the big 128-byte ZFS blkptr_t, up to 3 copies (DVAs) of a block can be recorded.

typedef struct blkptr {
	dva_t		blk_dva[SPA_DVAS_PER_BP]; /* Data Virtual Addresses (SPA_DVAS_PER_BP == 3) */
	uint64_t	blk_prop;	/* size, compression, type, etc	    */
	uint64_t	blk_pad[2];	/* Extra space for the future	    */
	uint64_t	blk_birth_word[2];
	uint64_t	blk_fill;	/* fill count			    */
	zio_cksum_t	blk_cksum;	/* 256-bit checksum		    */
} blkptr_t;

Thanks to this design, I can use ZFS relatively safely even on a single-disk system.
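
If you want the same protection for file data on your own single-disk pool, the knob is the copies property — a quick example (the dataset name is made up):

# Keep two ditto copies of user data on this dataset. On a single
# disk this guards against localized bad sectors, not against losing
# the whole drive.
zfs set copies=2 tank/important

# Note: pool metadata already gets extra copies automatically, and
# copies= only affects blocks written after the property is set.
zfs get copies tank/important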

There is an exception to this: if you’re using ZFS native encryption, you only get 2 copies. The slot where you would have the third DVA is instead reserved for the block’s cryptographic salt and IV.

You can see that code in the Storage Pool Allocator here:

There was an issue a while back (caused by OpenZFS#5229) where ZFS would allow property inheritance and explicit zfs set of copies=3 even when it was not supported.
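
If you want to check how many DVAs a given block actually got, zdb can print the block pointers — a rough example (pool, dataset, and object number are placeholders):

# Dump object 42 in tank/enc at high verbosity; each L0 block pointer
# line shows one DVA[n]=<vdev:offset:asize> entry per stored copy.
zdb -ddddd tank/enc 42

# On a natively encrypted dataset you should see at most two DVAs per
# block, since the third slot holds the salt/IV instead.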
