ELI5: ZFS "rewrite"

From last week’s episode of TrueNAS Tech Talk, the subject of “ZFS rewrite” came up.

Apparently, this is a zfs subcommand that reads and rewrites existing blocks of data while bypassing some userspace overhead. It does not suffer from the “double allocation problem” associated with conventional file copying in the presence of snapshots. Nor does it, supposedly, modify any filesystem metadata, such as timestamps and filenames.

Sounds great, right?

Did you change the dataset’s recordsize and want to retroactively apply it to existing files? Now you can.

Did you change the dataset’s compression and want to retroactively apply it to existing files? Now you can.

Did you add a new vdev to your pool to increase its total capacity, and now you wish to rebalance your existing data across all vdevs? Now you can.

Did you expand your RAIDZ vdev, and now you wish to rebalance the existing data and use the more “efficient” allocation (and more accurate space calculation)? Now you can.
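Presumably, the workflow would look something like this. (A sketch based on my reading of the PR; the -r recursive flag is how the PR describes walking directories, so double-check the actual man page for your release.)

```sh
# Change a property on the dataset; normally this only affects new writes.
zfs set compression=zstd mypool/mydataset

# Rewrite existing data in place, so the new compression (or a new vdev
# layout, after adding/expanding vdevs) applies to blocks written earlier.
zfs rewrite -r /mnt/mypool/mydataset
```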


I’m not quite sold yet because I really don’t know what is happening. I tried to read through the PR on OpenZFS GitHub, but the technical jargon flew over my head.

I’ll ask my questions like I’m 5 years old, if anyone would be so kind to answer them like I’m 5.

@mav, since you wrote the code, I’d very much appreciate if you are able to give “user-friendly” explanations.

I apologize if these were already answered on GitHub. I tried my best to read through the entire thread, but got lost in some of the technical stuff.

  1. What happens if you lose power or the system crashes in the middle of using zfs rewrite?
  2. Are the data blocks that comprise the file the only things being read and rewritten? No metadata blocks are being touched?
  3. If you have a 1-MiB incompressible file currently saved under a dataset with recordsize=128K, it is comprised of 8 blocks. If you change the dataset to recordsize=1M and then run zfs rewrite on the file, will it now be comprised of a single 1-MiB block? If so, doesn’t this mean that zfs rewrite in a sense “violates” a rule of ZFS? What happens to an existing snapshot that refers to the 8 blocks of data?
  4. Similar to point 3, what happens if you go in the opposite direction? You change the recordsize from 1M to 128K. A snapshot referring to a single block (with a unique pointer) must now point to 8 different blocks with 8 different pointers?

:warning: This makes me feel uneasy, since mucking around with existing blocks of data that are being referenced by snapshots seems risky, and it could introduce unpredictable bugs in the future.

I might not be interpreting this correctly, and could be mistaken about what zfs rewrite actually does to existing blocks and how it affects current snapshots.

  1. What happens if you lose power or the system crashes in the middle of using zfs rewrite?

Since rewrite does not modify the user data themselves (only where and how they are stored), it does not need to write the ZIL. So after a reboot you’ll see the state as of the last transaction group ZFS was able to commit, which may be a few seconds back. Any blocks rewritten after that point will simply revert to their previous state, unless modified by some other application with stricter requirements.
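In other words, recovery is just rerunning the command; a rough sketch (the -o/-l offset and length flags are taken from the PR description, so verify them against your release):

```sh
# After a crash, blocks from uncommitted transactions simply revert to
# their previous (still valid) state; no ZIL replay or fsck-style repair
# is needed. Rerunning the command is safe, just redundant for blocks
# that were already committed.
zfs rewrite -r /mnt/mypool/mydataset

# Per the PR, -o (offset) and -l (length) could in principle restrict a
# rerun to the tail of a single large file instead of starting it over.
zfs rewrite -o 536870912000 /mnt/mypool/mydataset/huge.img
```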

  2. Are the data blocks that comprise the file the only things being read and rewritten? No metadata blocks are being touched?

The command explicitly rewrites only the user data blocks of regular files. That in turn affects the related indirect blocks, dnodes, space maps, you name it, all the way up to the uberblock. However, large directories, extended attributes, and anything else whose blocks are not the data blocks of regular files are not currently rewritten.

Did you change the dataset’s recordsize and want to retroactively apply it to existing files? Now you can.

  3. If you have a 1-MiB incompressible file currently saved under a dataset with recordsize=128K […] will it now be comprised of a single 1-MiB block?
  4. Similar to point 3, what happens if you go in the opposite direction? You change the recordsize from 1M to 128K. […]

No, you can’t. Rewrite works at the per-block level, so the recordsize is the one thing it cannot change.
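So to retroactively apply a new recordsize you still need an ordinary file-level copy, which writes brand-new blocks at the current recordsize. A minimal sketch with standard tools (note the copy temporarily doubles the file’s space usage and is not atomic):

```sh
zfs set recordsize=1M mypool/mydataset

# A plain copy allocates new blocks at the dataset's current recordsize;
# zfs rewrite cannot do this, since it works strictly per existing block.
cp -p /mnt/mypool/mydataset/big.dat /mnt/mypool/mydataset/big.dat.tmp
mv /mnt/mypool/mydataset/big.dat.tmp /mnt/mypool/mydataset/big.dat
```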

:warning: This makes me feel uneasy, since mucking around with existing blocks of data that are being referenced by snapshots seems risky, and it could introduce unpredictable bugs in the future.

The nice thing about rewrite is that it does nothing a user couldn’t do otherwise. In particular, it does not affect snapshots. The price for that is that if the blocks you rewrite are part of a snapshot, a clone, etc., they become independent copies, which may mean additional space usage, larger incremental replications, and so on. This should be considered before running it on large chunks of data at once, to avoid surprises.
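One way to see that cost is to watch the snapshot space accounting before and after; a sketch with standard commands and made-up names:

```sh
zfs snapshot mypool/mydataset@before
zfs list -o name,used,refer,usedbysnapshots mypool/mydataset

zfs rewrite -r /mnt/mypool/mydataset

# The rewritten blocks are now independent copies: the old blocks are
# charged to the snapshot, so usedbysnapshots grows by roughly the
# amount of data rewritten.
zfs list -o name,used,refer,usedbysnapshots mypool/mydataset
```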


This is what originally confused me. Now it makes much more sense. :slightly_smiling_face:


What was throwing me off is that you issue zfs rewrite against files and folders (i.e., at a “higher” level than blocks). You’re not issuing the command against a dataset or a ZFS object.

This gave me the (incorrect) impression that it was reading and rewriting entire files, as if the file’s composition in terms of blocks were irrelevant to the rewrite; as if it were writing brand-new files that do not yet exist, or even doing a standard file-based “copy, then delete the original”. I thought zfs rewrite was only using the supplied files to know which blocks to target (and possibly combine).

In this regard, I was wrong. (Thankfully so!)


On that matter, and based on some comments I read, could using zfs rewrite reclaim space for highly compressible files after changing the dataset’s compression property?
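If it can, I imagine verifying it would look something like this (a sketch, assuming no snapshots are pinning the old blocks):

```sh
zfs set compression=zstd mypool/mydataset
zfs get compressratio mypool/mydataset    # ratio for data written so far

zfs rewrite -r /mnt/mypool/mydataset

# With no snapshots holding the old blocks, the reclaimed space and the
# improved ratio should show up here.
zfs get compressratio,used mypool/mydataset
```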


Let me throw a curve ball at you.

What about files (and hence blocks) that only exist in snapshots? :wink:

Based on what you said, this is what happens when you use zfs rewrite:

  1. The highly compressible file big.dat currently exists in the dataset’s live filesystem, as well as in some snapshots.
  2. You run zfs rewrite /mnt/mypool/mydataset/big.dat
  3. All the blocks that comprise big.dat are rewritten with better compression, thus saving space on the pool
    3a. The snapshots that reference these blocks operate as usual. They’re pointing to the same blocks, which just happen to be smaller due to better compression?

Is that right? If so, here’s my curve ball to you, which I would be very curious about with zfs rewrite:

  1. The highly compressible file big.dat only exists in snapshots
  2. You run zfs rewrite /mnt/mypool/mydataset/.zfs/snapshot/manual-20250501/big.dat
  3. Are all the blocks that comprise big.dat rewritten with better compression, thus saving space on the pool?
    3a. Does it matter that they are only referenced by snapshots?
    3b. If so, why? Wouldn’t it be the same as with a readonly filesystem? Shouldn’t zfs rewrite still work on a readonly filesystem, since it is only changing the underlying blocks?

EDIT: I missed this. You might have added it later in your post.

The price for that is that if the blocks you rewrite are part of a snapshot, a clone, etc., they become independent copies, which may mean additional space usage, larger incremental replications, and so on.

That actually changes a lot about my assumptions. It suggests that the presence of snapshots nullifies the space-saving benefits of using zfs rewrite as opposed to a file-based method of “copy and delete originals”. :frowning_face:

It suggests that the presence of snapshots nullifies the space-saving benefits of using zfs rewrite as opposed to a file-based method of “copy and delete originals”.

Right. You cannot run rewrite on a snapshot, since snapshots are read-only by design. Running it on a file in the live file system will separate that file from the snapshot, taking additional space. If the file (or rather, the blocks of it that you rewrite) is also part of some snapshot, then the old space will be freed only when the last snapshot including it is deleted. There is room reserved for future options to limit which blocks are rewritten, such as skipping blocks belonging to snapshots, but those are not implemented yet.
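A sketch of that lifecycle, with made-up snapshot names:

```sh
# big.dat's old blocks are referenced by two snapshots.
zfs rewrite /mnt/mypool/mydataset/big.dat

# Destroying one snapshot is not enough; the old blocks survive as long
# as any snapshot still references them.
zfs destroy mypool/mydataset@manual-20250401
zfs list -o name,used,usedbysnapshots mypool/mydataset

# Only destroying the last snapshot that includes those blocks frees the
# old copies.
zfs destroy mypool/mydataset@manual-20250501
zfs list -o name,used,usedbysnapshots mypool/mydataset
```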


Now it all makes sense.

I was under the impression that zfs rewrite is a low(er)-level tool that works beneath any userspace awareness.

I applaud the work being done, and I’m glad to see ZFS add another tool to its belt.

However, snapshots are ubiquitous. As a conservative user, I would never dare destroy my snapshots for the sake of “rebalancing”, even if rewrite can do it faster than traditional userspace tools.

If your snapshots are not permanent and have a finite lifetime, you could rewrite one part at a time and give the snapshots a few weeks to rotate in between.
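For example (a sketch; the directory split and the two-week retention are made up):

```sh
# Week 1: rewrite one batch; its old blocks stay pinned only until the
# snapshots that reference them rotate out.
zfs rewrite -r /mnt/mypool/mydataset/dir-a

# Week 3, after those snapshots have expired:
zfs rewrite -r /mnt/mypool/mydataset/dir-b

# ...and so on, so only one batch of duplicated blocks exists at a time.
```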


Yes, this is likely the best option.

Of course, if the files have block clones, or the file’s dataset has clones based on its snapshots, that is a different story.