Where is my snapshots' used space?

Recently I set up a snapshot schedule for my datasets. Before this I moved my Mac Time Machine backup from my old NAS. It consists of 2 files: .sparsebundle – the actual backup (~270GB) and .purgeable – a screwed-up backup (~400GB).

After a while, Time Machine decided to remove .purgeable (perhaps because I’ve properly set the Time Machine quota this time). And here is the lie of the land:


sudo zfs list -t snapshot <time-machine-dataset> showed the same values.

I thought I had grasped the basic concepts of ZFS snapshots :roll_eyes: – that the used space of a snapshot is the amount of data exclusively contained/referenced by that snapshot. If my understanding is correct, all the snapshots combined that still contain the 400GB .purgeable file must use (at least) 400GB of space. Yet they use just a few GB. OTOH, the dataset itself shows the correct 756 GB usage.

So my question is – where is my backup? Or to be more precise – where is the snapshots’ used space?

Understanding how snapshots work and understanding how the tools present the space allocated by them are two very different things :slight_smile:

You can see the space used by the 400 GiB .purgeable being freed where the Referenced space drops by almost 400 GiB (the "almost" being due to compression). As to why the 400 GiB does not show up in Used for the snapshot immediately prior to the drop in Referenced, I cannot say. I can say that snapshot space accounting has been opaque since the very beginning of ZFS (long before OpenZFS, and OpenZFS kept the same behavior).
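One way to cross-check (reusing the placeholder dataset name from above) is to look at the dataset's usedby* breakdown; the space held only by snapshots should appear under usedbysnapshots:

    zfs get used,usedbydataset,usedbysnapshots,usedbychildren <time-machine-dataset>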

A snapshot tracks blocks that have changed between snapshots. After you take a snapshot, when a change is made to a block, the copy-on-write mechanism of ZFS copies the original block (leaving it alone) to a new one and writes the updates to that copy. That changed block will become part of the next snapshot.

The Used column indicates the amount of data that changed from the prior snapshot. The Referenced column shows the total amount of data of all the blocks referenced by that snapshot, some of which may not have changed from the prior snapshot.
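As a sketch of how those two columns typically look for a case like this (snapshot names and numbers below are invented purely for illustration, not taken from the screenshot):

    zfs list -t snapshot -o name,used,refer <time-machine-dataset>
    NAME                                  USED  REFER
    <time-machine-dataset>@auto-04-28     1.1M   760G
    <time-machine-dataset>@auto-04-29     1.4M   760G
    <time-machine-dataset>@auto-04-30      96K   360G

Each snapshot's Used stays tiny because almost every block it references is shared with neighbouring snapshots; the deletion only shows up as the drop in REFER.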

Your story matches your screenshot; .purgeable got deleted at the source sometime between 4/29 @ 3:55 and 4/30 @ 11:25. The data for it is still on disk, tracked by the block pointers associated with the snapshots on and before 4/29. Those blocks won't be reused by ZFS until you purge all of those snapshots from 4/29 and earlier.
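If/when you want that space back, a range of snapshots can be destroyed in one command; with -n -v it is a dry run that only reports how much space would be reclaimed (snapshot names here are placeholders):

    zfs destroy -nv <time-machine-dataset>@<oldest-snapshot>%<snapshot-from-4-29>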

I’d read about ZFS Copy On Write.

Yeah, I (kinda) understand it. What I don't understand is why the "Used" of the last snapshot with .purgeable doesn't show hundreds of GB, as it is still pointing to the blocks for this file.

That is not how snapshots on ZFS work. At no point is snapshot data copied anywhere. This is why ZFS snapshots are so very fast and cause no performance degradation.

What you describe is how traditional (non-ZFS) snapshots work.

When you take a ZFS snapshot you are making a copy of the metadata tree at that point in time. Future writes are handled like any other ZFS write: new space is allocated and the active metadata tree is updated with new block pointers for the data that was just written. The snapshot's copy of the metadata tree still points to the old data, which is not freed until the snapshot (and its copy of the metadata tree) is destroyed. There is no read/write cycle as new data is written.

Remember that ZFS’s copy-on-write design means that no data is ever modified in place; any write allocates new space and the old space is freed (if not in use by a snapshot or clone).

I think the 400 GiB of space used by the snapshot would show up in the Used column for the first snapshot to include that data. After that it only shows up in Referenced.
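As an aside, the written property reports how much data was newly written between a snapshot and the one before it, which sidesteps the "exclusive space" semantics of Used; something like this (dataset name is a placeholder) lists it per snapshot:

    zfs list -t snapshot -o name,used,refer,written -s creation <time-machine-dataset>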


Makes sense. That’s why I used the vague "all snapshots combined <…> must use (at least) 400GB".

So, does ZFS not account for used space "instantly"?

I suspect that space for snapshots is not determined at the time the zfs list command is run. That would take a bunch of time, as it would have to walk each snapshot's metadata tree, so I'm guessing that some space stats are stored in the metadata tree and those are used to generate the Used and Referenced columns.
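That lines up with how the properties behave in practice: the values come back instantly and can be read as exact, machine-parseable bytes (dataset and snapshot names below are placeholders):

    zfs get -Hp used,referenced <time-machine-dataset>@<snapshot>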

The one place where snapshots are not performant is getting listings of large numbers of snapshots. I have managed systems with hundreds of thousands of snapshots. Our management scripts cached lists of snapshots to speed things up.
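A minimal sketch of that kind of caching, assuming plain shell and placeholder dataset/file names (the real management scripts were presumably more elaborate):

    # refresh the cache periodically (e.g. from cron)
    zfs list -Hp -t snapshot -o name,used,creation -s creation <time-machine-dataset> \
        > /var/tmp/snapshot-cache.tsv

    # consumers read the cached file instead of calling zfs list again
    awk -F'\t' '{ total += $2 } END { print total, "bytes used by snapshots (cached)" }' /var/tmp/snapshot-cache.tsv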

Hmm… that’s very interesting. However, don’t your statements kind of contradict each other? I mean, if snapshots’ used space is not calculated on the fly, then caching the list of snapshots (perhaps with "used") doesn’t make much sense (in this particular aspect, at least).

This might help you better understand what you’re observing.

The last part addresses this “math doesn’t add up” phenomenon of ZFS snapshots.


Ok. I do believe that I understand snapshots’ “used” size better now.

Btw, imo, you should have mentioned that if one fills the truck to 95%, they push the truck (to the new city) by hand.


That’s not what I said.

This answers my unspoken question – why does ZFS even need snapshots to restore data if it doesn’t delete the blocks right away? It seems like while changes to the data blocks are COW, changes to the metadata are not.
Just imagine if the metadata itself were COW and we could restore any file to any point in time*.


* Limited only by the capacity of the pool.

So while recently deleted data blocks would be technically present on the drives, there would be no metadata (without a snapshot) that refers to those blocks. These are just my guesses.

Can you please advise some readings about how ZFS stores the metadata?

Not quite. Nothing is copied when a snapshot is taken. Rather, all active blocks (including filesystem metadata) are immediately referenced by the snapshot. (i.e., the "colored stickers" go on all boxes currently tagged with a "white sticker". No new boxes are copied or duplicated.)


This is not the case, as I explained above.

Everything is pointed to (referenced) by a snapshot. This includes all data blocks and all filesystem metadata[1]. Everything.

In fact, if you have a 1-GiB file named “bigfile.dat”, then take a snapshot of the dataset, and then the only change you do is to rename the file to “giantfile.dat”…

The snapshot will reference the entire filesystem, block for block, metadata for metadata, with the only difference being the name of the file. (No data blocks will differ whatsoever.)

You can even check with the zfs diff command, which will show you that the only difference between the snapshot and the live filesystem is that "bigfile.dat" was renamed to "giantfile.dat", and the snapshot will likely consume only 4 KiB. (Because that is the amount of unique space consumed by the old metadata.)
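For the curious, the check looks something like this (dataset, snapshot, and mountpoint names are hypothetical; the R prefix marks a rename in zfs diff output):

    zfs diff <pool/dataset>@before-rename <pool/dataset>
    R       /mnt/<dataset>/bigfile.dat -> /mnt/<dataset>/giantfile.dat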


  1. Directory tree, modification time, size, filename, etc…

So, at the end of the day – are metadata changes also COW? If so, why can’t we restore to any (recent) point in time?

“Copy on in-place modification”.

“Copy on write” is a bit of a misnomer. :yum:

ZFS, like other “CoW” filesystems, does not modify existing blocks of data or metadata, even if the software thinks it’s doing an “in-place modification”.

Ironically, much software to this day doesn’t do in-place modifications, as the application will often create a temporary duplicate file for safety reasons. The newly "modified" file is actually a brand-new copy with the changes made by the application. (Tools like rsync thankfully provide options, such as --inplace, to be friendlier to CoW filesystems.)
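For reference, that option is used like this (paths are placeholders); without --inplace, rsync writes a whole new temporary copy and renames it over the destination, so every block of the file ends up rewritten even if only a little of it changed:

    rsync -a --inplace /path/to/source/ /path/to/destination/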


We can?

When you rollback to a snapshot, you basically “rewind” the dataset’s filesystem to that exact point in time.
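In command form it is a single operation (dataset and snapshot names are hypothetical):

    zfs rollback <pool/dataset>@known-good-snapshot
    # add -r only if newer snapshots exist and you are willing to destroy them:
    zfs rollback -r <pool/dataset>@known-good-snapshot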

Unless you’re thinking of something else?

No, I meant even without a snapshot. Snapshots guarantee that data and metadata stay intact regardless of the amount of data written after snapshot creation.

But if we don’t have a snapshot, recent (modified) data and metadata blocks would still be there. And I assume that the less actual ("white sticker") data the pool has, the more of the older blocks are still "intact".

You can create a task to take a snapshot of your dataset every 5 seconds, if you really need that level of granularity. It would result in a massive snapshot list, though.
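A crude sketch of that, assuming a plain shell loop and a placeholder dataset name (a real setup would use a snapshot task scheduler plus a pruning policy instead):

    while true; do
        zfs snapshot "<pool/dataset>@auto-$(date +%Y%m%d-%H%M%S)"
        sleep 5
    done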

Otherwise, there’s no grand pointer that can be referred to for rolling back your dataset.

Those blocks would be discarded. After a modify operation completes successfully, the “white stickers” are removed from those particular “boxes”. (The same process as a delete operation. The only difference is that a delete operation involves all blocks associated with the entire file, while a modify operation only involves particular blocks of the file’s composition.)

When I say “modify”, I’m specifically referring to in-place modifications.

I’m aware of that. I was talking about something like hypothetical last(?) resort restoration tool.

What exactly did you mean by “discarded”?
Several years ago, when I first discovered ZFS and decided that my next NAS would make use of it, I read that all ZFS writes (even the random ones) are basically sequential, that the speed degrades over time with disk fragmentation, and that those 80-95% usage limits are just correlated with the fragmentation level.

So if I understood it correctly back then, the box won’t be moved off the truck the moment it has no more stickers. It will stay in the truck until the very moment the loader can’t find free space for a new box. Then he smashes this non-stickered "empty" box to pieces and puts the new box in its place.

Perhaps TRIM makes it all look different, but let’s just assume we are talking about plain old HDDs.

There are “recoveries” possible, which entail using emergency import options for a specific “transaction group” (TXG). This is pool-wide, though, and is meant only for emergencies.
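For completeness, the documented form of that emergency rewind is the -F option to zpool import; combined with -n it only reports whether the rewind would work, without actually doing it (the pool name is a placeholder):

    zpool import -F -n <poolname>    # check whether discarding the last few transactions would allow import
    zpool import -F <poolname>       # actually perform the rewind and import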

For normal operation, if you don’t have a snapshot, then don’t expect that you can safely rollback to anything.


Not “sequential” in terms of the disk layout, but rather as more efficient “write batches” (known as “transaction groups”).

For all intents and purposes, it is removed upon being discarded. No blocks “linger” in a ZFS pool after they are destroyed. The previously occupied space should be immediately available for future writes. (Maybe not “immediately”, but within seconds at most.)

This is the same for other filesystems. Once a (or all) pointer to a file is removed, the space which it occupied is available for future writes.


You might be confused by the term “copy on write”. It’s misleading, since nothing is copied on write, and technically nothing is copied on modification.

The more accurate term would be: “Do not touch any existing blocks for in-place file modification, but rather write the modified data to a new block, and then upon completion point to this new block and remove the pointer to the old block.” Try making an acronym for that.

Doesn’t roll off the tongue like “copy on write”. :wink: