Unexplainable But Extremely Fast File Transfers

Ok, this one is totally baffling to me. I often move largish (greater than 50GB) files from one dataset to another on my pool. I use Windows 11 and server-side copy, so it normally goes at about 1.5GB/s. It’s an all-SSD pool, and while I think it should be faster, that’s fast enough for me.

Today, for the first time since upgrading to 24.04.2 (I think), I was doing a copy and it went very fast. Too fast, as in 17GB/s fast. I tried again with a larger number of files, and this is what it looked like:
[Screenshot: Windows Explorer transfer dialog showing the copy speed]

To be perfectly clear, this shows copying files from Pool/Dataset1 to Pool/Dataset2.

There is no way even my SSD pool can read from and write to itself that fast. While the transfer is happening there is no disk activity at all shown in netdata or iostat, which implies no data is actually being copied, yet the files all show up and look normal in Windows Explorer and via the terminal. They don’t appear to be links to files. Deduplication is turned off.

This behavior is also exhibited on a Windows 10 VM, so it’s not isolated to Windows 11 or my desktop computer.

When a file is copied to another dataset, the space used on the dataset as shown in the TrueNAS GUI increases by the appropriate amount for the file size. In this case they are video files, and they play back with no issues after being copied.
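For reference, this is roughly how I was sanity-checking the space numbers from a shell (Pool/Dataset1, Pool/Dataset2, and the pool name Pool are just the placeholder names from above):

zfs list -o name,used,referenced Pool/Dataset1 Pool/Dataset2   # per-dataset accounting, as in the GUI
zpool list -o name,size,allocated,free Pool                    # pool-wide allocation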

File copies done in a terminal session using the cp command go at 2GB/s, with the normal corresponding read and write activity on the drives, so the phenomenon I’m seeing only happens when moving or copying files via Windows Explorer.

I have another pool, and file copies back and forth to that pool via Windows Explorer go at a more normal 1GB/s.

I’m seeing this with all datasets in the pool except one, and that dataset uses a 128K record size. Changing its record size to 1M like the others replicates the fast transfer behavior; changing it back to 128K reverts to around 1GB/s. Clearly this is related to record size somehow.
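For reference, the record size check and change were done with something like this (the GUI works too; Pool/Dataset3 is a made-up name standing in for the 128K dataset):

zfs get recordsize Pool/Dataset3
zfs set recordsize=1M Pool/Dataset3      # recordsize only applies to newly written data
zfs set recordsize=128K Pool/Dataset3    # put it back and the ~1GB/s behavior returns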

So what is going on here? I’m stumped as I’ve never seen this before. The speed is fantastic, but I don’t want to rely on it being real.

I suspect that what is happening here is ZFS “block cloning”, whereby the data is not actually copied; instead the existing blocks are referenced in the new file system and de-referenced in the old one (but potentially still referenced by snapshots in the old one).

Since the data is not being copied, only the metadata is being updated, it can be insanely fast (and it saves you disk space, because the data blocks are not actually duplicated until they are modified).
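A quick way to confirm the feature is present on a pool is something like this, with yourpool standing in for the actual pool name:

zpool get feature@block_cloning yourpool   # should report enabled or active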


:point_up: This is correct. You are seeing block-cloning at work here. ZFS is essentially inline de-duplicating the data and copying only metadata surrounding the file copy.


Awesome to see a relatively new building block fit alongside existing capabilities to enable good, useful functionality with a very visible impact.


Fascinating! Thank you for the responses.

After some quick Googling, this seems like a relatively new thing. When was it introduced to Scale? I’m pretty sure it was not present in Core.

You mention that existing blocks are referenced in the new file system and de-referenced in the old one. I assume that if I were to delete the dataset the data was copied from, ZFS is smart enough to know not to delete blocks allocated to it that are now referenced by another dataset? Not sure if this makes sense, but I’m used to thinking of datasets as completely separate file systems that are not able to share or track centralized data.

I’ve never used deduplication because I haven’t had a great use case for it, and saw all the warnings about memory allocation and performance issues. How is this different? Why doesn’t it have the same overhead?

My understanding of deduplication was that it was only within a dataset, and not pool-wide, but that since it functioned at the block level, it would save space even on files that were not identical. With block cloning, it obviously saves space for identical files that are duplicated, but will it save space for files that merely share similar blocks? Could it save space on a pool-wide basis even if files are not exact copies? Am I just confusing the two technologies?

I dug up the following:

root@nas[~]# zpool get allocated,bcloneused,bclonesaved,bcloneratio Main
NAME  PROPERTY     VALUE  SOURCE
Main  allocated    34.9T  -
Main  bcloneused   11.0G  -
Main  bclonesaved  44.0G  -
Main  bcloneratio  5.00x  -

I do not have duplicate files on this pool as far as I’m aware, but my interpretation of this is that there is 11G of physical data with 44G of “reference” data, and that block cloning is saving me 44G of space. Am I interpreting this correctly? How do I find where this data is? I’m wondering if it is from the dataset where I’m storing some backups that might have duplicate blocks. It would be nice if some of these stats were present in the GUI under datasets, since the space-used numbers can be confusing. I can easily “consume” more data than the size of the disks in my pool.
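If I understand the property descriptions, the ratio is simply derived from the other two numbers, which seems to match that reading:

bcloneused   11.0G   actual allocated blocks that are referenced by cloned files
bclonesaved  44.0G   space the extra references would have needed without cloning
bcloneratio  (11.0G + 44.0G) / 11.0G = 5.00x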

Why does this work through Windows Explorer and not the cp command? Do file management tools need to be block-cloning aware, and Windows just happens to be?

Anyway, this is very cool, and I appreciate your patience with my questions!

With OpenZFS 2.2.1 (IIRC 2.2.0 had the nasty bug).
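You can see which OpenZFS version your SCALE install is actually running with:

zfs version   # prints something like zfs-2.2.x plus the kernel module version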


There are two parts to making this work: first, server-side copy in SMB, and second, the range-copy function on the server now uses block cloning.

The problem with actual de-dupe is finding the original block that the new block would duplicate, so you can reference the original when writing a new block. Hence dedupe tables.

When block cloning, you already know where the original blocks are that you are writing dupes for, so you don’t need a table.

If your record size is different, it means the block size will be different, and thus the copy can’t be a clone.
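If you want to watch it happen, here is a rough sketch (dataset, pool, and file names are just the placeholders used earlier in the thread, and whether a local cp actually clones depends on your coreutils version and whether block cloning is enabled for local copies on your build):

zfs get -o name,property,value recordsize Pool/Dataset1 Pool/Dataset2   # record sizes have to match for cloning

# note the counters, run the Explorer copy, then check again;
# bclonesaved should grow by roughly the size of what was copied
zpool get bcloneused,bclonesaved,bcloneratio Pool

# GNU cp can also request a clone explicitly (treat this as an experiment)
cp --reflink=always /mnt/Pool/Dataset1/video.mkv /mnt/Pool/Dataset2/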


To add (perhaps adding some confusion…): block cloning is more like hard links, with copy on change.

This is similar to how a clone of a dataset works. Any change to the clone that was shared with the original (via a snapshot) causes the cloned dataset to allocate new space for the change (including new metadata for the change).
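As a tiny illustration of that analogy (names made up), a clone starts out sharing everything with its snapshot and only allocates space as you change it:

zfs snapshot Pool/Dataset1@before
zfs clone Pool/Dataset1@before Pool/clone-test
zfs list -o name,used,referenced Pool/Dataset1 Pool/clone-test   # the clone's USED starts out near zero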


@Arwen With block cloning, let’s say I make a copy of a 1GB file, and this takes up no additional space because it’s a clone. What happens if I make 500MB of changes to the copy? Does it simply use up 500MB of storage to save the difference, making my total storage for the 2 files 1.5GB?

@MikeyG - Short answer: yes, you then use 1.5GBytes.

Longer answer: it would be 1.5GBytes AND some extra metadata space to reference the changes on the side that was changed.

It could be even more than 1.5GBytes if you touched 1 byte in a bunch of ZFS blocks. A ZFS block might be 128KBytes, and touching 1 byte in a block would force the entire block to be un-cloned, thus taking up more space.
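Rough numbers for that worst case, assuming a 128KByte record size as above:

1GB file at 128KB records     = 8192 blocks, all cloned, ~0 extra space
overwrite 500MB of the copy   = ~4096 blocks rewritten, ~500MB of new space (total ~1.5GB)
touch 1 byte in every block   = all 8192 blocks rewritten, ~1GB of new space (total ~2GB)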


ZFS stores 2 copies of metadata by default (aka directory entries that reference data), because ZFS considers loss of a directory entry worse than loss of data. It's complicated, but in a regular file system, if you lose the directory entry, you may have lost the entire file. (Though some FSCK processes may be able to restore some or all of the data.)

ZFS takes the opinion that metadata is simply more important. It takes little space compared to most data, so in general it is not a problem.
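If you are curious, the relevant dataset properties can be checked with something like this (Main being the pool from earlier; copies applies to data, while redundant_metadata controls how much of the metadata gets the extra ditto copies):

zfs get copies,redundant_metadata Main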

As an example, my miniature media server has a ZFS-mirrored OS, but the actual media is striped across both storage devices (to allow for enough space). Any loss of a block forces me to restore the media file from backups. Not a problem, it just takes time.

But once I saw a bad block error that was automatically corrected. Drove me crazy for a few weeks until I realized that the block was in metadata, which had 2 or more copies. So, ZFS was able to automatically supply the metadata to the requesting program, AND fix the broken copy.


I love ZFS!