Understanding and thoughts about dedup bottlenecks & hashes

I'm just going to throw out some thoughts about how to make dedup faster. This might be a stupid idea, since I know I don't understand everything about how dedup works. It might not work at all, or it might even be how it already works and I just understood it poorly.

Listening to a long talk about the new “fast dedup”, it sounds like a big problem is the amount of memory required to hold the hashes for all of the blocks, and to sort and search through them. If I understood correctly, fast dedup essentially guesses which of those hashes can be thrown away and forgotten, because those blocks “probably” won’t get a duplicate later.

Something I thought of: why not use much shorter hashes in RAM? You could store the full 512-bit hash somewhere with plenty of space, like on disk, and keep only truncated versions in RAM. The shorter hashes are more likely to collide (still very unlikely, but no longer insignificant), and a “real” collision would be bad. However, in the unlikely event of a collision, you just pay the performance penalty of checking the full hash from the slower disk. The vast majority of the time, though, you would be working with a much smaller set of data.
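To make the idea concrete, here is a rough Python sketch of what I mean. This is not how ZFS’s DDT actually works; `DictStore`, `TruncatedDedupIndex`, and the 4-byte prefix are all made up just for illustration. The RAM index only keeps a short prefix of each block hash, and only when a prefix matches do we go to the slower full-hash store to find out whether it’s a real duplicate or a false alarm.

```python
import hashlib
from collections import defaultdict

PREFIX_BYTES = 4  # truncated hash kept in RAM (32 bits here, purely illustrative)


class DictStore:
    """Stand-in for the slow on-disk store that holds the full hashes."""
    def __init__(self):
        self._d = {}

    def get(self, key):
        return self._d.get(key)

    def put(self, key, value):
        self._d[key] = value


class TruncatedDedupIndex:
    def __init__(self, full_hash_store):
        # prefix -> list of block IDs whose full hash lives in the slow store
        self.ram_index = defaultdict(list)
        self.full_hash_store = full_hash_store

    def lookup_or_insert(self, block_id, data):
        full = hashlib.sha512(data).digest()  # full 512-bit hash
        prefix = full[:PREFIX_BYTES]          # only this part stays in RAM

        for candidate_id in self.ram_index[prefix]:
            # Prefix matched: pay the "disk" read to fetch the full hash and
            # decide whether this is a real duplicate or a false collision.
            if self.full_hash_store.get(candidate_id) == full:
                return candidate_id           # real duplicate found
        # No real duplicate: remember this block.
        self.ram_index[prefix].append(block_id)
        self.full_hash_store.put(block_id, full)
        return None


index = TruncatedDedupIndex(DictStore())
index.lookup_or_insert("blk1", b"hello world")          # None, new block
print(index.lookup_or_insert("blk2", b"hello world"))   # "blk1", duplicate
```

Obviously a dict standing in for the on-disk store hides all the real complexity, but it shows where the fallback read would happen.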

As the number of blocks grows, you could dynamically adjust the truncation of the hashes to fit in memory. This would increase the likelihood of a collision, but you would never give up any functionality or reliability, only performance, since you would just have to fall back to reading from the disk more often. The extreme case would be 1-bit hashes in RAM, where you “only” have to read 50% of the hashes from disk for every new block written. That’s silly, of course, so there would be some practical limit.
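Just to illustrate how the truncation width could be picked from a RAM budget (made-up helper, not anything ZFS actually does, and it ignores real index overhead):

```python
def prefix_bits_for_budget(num_blocks, ram_budget_bytes, min_bits=16, max_bits=64):
    """Pick how many bits of each hash to keep in RAM so the index fits the budget.

    Simplistic: assumes each entry costs exactly its prefix bits, nothing else.
    """
    bits_per_entry = ram_budget_bytes * 8 // max(num_blocks, 1)
    return max(min_bits, min(max_bits, bits_per_entry))


# ~8.6 billion blocks (1 PiB at 128 KiB) squeezed into a 16 GiB budget:
print(prefix_bits_for_budget(2**33, 16 * 2**30))  # -> 16 bits per entry
```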

Not sure if I messed up a calculation somewhere, but let’s say you stored extremely short 32-bit hashes in RAM, you use 128 KiB blocks, you already have 1 PiB of data, and you want to write 1 GiB of new data. That works out to roughly 16k false collisions on average, meaning you have to read that many 512-bit hashes from disk, or about 1 MiB of data (although randomly scattered). That’s not insignificant, but I’d guess tolerable for WORM data. With 64-bit hashes in RAM, you could write another whole PiB of data and only expect about 6 collisions.
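In case I did mess something up, here’s the back-of-envelope math spelled out (assuming uniformly distributed hashes, and counting the table as it grows while the new blocks are written):

```python
BLOCK = 128 * 1024  # 128 KiB blocks

def expected_false_collisions(existing_bytes, new_bytes, hash_bits):
    """Expected false matches against truncated hashes while writing new data.

    Closed form of summing (current table size) / 2**hash_bits over each new block.
    """
    existing = existing_bytes // BLOCK
    new = new_bytes // BLOCK
    return (existing * new + new * (new - 1) / 2) / 2**hash_bits


PiB, GiB = 2**50, 2**30

# 32-bit hashes in RAM, 1 PiB existing, 1 GiB of new writes:
print(expected_false_collisions(1 * PiB, 1 * GiB, 32))  # ~16384 -> ~1 MiB of 64-byte hashes to read
# 64-bit hashes in RAM, 1 PiB existing, another 1 PiB written:
print(expected_false_collisions(1 * PiB, 1 * PiB, 64))  # ~6
```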

I think your premise misses the point, but I could be wrong. Someone more versed in deduplication will correct me on this.

I don’t believe the digest size of the hash (256-bit, 512-bit) is the main constraint on deduplication, but rather the speed of the hash function (SHA256, SHA512, BLAKE3) and the total number of entries in the table.

Truncating the hash in RAM likely wouldn’t improve performance, since the hash for every newly written block must still be calculated and compared against the existing table.

In your proposal, extra time is wasted in those instances where there is a “fake” collision, since the full hash must be retrieved from storage.

Well, the point was that the likelihood of a collision on the truncated hash would be too high to be acceptable on its own, but low enough to cause practically no performance penalty. If you only have to read 6 full hashes from disk while writing another 1 PiB on top of an existing 1 PiB, I’d say that’s completely negligible.

But if the real bottleneck is indeed the calculation of the hashes, then this wouldn’t help, as you still have to do that calculation. I just thought this would solve one of the problems with very large dedup tables, which is why pruning and quotas were introduced at all, but maybe I just misunderstood how the whole thing works?

The hashes are more CPU intensive per block (SHA512 or SHA256, compared to the crazy fast Fletcher4)[1], but it’s not just that. The total number of entries to hold in the DDT, and the need to compare every new write against those entries, costs far more than the digest size of the hashes in the table.

Truncating the digests only adds extra steps without any real benefit, other than saving how much RAM? (You’re expected to have a lot of RAM in your system if you’re going to use deduplication anyway.)

As a matter of fact, having 125,000,000 (125 million) entries in the DDT would only save you 2 GB of RAM if you truncated 32-byte digests down to 16 bytes each.[2] Is it really worth it?
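For reference, that figure is just multiplying it out (decimal GB):

```python
entries = 125_000_000
saved_per_entry = 32 - 16                  # bytes saved by truncating 32-byte digests to 16 bytes
print(entries * saved_per_entry / 10**9)   # -> 2.0 GB of RAM saved
```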


  1. BLAKE3, introduced in OpenZFS 2.2, narrows this gap. ↩︎

  2. Adjust this calculation accordingly if you want to truncate even further. You would still need a ton of entries in the DDT to save, at most, a handful of GBs of RAM. ↩︎