I’m just going to throw out some thoughts about how to make dedup faster. This might be stupid, since I know I don’t understand everything about how dedup works. It might not work at all, or it might even be how dedup already works and I just understood it poorly.
After listening to a long talk about the new “fast dedup”, it seems a big problem is the amount of memory required to hold the hashes of all the blocks, and to sort and search through them. If I understood correctly, fast dedup essentially guesses which of those hashes can be thrown away and forgotten, because those blocks “probably” won’t get a duplicate later.
Something I thought of: why not use much shorter hashes in RAM? You could store the full 512-bit hash somewhere you have lots of space, like on disk, and keep only truncated versions in RAM. The shorter hashes are more likely to collide (still very unlikely, but no longer insignificant), and treating a collision as a “real” match would be bad. However, in the unlikely event of a truncated-hash collision, you just pay the performance penalty of reading the full hash from the slower disk and comparing it. The vast majority of the time, though, you would be working with a much smaller set of data in memory.
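To make that concrete, here’s a minimal Python sketch of what I mean. It’s my own toy model, not how ZFS actually lays out its dedup table; the names (`ram_index`, `full_hashes`) and the choice of SHA-512 are just placeholders.

```python
import hashlib

# Toy model of the idea: full 64-byte hashes live in slow, plentiful
# storage, while RAM only holds a short prefix that points back at them.

TRUNC_BYTES = 4  # 32-bit truncated hashes in RAM

full_hashes = {}   # block_id -> full 64-byte digest ("on disk")
ram_index = {}     # truncated digest -> list of block_ids ("in RAM")


def store_block(block_id: int, data: bytes) -> None:
    digest = hashlib.sha512(data).digest()
    full_hashes[block_id] = digest                    # full hash to slow storage
    ram_index.setdefault(digest[:TRUNC_BYTES], []).append(block_id)


def find_duplicate(data: bytes):
    """Return the id of an existing identical block, or None."""
    digest = hashlib.sha512(data).digest()
    # Cheap in-RAM check on the truncated hash first.
    for block_id in ram_index.get(digest[:TRUNC_BYTES], []):
        # Truncated match: pay the penalty of fetching the full hash
        # from "disk" to rule out a false positive.
        if full_hashes[block_id] == digest:
            return block_id
    return None
```

The point is that only the table of 4-byte prefixes has to stay resident; the 64-byte digests only get touched when a prefix actually matches.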
As the number of blocks gets bigger, you could dynamically adjust the truncation of the hashes to fit in memory. This would increase the likelihood of a collision, but you would never give up any functionality or reliability, only performance, as you’d just have to fall back to reading from disk more often. The extreme case would be 1-bit hashes in RAM, where every new block matches about half the existing entries, so you “only” have to read 50% of the stored hashes from disk for every new block written. That’s silly, though, so there would be some practical limit.
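As a rough way to picture that trade-off, here’s a small helper (again my own sketch with made-up names, assuming uniformly distributed hashes) for how many truncated bits a RAM budget buys you and how many false matches per write that costs:

```python
def bits_that_fit(ram_bytes: int, n_blocks: int, overhead_bytes: float = 0.0) -> int:
    """Rough guess at how many truncated-hash bits fit per entry for a given
    RAM budget and block count. Ignores real data-structure overhead unless
    you pass some in as overhead_bytes."""
    per_entry_bytes = ram_bytes / n_blocks - overhead_bytes
    return max(1, int(per_entry_bytes * 8))


def false_matches_per_write(n_blocks: int, bits: int) -> float:
    """Expected number of existing entries a brand-new (non-duplicate) block
    collides with, assuming uniformly distributed hashes."""
    return n_blocks / 2 ** bits
```

For example, `false_matches_per_write(2**33, 32)` comes out to 2.0 (about two disk reads per new block once you hold 2^33 entries), while at 1 bit it degenerates to reading half of all the stored hashes, which is the silly extreme above.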
Not sure if I messed up a calculation somewhere, but let’s say you stored extremely short 32-bit hashes in RAM. Say you use 128 KiB blocks, you already have 1 PiB of data (about 8.6 billion blocks), and you want to write 1 GiB of new data (8192 blocks). With 32-bit hashes, each new block will match roughly two existing entries by chance, so this will result in on average 16k collisions, meaning you have to read that many 512-bit hashes from disk, or about 1 MiB of data (although randomly scattered). That’s not insignificant, but I’d guess it’s tolerable for WORM data. With 64-bit hashes in RAM, you could write another PiB of data and only get around 6 collisions.
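Here’s the back-of-the-envelope arithmetic behind those numbers as a small Python snippet, assuming uniformly distributed hashes and 64-byte (512-bit) full hashes on disk, in case I did mess something up:

```python
# Back-of-the-envelope check of the numbers above (my own arithmetic,
# assuming uniformly distributed hashes and no actual duplicates).

BLOCK = 128 * 1024                      # 128 KiB blocks
existing_blocks = 2**50 // BLOCK        # 1 PiB already stored -> 2**33 blocks
new_blocks = 2**30 // BLOCK             # 1 GiB being written  -> 8192 blocks

# 32-bit truncated hashes: each new block matches ~existing/2**32 entries.
false_matches = new_blocks * existing_blocks / 2**32
print(false_matches)                    # 16384, i.e. ~16k extra lookups
print(false_matches * 64 / 2**20)       # 16k * 64-byte full hashes = 1 MiB read

# 64-bit truncated hashes while writing another full PiB:
new_blocks_pib = 2**50 // BLOCK
# The table grows from 2**33 to 2**34 entries during the write, so use the average.
avg_existing = (existing_blocks + (existing_blocks + new_blocks_pib)) / 2
print(new_blocks_pib * avg_existing / 2**64)   # ~6 collisions
```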