Fast Dedup uses... magic?

If you’re reading this thread because of my impeccable click-bait game… that just proves I’m good at it! Spare me your jealousy.


The iXsystems announcement page for Fast Dedup describes a new “feature” that supposedly makes the Fast Dedup Table (FDT) notably more efficient:

“Favor”?
“Potential”?

That wording implies some sort of automation or prediction, such as entirely skipping the step of writing a new block’s hash to the table.

But the only reference to pruning I could find is a new command that will be available in OpenZFS: zpool ddtprune

This implies a manually invoked command that simply removes “single-hit” hashes from the table.
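
For reference, and if I’m reading the zpool-ddtprune(8) manpage right, you prune either by age or by a percentage of the single-reference entries; something like this (tank is just a placeholder pool name):

    # prune single-reference DDT entries that are 90+ days old
    zpool ddtprune -d 90 tank

    # or prune 25% of the single-reference entries
    zpool ddtprune -p 25 tank

Check the manpage on your own build for the exact flags.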


Is there anything happening under the hood with Fast Dedup, where it will actually skip new blocks that don’t have “dedup potential”?


This raises another question:

What if someone “prunes” their Fast Dedup Table, removing all single-hit hashes, and those blocks later turn out to be “dedup-able” after all? “Too bad, so sad?” Will the existing blocks (whose hashes were previously removed from the table) forever consume extra space that could have been “zero” had they remained part of the dedup table?

1 Like

Yeah, your analysis matches my overall understanding. A lot rests on the hypothesis that dedupable data will be meaningfully correlated in time. For instance, it’s not likely that OS bits for various VMs will magically be the same after a year, but very likely that VMs deployed on the same week will have some dedup potential.

3 Likes

Long video, but worth a watch. Doesn’t explain everything, but demystifies some of the new features of fast dedup.

2 Likes

Is there a corresponding written article or blog post?

I watched the entire thing, and while I get confused by some of the technical details, it did clear up this question:

Apparently, it doesn’t simply remove “all” single-hit entries from the table. It only prunes single-hit entries that are older than 90 days. Allan Jude himself said that this is based on “intuition”, and there’s the possibility in the future to gather metrics with “ghost entries” to determine if this might result in too much pruning of blocks that would have been “de-dupable”.

So it’s a trade-off: Keep the table small, which allows more room in the ARC for actual data cache, and prioritize already “established” deduped blocks. (At the small risk that you might have lost some de-dupable blocks in the table.)
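
If you want to see what that trade-off actually looks like on your own pool, the dedup summary from zpool status shows the table’s on-disk and in-core footprint (pool name is a placeholder):

    # prints a "dedup: DDT entries N, size X on disk, Y in core" line plus a histogram
    zpool status -D tank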


There were some other questions raised after I watched the presentation, but I’ll save those for later.


Going to slip my personal opinion in here: I think dedup (and even fast dedup) requires a really specific use-case to justify its benefits over its costs. Seriously. Watch the video. :dizzy_face: Even fast dedup has to do A LOT, with added levels of complexity and performance hits, requiring more resources and RAM from your system…

…all to possibly save some storage space…

…and are those savings even that much greater than what inline compression already provides…

…even with the advent of block-cloning?

(Yes, yes, I know block-cloning is still disabled by default for precautionary reasons, but in principle it can save tons of space without requiring special vdevs or a massive dedup table, or any extra complexity. You just copy a file, anywhere in the pool, and you’re done.)
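
As a rough illustration of what I mean, assuming block-cloning is actually enabled on your system (on Linux builds where it’s off by default, there’s a zfs_bclone_enabled module parameter, if I recall correctly), and with a placeholder pool name:

    # copy with reflinks where supported; recent coreutils tries this by default
    cp --reflink=auto bigfile.iso bigfile-copy.iso

    # see how much space cloning has saved at the pool level
    zpool get bcloneused,bclonesaved,bcloneratio tank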

As for dedup and fast dedup, too many things need to line up in order for it to be justified. I would even guess that 95%+ of TrueNAS home users would only be harmed by using deduplication.

Yes, it’s nice that deduplication is getting an overhaul, but I don’t see the appeal or the excitement for it. Better handling of inline ZSTD compression and (safe) block-cloning are something to celebrate. Fast Dedup? I’d say “that’s neat” and never touch it.

1 Like

@etorix

Not from us. We will likely publish one in the lead-up to 24.10.

1 Like

It is a good watch, but yes, very complicated.

I look forward to getting some empirical data on fast dedup, I am optimistic that it will have more real world use cases than standard ZFS dedup, but do expect home lab use to be limited. I hope to be surprised though.

Imho fast dedup will make dedup way more accessible to home users, but we will see. SCALE has other issues to address right now.

OpenZFS’s videos are always a great learning experience, love them.

I don’t think home users should even play around with Fast Dedup. Even those who “think” they need it, probably don’t.

They would still need to be diligent and assess whether they even need deduplication in the first place.

This is in light of:

  • we already have inline compression (fast, can save space; see the snippet after this list)
  • we already (“not yet”) have block-cloning (no special setup needed, can save space)
  • additional, redundant vdev to hold FDT (fast dedup table)
  • additional RAM requirements for deduplication in general
  • and so on…
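
To illustrate the first point, inline compression is a one-liner per dataset, and you can check what it’s already saving you (pool and dataset names are placeholders):

    # enable zstd compression; only newly written blocks are affected
    zfs set compression=zstd tank/mydata

    # see the achieved ratio for the data already on disk
    zfs get compressratio tank/mydata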
1 Like

Looks like my warning Resource on not using De-Dup was not migrated over, so I can’t update it for Fast De-Dup. (And I am too lazy to re-write it here in the new forums…)

But I will put a link to it here for anyone who runs across this thread and thinks about using ZFS De-Dup:

2 Likes

Thanks for the heads-up. I’ll get those migrated over and credit accordingly.

2 Likes

From a use case standpoint, and as it relates to home users, I can see a lot of potential here. Particularly with virtual machines and applications, where we’ll probably get better ratios than with file shares.

Many (most?) home users are storing media files for later consumption (Plex). It would be entirely wasteful to run dedupe for that.

But if you can get even a 1.5x reduction on a small 1 TiB all-flash pool for your KVM homelab VMs? Heck yeah! Granted, there will be a trade-off in RAM available for VMs versus ARC.

2 Likes

My NAS carries mirrors for ALL of the Ubuntu repositories, ports, releases, old-releases, changelogs, and many others. There are dozens of terabytes of duplicated packages across these datasets. I also mirror CentOS, Fedora, Debian, Slackware, CPAN, Apache and dozens of others.

I recently upgraded to 25.04.2.3 so I could finally enable fast dedup on each dataset, and get some storage back.

Initially, I was using an externally attached secondary ZFS pool as a trampoline: creating a new dataset there, rsync’ing my existing dataset off of the main pool, destroying the original, re-creating it with dedup=sha256,verify, and then rsync’ing the data back.
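
In rough shell terms, the loop looked something like this (pool and dataset names are made up, and this is just a sketch of what I described, not a copy-paste recipe):

    # stage a copy on the external pool
    zfs create scratch/ubuntu-mirror.staging
    rsync -aHAX /mnt/tank/ubuntu-mirror/ /mnt/scratch/ubuntu-mirror.staging/

    # destroy and re-create the original dataset with dedup enabled
    zfs destroy -r tank/ubuntu-mirror
    zfs create -o dedup=sha256,verify tank/ubuntu-mirror

    # copy the data back so it gets written through the dedup path
    rsync -aHAX /mnt/scratch/ubuntu-mirror.staging/ /mnt/tank/ubuntu-mirror/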

At some point during the rsync, the whole system would lock up, which I found was due to oom-kill terminating init, systemd, zsh, nfsd, and other core processes that run services on the NAS, all at the same time.

This resulted in corrupting a dozen VM disks, my containers, other datasets, and more. I attributed this to the external enclosure causing a +5V PD bus reset, but since oom-kill was killing systemd and thus systemd-journald, there were no logs or forensics to prove this out, only my visual inspection of the console output, which scrolled by too fast, with hundreds of serialized kernel crash traces, for me to see which came first.

So I went back, repaired everything that arc_evict corrupted when it wedged the box, and did a MOUNTAIN of ARC/sysctl/other tuning to get it to a state where it’s no longer crashing, and the ARC is now behaving.

After about 300 crashes/hard lockups, after having had 0 in the last 3 years, the NAS has been up for 1 day, 19 hours, 57 minutes without a single crash.

But now I’m realizing I only have “dedup” enabled, not fast dedup, and nowhere can I find any instructions on how to enable fast dedup instead of plain dedup, including in the linked video.

Does anyone actually have it running, on a machine that is NOT crashing because arc_evict causes the storage to wedge into an uninterruptible state?

1 Like

Adding to my previous post: with dedup enabled on my datasets, I’m seeing roughly 14% savings, according to zdb. It’s definitely worth enabling if you know how to tune the TrueNAS ARC accordingly.
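
For anyone wanting to check their own numbers, this is the kind of query I mean (pool name is a placeholder):

    # summarizes the DDT and prints a combined "dedup * compress / copies" ratio
    zdb -D tank

    # add a second -D for the full reference-count histogram
    zdb -DD tank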

I don’t think it can replace an existing dedup table.

You can check if it’s enabled or active with zpool get feature@fast_dedup mypool

It can be enabled with zpool set feature@fast_dedup=enabled mypool or zpool upgrade mypool
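
Putting those together, a minimal check-then-enable sequence would look like this (mypool as above):

    # shows disabled / enabled / active
    zpool get feature@fast_dedup mypool

    # enable just this one feature...
    zpool set feature@fast_dedup=enabled mypool

    # ...or enable every feature your OpenZFS version supports
    zpool upgrade mypool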

:warning: Upgrading a pool or enabling a feature is a one-way process.

If the feature is already enabled, but not active, I would defer to others who are more versed on how to have it replace an existing dedup table.

1 Like

It wasn’t enabled (dedup was, not fast_dedup), and my DDT is consuming about 27GB, with 85 million entries.

I’ve enabled it, will meticulously recreate my datasets now that the feature flag is on, destroy the previous versions, and see whether I get any performance or capacity improvements.

Thank you!

1 Like

How much RAM does your server have?

The board is locked at 64GB, and there’s a total of 56TB of storage in the chassis. About half of that would be deduplicated. Sadly, I cannot change the board for one that takes more memory, or swap out the chassis for a larger version; it’s dedicated to a very compact space in a dedicated rack.

Looking around, I don’t see any Mini-ITX boards that support 128GB of RAM, 4+ SATA ports, M.2 connectors (for the Optane SLOG), and a Ryzen CPU.

There are lots of Intel i3/i5 and N100 boards in that form factor; some support 128GB but lack the SATA ports, or have the SATA ports but have underpowered CPUs.

I’ll keep looking.

If you’ll be starting over again with repopulated datasets, consider switching to BLAKE3 instead of SHA256 for the checksum. It’s faster and it’s cryptographically strong enough for the requirements of deduplication.

This can and should be chosen when creating a new dataset that will be using deduplication.
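
A sketch of what that would look like at dataset-creation time, assuming your OpenZFS version lists blake3 among the supported dedup checksums (pool and dataset names are placeholders):

    # create the new dataset with BLAKE3-based dedup (verify is optional but cautious)
    zfs create -o checksum=blake3 -o dedup=blake3,verify tank/mirrors

    # confirm what was actually set
    zfs get checksum,dedup tank/mirrors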