Viability of ZFS dedupe for archive data

I’m in the process of building a NAS that will be used to effectively archive a large amount of raw data. Performance is highly unlikely to ever be an issue as this archive may be accessed once or twice a year at most.

The plan is to build something with 180TB of usable storage. The files are already compressed, so I’m not expecting much saving from compression. Each file is somewhere between 1 and 1.5GB in size; I’m expecting to have tens of thousands of them by the end of this year.

I want (and need) to keep cost down so I’m thinking of utilizing ZFS dedupe on the volume, assuming the data can be deduped (something we are testing at the moment). The server is currently specced with 512GB RAM along with the 180TB of storage.

I’ve seen many posts in the past about this but wanted to ask about our specific use case. We’re a small business so budget is tight. Appreciate your thoughts on this, and what the best record size may be to make the most of dedupe.

I wouldn’t recommend dedup atm as it can still bite you in the butt and the savings vary massively. I would instead explore different compression options: create a few test datasets, copy over some sample data, and find the best fit for you. You may also want to do the same with record size, for example try a dataset with recordsize=1M and compare it to the default 128K. Your biggest cost savings will likely come from how you configure your system (vdev type and size) and the cost of your hardware, unless you are reusing existing hardware.
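To make that concrete, here’s a rough sketch (Python just shelling out to `zfs`) of lining up a few compression/recordsize combinations against the same sample data and comparing the resulting `compressratio`. The pool name, dataset names, and sample path are placeholders, and it assumes default mountpoints:

```python
#!/usr/bin/env python3
"""Sketch: create test datasets with different compression/recordsize settings,
copy the same sample data into each, and compare the achieved compressratio."""
import subprocess

POOL = "tank"                    # placeholder: your pool name
SAMPLE_SRC = "/path/to/sample/"  # placeholder: a directory of representative files

combos = [
    ("lz4", "128K"),    # baseline: lz4 at the default 128K recordsize
    ("zstd", "128K"),
    ("zstd", "1M"),
    ("zstd-9", "1M"),   # heavier zstd level, slower writes
]

for comp, recsize in combos:
    ds = f"{POOL}/ctest-{comp}-{recsize}".lower()
    subprocess.run(["zfs", "create",
                    "-o", f"compression={comp}",
                    "-o", f"recordsize={recsize}", ds], check=True)
    # copy the same sample data into each test dataset (default mountpoint assumed)
    subprocess.run(["rsync", "-a", SAMPLE_SRC, f"/{ds}/"], check=True)

subprocess.run(["sync"])
for comp, recsize in combos:
    ds = f"{POOL}/ctest-{comp}-{recsize}".lower()
    ratio = subprocess.run(["zfs", "get", "-H", "-o", "value", "compressratio", ds],
                           capture_output=True, text=True, check=True).stdout.strip()
    print(f"{ds}: compressratio={ratio}")
```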

1 Like

If it’s archived data, just dedupe it offline with whatever tool works best for you. No point in throwing that workload onto ZFS.
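If the duplicates are whole files, even something as simple as grouping by hash will find them; a minimal sketch (the archive path is a placeholder, and what you do with the duplicates, hard links or deletion, is up to you):

```python
#!/usr/bin/env python3
"""Minimal offline dedup check: group files by SHA-256 and report duplicate sets."""
import hashlib
import os
from collections import defaultdict

ARCHIVE_ROOT = "/mnt/archive"   # placeholder: wherever the archive lives

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

by_hash = defaultdict(list)
for dirpath, _dirs, files in os.walk(ARCHIVE_ROOT):
    for name in files:
        p = os.path.join(dirpath, name)
        by_hash[sha256_of(p)].append(p)

for digest, paths in by_hash.items():
    if len(paths) > 1:
        # duplicate set: keep one copy, hard-link or delete the rest as you see fit
        print(digest, *paths, sep="\n  ")
```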

1 Like

If the data is pre-compressed, there is a chance that the foreign compression program adds a date/time stamp or other information to the file. Thus, even two identical source files compressed at different times may not be able to use ZFS de-dup for all the blocks.
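A quick illustration of that point with gzip, which stores an MTIME field (and often the original filename) in its header:

```python
#!/usr/bin/env python3
"""Demo: gzip embeds a timestamp in its header, so compressing identical data
at different times does not produce identical bytes."""
import gzip

payload = b"identical source data" * 1000

a = gzip.compress(payload, mtime=1_600_000_000)  # "compressed" at one time
b = gzip.compress(payload, mtime=1_700_000_000)  # same data, different time

print(a == b)                                    # False: the archives differ
print(gzip.decompress(a) == gzip.decompress(b))  # True: the payloads are identical
```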

Further, ZFS has recently added ZSTD compression, which may mean that un-compressing the source files and letting ZFS compress and de-dup them ends up better. The files would then appear un-compressed to any network share access, even though they are compressed on disk.

Last, towards the end of the year a new OpenZFS feature, Fast De-Dup, may be released. It may or may not help your use case. Plus, it is often best to wait on major new features until others have proven them reliable. (I may be paranoid, but I tell you my storage is out to get me!)

2 Likes

Dedup trades memory and CPU to save hard disk space.

Memory is significantly more expensive than disk.

I’d say you need significant dedup savings to justify the extra equipment cost over just buying more disk.
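Some back-of-the-envelope arithmetic for the 180TB in question, using the commonly quoted ~320 bytes per dedup-table entry (a rule of thumb, not an exact figure):

```python
#!/usr/bin/env python3
"""Rough DDT RAM estimate for classic (pre-Fast-Dedup) ZFS dedup.
Assumes every block is unique (worst case for table size) and uses the
commonly quoted ~320 bytes per entry, which is an estimate, not a guarantee."""

BYTES_PER_DDT_ENTRY = 320          # rule-of-thumb estimate
POOL_DATA = 180e12                 # 180 TB of data, as in the original post

for recordsize in (128 * 1024, 1024 * 1024):   # 128K default vs 1M records
    blocks = POOL_DATA / recordsize
    ddt_ram_gb = blocks * BYTES_PER_DDT_ENTRY / 1e9
    print(f"recordsize {recordsize // 1024:>4} KiB -> "
          f"~{blocks / 1e6:,.0f}M blocks, ~{ddt_ram_gb:,.0f} GB of dedup table")
```

At the default 128K recordsize that estimate alone eats most of the 512GB of RAM; at 1M records it is far more comfortable, which is one reason record size matters so much for dedup.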

Worth reading the story of “that guy”

1 Like

Usually it’s cheaper to buy more HDD space than to run deduplication.

2 Likes