ZFS Overhead query for 1MB record sizes + 48x 4TB Layout recommendation?

Hi again folks

Perhaps my Google-fu is weak, or I’m simply not well versed enough in the terminology to find a suitable explanation… So please bear with me?

I created a temporary pool with 24x 4TB HDDs for testing purposes - 2x 12-wide RAID-Z2 VDEVs.

The original usable-space prediction was ~72TB at a 1MB record size, but the pool reports ~66TB in practice. A bit of fiddling with the online capacity calculator, plus some forum trawling, revealed that the deflation calculation is based on a 128KB record size - hence the ~8.7% ZFS overhead.

Now, even though I’ve configured the dataset to a 1MB record size to get closer to that 72TB, the total pool space is still reported as ~66TB.

More searching turned up this on GitHub:

 Compute the raidz-deflation ratio.  Note, we hard-code 128k (1 << 17)
 because it is the "typical" blocksize.  Even though SPA_MAXBLOCKSIZE
 changed, this algorithm can not change, otherwise it would inconsistently
 account for existing bp's.
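
If I’m reading that right, the deflation math works out roughly like the back-of-the-envelope sketch below - parity and padding only, assuming ashift=12 (4KiB sectors), and ignoring metadata, compression and slop space:

```python
import math

# Sectors allocated for one record on a RAID-Z vdev: data sectors, plus parity
# for each stripe row, padded up to a multiple of (parity + 1) sectors.
def raidz_alloc_sectors(record_bytes, width, parity, sector=4096):
    data = math.ceil(record_bytes / sector)
    rows = math.ceil(data / (width - parity))
    total = data + rows * parity
    return total + (-total) % (parity + 1)

alloc_128k = raidz_alloc_sectors(128 * 1024, width=12, parity=2)  # 42 sectors for 32 of data
alloc_1m   = raidz_alloc_sectors(1 << 20,   width=12, parity=2)   # 309 sectors for 256 of data
print(f"128K record: {32 / alloc_128k:.1%} space efficiency")     # ~76.2%
print(f"  1M record: {256 / alloc_1m:.1%} space efficiency")      # ~82.8%

# Reported pool size applies the hard-coded 128K ratio, regardless of recordsize:
raw_tib = 24 * 4e12 / 2**40                                        # 24x 4TB drives in TiB
print(f"128K-deflated capacity ~= {raw_tib * 32 / alloc_128k:.1f} TiB")  # ~66.5 TiB
print(f"  1M-deflated capacity ~= {raw_tib * 256 / alloc_1m:.1f} TiB")   # ~72.3 TiB
```

Which lines up pretty closely with the ~66TB I see reported and the ~72TB I was hoping for.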

So, then where and how does the real-world benefit of 1MB record sizing come about?

  1. Does it more efficiently fit “more” data into the reported space? i.e., supposing one now writes 24TB of large files with minimal padding (1/3 of 72TB at 1MB record size) to the pool, would that then reflect as only using ~22TB of the reported 66TB capacity? (Rough arithmetic sketched just after this list.)

  2. Or, does the total reported pool capacity dynamically adjust, based on the efficiency as you fill the pool?

  3. Ooooor… would filling the pool to 60TB (~91% of reported capacity) mean you’re ACTUALLY only filling it to around ~83% capacity?
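
To put rough numbers on scenario 1, continuing the sketch above - this assumes used space really is charged at the 128K-based deflation ratio, which is exactly the bit I’m unsure of:

```python
# Each 1 MiB record allocates 309 sectors on disk, but (if scenario 1 holds)
# gets *charged* at the 128K-based ratio, so it would show as < 1 MiB used.
deflate = 32 / alloc_128k                      # 32/42 ~= 0.762
charged = alloc_1m * 4096 * deflate / 2**20    # ~0.92 MiB charged per MiB written
print(f"24 TB of 1M-record files -> ~{24 * charged:.1f} TB used of the reported 66 TB")
```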

I just want to ensure that I’m not barking up the wrong tree here…

The setup is going to be 48x 4TB drives in a JBOD enclosure - primarily for cold backup of my media server at home - big files, new ones synced once every few months - nothing fancy… hardly anything gets deleted or altered, ever.

  • 4x 11-wide Z3 VDEVs report space the most accurately up-front (and have the added benefit of 8x data stripes), but it seems wasteful from both the parity & utilization angles, netting me 116TB from 44x drives and forcing 4x to sit idly by as spares… so I’m pondering alternatives (all three options are run through a quick comparison sketch after this list)…

  • 4x 12-wide Z2 VDEVs seem just about spot-on… 144TB on 1MB record size. Probably a tad risky that wide, but scrub/resilver times on 4TB drives aren’t exactly awful.

  • For that matter, even better - I could just run each 12-wide Z2 VDEV as an individual pool — 4x 36TB pools
    More admin & effort, yes… but I don’t need high I/O for media backups, scrubs/resilvers are quick, and if something ever goes wrong with the JBOD enclosure, I can simply throw 12x drives into an MD1200 shelf, and have immediate access to the data - as opposed to having to transfer all 48x drives over to shelves before the pool can be fired back up.

  • 3x 16-wide Z3 gets 141TB… statistically safer than 12-wide Z2 according to the calculators… and I could also run them as 3x 47TB pools.
    Buuuuuut, it’s too wide according to most, with odd data stripe sizes, and apparently prone to fragmentation & very long scrub times. Also, 16 drives don’t split neatly across 12-bay disk shelves.
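
For comparison, here are the three candidate layouts run through the same allocation sketch as above, at 1MiB record size (again ignoring slop space and metadata, so the numbers come out a touch optimistic):

```python
# Usable space per layout at 1 MiB recordsize, reusing raidz_alloc_sectors()
# from the earlier sketch. A "4 TB" drive is taken as 4e12 bytes ~= 3.638 TiB.
tib = 4e12 / 2**40
for label, vdevs, width, parity in [("4x 11-wide Z3", 4, 11, 3),
                                    ("4x 12-wide Z2", 4, 12, 2),
                                    ("3x 16-wide Z3", 3, 16, 3)]:
    alloc = raidz_alloc_sectors(1 << 20, width, parity)
    usable = vdevs * width * tib * 256 / alloc   # 256 data sectors per 1 MiB record
    print(f"{label}: ~{usable:.0f} TiB usable")  # ~116 / ~145 / ~141 TiB
```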

What say you, ye TrueNAS faithful?

Thanks in advance.

A 4TB drive (4x 10^12 bytes) is shown in TrueNAS as 3.638 TiB (3.638 * 2^40 bytes).

So 2x 12-wide RAIDZ2 vdevs give you effectively 20 drives’ worth of data, i.e. ~72.7TiB, which is exactly the prediction you mention.
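
Quick sanity check on that arithmetic:

```python
tib = 4e12 / 2**40        # a "4 TB" drive in binary TiB
print(round(tib, 3))      # 3.638
print(round(20 * tib, 2)) # 2 vdevs x 10 data drives each ~= 72.76 TiB
```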

The reason that smaller records have lower storage efficiency is that, for the same amount of data, you need more records, and those records need more metadata blocks.
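
For a rough feel of the metadata side - a sketch assuming 128-byte block pointers, and ignoring indirect-block levels and ditto copies:

```python
BP_SIZE = 128                              # bytes per on-disk block pointer
for rs_kib in (128, 1024):
    records = (1 << 40) // (rs_kib << 10)  # records per TiB of file data
    print(f"{rs_kib:>4}K records: {records:,} block pointers "
          f"~= {records * BP_SIZE / 2**20:.0f} MiB of pointer metadata per TiB")
```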

However, the reason for choosing a record size is not just storage efficiency - it is also I/O efficiency. A larger record size means data is more likely to be contiguous, reducing the seeks that slow things down, and data is read and written in larger, more efficient chunks.

Storage efficiency is also usually influenced more by the number of redundancy disks than by the record size.

Finally, since this is a backup server where the pool size is going to be fixed at 48 drives, and where resilver times may be less important, this might be a use case for wider-than-normal vdevs, compensating by using dRAID - so perhaps 23-wide dRAID3 pseudo-vdevs with 2x hot spares, or 2x 22-wide dRAID2 with 4x hot spares.

This would give you the same usable space as 4x 12-wide RAIDZ2, with the same redundancy AND 4x hot spares with faster resilvering.
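
Counting drives’ worth of data only (and reading the first option as two 23-wide vdevs), the usable space comes out the same across the board - a rough sketch that ignores dRAID’s fixed-stripe padding, which depends on the data-group width you choose:

```python
layouts = {
    "4x 12-wide RAIDZ2":            48 - 4 * 2,      # 8 parity drives, no spares
    "2x 23-wide dRAID3 + 2 spares": 48 - 2 - 2 * 3,  # 2 spares, 6 parity drives
    "2x 22-wide dRAID2 + 4 spares": 48 - 4 - 2 * 2,  # 4 spares, 4 parity drives
}
for name, data_drives in layouts.items():
    print(f"{name}: {data_drives} drives' worth of data")
```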


Faster resilvering but half the redundancy: each dRAID2 vdev has only two drives’ worth of redundancy in total, not two drives per redundancy group.
(@jro has written some resources on dRAID… which should convince the reader that this layout is best left to professional sysadmins.)


Not entirely what I was getting at.

As per:

2x 12-wide RAID-Z2 Pool

  1. @128K record size — 66.359TiB, 76.002% space efficiency, 8.797% ZFS overhead.

  2. @1M record size — 72.155TiB, 82.641% space efficiency, 0.831% ZFS overhead

I understand the mechanism and reasoning for 1M - hence this thread.

What puzzles me is that ZFS reports pool capacity based on the assumption of a 128K record size (width-related ZFS overheads / deflation calculations and all) - and it reports that figure regardless of the dataset record size being raised.

I’d like to know where the rubber meets the road - where and how does writing 1MB records yield the advantage in the pool’s reported used % / capacity figures?

With 1MiB record sizing, my pool should have a capacity of ~72.155TiB.

It reports 66.38TiB, based on the 128K assumption - a ratio of 91.996%… call it 8% smaller.

  • Does data written at a 1M record size reflect a lower Used capacity than 128K records would?
    i.e. - if I write exactly 10TiB of file data to the pool, will the GUI show it as using only ~9.2TiB of that 66.359TiB pool capacity?

  • …or, does the reported Total pool capacity figure increase at some point to reflect the higher efficiency / lower ZFS overhead of 1M records?

Unless I’m going about creating the pool incorrectly, and there’s some way of natively creating one whose reporting is based on a 1MB record size?

I have given dRAID a fair bit of consideration (and a lot of reading + toying around on calculators) - but regardless of which way you slice it, it seemingly has too many potential gotchas and idiosyncrasies to seriously consider at this point in time… Wide dRAID2 simply isn’t an option - the failure curve shoots up quickly.

dRAID3 could be an option - but if one wants to get silly about widths, 2x 24-wide Z3 could be done as well, with slightly more capacity, and it’s more reliable on paper to boot… :rofl:

Additionally, if the day comes that I can pick up some cheap 6/8/10TB drives to expand the backup pool, I’d have to buy an awful lot of them to upgrade a dRAID array…

I’ll see… I need to check further into fragmentation potential and the like - maybe I’ll slap together a dRAID3 pool and play with it for a bit. But, as per @etorix - and having read way more of @jro’s site than I’d originally planned - dRAID really doesn’t seem intended for the home gamer, and I’m not sure I’d want to guinea-pig my backup dataset on changing that stigma.