I found an interesting reddit discussion with a user who claims to successfully run a 90-wide RAIDZ3 on 22TB drives. He is using a special vdev with special_small_blocks=1M and a 16M recordsize for mostly large files. The impressive part: he is able to resilver in 2 days!
I assume his resilvers are so fast because the whole block tree is on the special vdev?
Either way, it is fascinating that it works as well as it does. I am not sure if that is because the pool is only 2 years old, or because he really found a good edge case where ultra-wide RAIDZ actually works.
Anyway, the alternative to that ultra-wide RAIDZ3 would be dRAID. He wants as much storage as possible, and he currently gets 87/90, which is an impressive 96.7%.
His reasoning for not using dRAID was that it is not as storage efficient because of the fixed stripe size. This is where I struggle. I don't get how dRAID makes use of a special vdev, and whether it really is as big of a deal as he thinks.
Assuming you want to store a 5GB movie on your 16M dataset, his argument was that this 5GB file consists of many 16M records. Maybe there is a last record, a tail, that does not use the whole 16M but only 4k. That last record would be padded with zeros, compressed down to 4k, and still use a whole dRAID stripe. It is not perfect for RAIDZ3 either, but for dRAID it is even worse. But since it is only the last record out of many, the loss here is not that big on dRAID or RAIDZ.
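To make sure I follow his argument, here is a small Python sketch of the space accounting I think he has in mind. The layout numbers are my assumptions from this thread (85 data + 3 parity + 2 distributed spares per 90-disk row for the dRAID, 87 data + 3 parity for his RAIDZ3, 4k sectors at ashift=12), not anything ZFS itself reports:

```python
import math

SECTOR = 4  # KiB per sector, assuming ashift=12

def draid_alloc_kib(data_kib, d=85, p=3, s=2, sector=SECTOR):
    """Raw KiB used by one block in the fixed-stripe dRAID model from this
    thread: every stripe is padded to the full data width, and the two
    distributed spares are counted as per-row overhead."""
    data_sectors = math.ceil(data_kib / sector)
    stripes = math.ceil(data_sectors / d)
    return stripes * (d + p + s) * sector

def raidz_alloc_kib(data_kib, width=90, p=3, sector=SECTOR):
    """Raw KiB used by one block on RAIDZ: data plus parity per row of
    (width - p) data sectors, padded to a multiple of p + 1 sectors."""
    data_sectors = math.ceil(data_kib / sector)
    parity_sectors = math.ceil(data_sectors / (width - p)) * p
    total_sectors = data_sectors + parity_sectors
    padded_sectors = math.ceil(total_sectors / (p + 1)) * (p + 1)
    return padded_sectors * sector

# The 4k tail record of the 5GB movie:
print(draid_alloc_kib(4))   # 360 KiB: a whole 340k stripe plus parity and spare
print(raidz_alloc_kib(4))   # 16 KiB: 1 data sector + 3 parity sectors
```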
What I struggle to understand is how a single 16M record behaves. Assuming we have a dRAID3:85d:2s:90c as a better alternative to his current RAIDZ3, the smallest stripe would be 85 * 4k = 340k.
A single 16M record would span 16 * 1024 / 340 ≈ 48.2, so 49 stripes.
The last stripe only holds 16 * 1024 - 48 * 340 = 64k of actual data.
So yes, to store a 16M record, you would use 49 * (340k data + 12k dRAID3 parity + 8k for the 2 spares) = 17640k,
instead of his RAIDZ3, where 16 * 1024 / 348 ≈ 47.1 gives 47 full stripes plus one partial stripe with 16 * 1024 - 47 * 348 = 28k of actual data. That would result in
47 * 360k + 40k (28k data + 12k parity) = 16960k.
So for every single 16M record, that alternative would use 17640k instead of 16960k, almost 4% more. Sure, the dRAID having two spares while his RAIDZ has none is also a factor there. But I kind of get his point about why dRAID is not the best fit for him if maximum storage is the highest priority.
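Plugging the full 16M record into the same sketch from above reproduces those numbers (same assumed layouts):

```python
# 16M record on the assumed dRAID3:85d:2s:90c vs the 90-wide RAIDZ3
print(draid_alloc_kib(16 * 1024))   # 17640 KiB: 49 stripes * 360k
print(raidz_alloc_kib(16 * 1024))   # 16960 KiB: 47 * 360k + one 40k partial stripe
print(draid_alloc_kib(16 * 1024) / raidz_alloc_kib(16 * 1024))  # ~1.04, i.e. ~4% more
```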
It would be way worse for a 1M record, if it weren't for the special vdev.
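If a 1M record did land on the dRAID instead of being caught by the special vdev, the same sketch gives (again, just my model of the accounting):

```python
# 1M record: the fixed-stripe padding hurts much more in relative terms
print(draid_alloc_kib(1024))   # 1440 KiB: 4 stripes, the last holding only 4k of data
print(raidz_alloc_kib(1024))   # 1072 KiB: so dRAID needs roughly 34% more raw space here
```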
So I was wondering how the special vdev decides what lands on it. Does it only look at the record size?
Or does it do some ZFS magic and realize that, for example, the last record is only 4k in actual size and because of that would be better suited to land on the svdev?