Ultra wide RAIDZ3 in combination with special vdev? Or dRAID?

I found an interesting Reddit discussion with a user who claims to successfully run a 90-wide RAIDZ3 with 22 TB drives. He is using a special vdev with 1M special_small_blocks and a 16M recordsize for mostly large files. The impressive part: he is able to resilver in 2 days!

I assume his resilvers are so fast because the whole block tree is on the special vdev?

Either way, it is fascinating that it works as well as it does. I am not sure if that is because the pool is only 2 years old, or because he really found a good edge case where ultra-wide RAIDZ does actually work.

Anyway, the alternative to that ultra-wide RAIDZ3 would be dRAID. He wants as much storage as possible, and he currently gets 87/90, which is an impressive 96.7%.

His reasoning for not using dRAID was that it is not as storage efficient because of the fixed stripe size. This is where I struggle: I don't get how dRAID makes use of a special vdev, or whether it really is as big of a deal as he thinks.

Assuming you want to store a 5GB movie on your 16M dataset, his argument was that this 5GB file consists of multiple 16M records. Maybe there is a last record, a tail, that does not use the whole 16M but only 4k. That last record would be filled with zeros, compressed down to 4k, and then use a whole dRAID stripe. It is also not perfect for RAIDZ3, but for dRAID it is even worse. But since it is only the last record out of many, the loss here is not that big on either dRAID or RAIDZ.
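
To put rough numbers on that tail argument, here is a quick back-of-the-envelope sketch (the file size is just an illustrative value, and it ignores compression and metadata):

```python
# Tail-record arithmetic for a ~5 GB file on a 16M recordsize dataset.
RECORD = 16 * 1024**2            # 16 MiB recordsize, in bytes
file_size = 5 * 1000**3          # a "5 GB" movie, in bytes (illustrative)

full_records = file_size // RECORD    # completely filled 16M records
tail = file_size % RECORD             # data left over for the last, partial record
print(full_records, "full records, tail holds", tail // 1024, "KiB")
# -> 298 full records, tail holds 380 KiB
# One partial record out of ~300 means the padding wasted on that tail is a
# fraction of a percent of the file, whichever layout it ends up on.
```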

What I struggle to understand is how a single 16M record behaves. Assuming we have a dRAID3:85d:2s:90c as a better alternative to his current RAIDZ3, the smallest stripe would be 85 * 4k = 340k.

A single 16M record would consist of 16 * 1024 / 340 ≈ 48.2, so 49 stripes.
The last stripe only holds 16 * 1024 - 340 * 48 = 64k of actual data.

So yes, to store a 16M record, you would use 49 * (340k data + 12k dRAID3 parity + 8k for the 2 spares) = 49 * 360k = 17640k

instead of RAIDZ3 with 16 * 1024 / 348 ≈ 47 full stripes plus one stripe with 16 * 1024 - 47 * 348 = 28k of actual data (28k + 12k parity = 40k raw). That would result in
47 * 360k + 40k = 16960k.

So for every single 16M record, that alternative would use 17640 / 16960 ≈ 1.04, i.e. almost 4% more. Sure, the dRAID having two spares while his RAIDZ has none is also a factor here. But I kinda get his point about why dRAID is not the best fit for him if maximum storage is the highest priority.
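
Putting that accounting into a small script (it mirrors the per-stripe math above, i.e. parity per stripe plus the distributed-spare share for dRAID, and ignores any extra allocation padding ZFS might add):

```python
import math

SECTOR = 4          # 4k sectors; all sizes below in KiB

def draid_raw(record, data=85, parity=3, spares=2):
    """Raw KiB to store one record on dRAID3:85d:2s:90c (fixed-width stripes)."""
    stripe_data = data * SECTOR                        # 340k of data per stripe
    stripe_raw = (data + parity + spares) * SECTOR     # 360k raw per stripe
    return math.ceil(record / stripe_data) * stripe_raw

def raidz3_raw(record, data=87, parity=3):
    """Raw KiB to store one record on a 90-wide RAIDZ3 (variable-width stripes)."""
    stripe_data = data * SECTOR                        # 348k of data per full stripe
    full, tail = divmod(record, stripe_data)           # 47 full stripes, 28k tail
    raw = full * (stripe_data + parity * SECTOR)       # 47 * 360k
    if tail:
        raw += tail + parity * SECTOR                  # 28k + 12k parity = 40k
    return raw

d, r = draid_raw(16 * 1024), raidz3_raw(16 * 1024)
print(d, r, f"{d / r - 1:.1%}")                        # 17640 16960 4.0%
```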

It would be way worse for a 1M record, if it weren't for the special vdev.
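
Running the same toy functions from above for a 1M record shows why:

```python
print(draid_raw(1024), raidz3_raw(1024))    # 1440 vs 1060 KiB raw
# Roughly 36% more raw space on the fixed-width dRAID stripes for a single 1M
# block, which is the kind of block his 1M special_small_blocks is there to catch.
```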

So I was wondering how the special vdev decides what lands on it. Does it only look at the record size?

Or does it do some ZFS magic and realize that, for example, the last stripe is only 4k in actual size and would therefore be better suited to land on the svdev?

90-wide??? And, I assume, no backup due to the sheer size. He really should be running dRAID3 rather than RAIDZ3 and allow for some spares.
Two days should be about the time needed to write the data onto the new 22 TB drive.
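
Rough numbers on that, assuming the resilver ends up limited by sequentially writing the replacement drive:

```python
# How fast would the replacement drive have to be written to finish in 2 days?
capacity = 22e12                  # 22 TB drive, in bytes
seconds = 2 * 24 * 3600           # two days
print(round(capacity / seconds / 1e6), "MB/s")    # ~127 MB/s
# That is within the sustained sequential write rate of a large modern HDD,
# so ~2 days is roughly the floor for rewriting a full 22 TB disk anyway.
```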

As for your questions about the special vdev, read here:

As I understand it, ZFS looks at every chunk and if the chunk is compressible enough or smaller than the small block cutoff, it goes in the sVDEV. So it’s not just tails of a file that may end up there, it can be guts too.
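
My mental model of that decision, as a toy sketch (the real logic lives in ZFS's allocation-class code; this only illustrates the per-block size check as I understand it, with the cutoff assumed to be inclusive):

```python
def lands_on_svdev(psize_kib, special_small_blocks_kib=1024):
    """Toy check: does a data block of this physical (post-compression) size
    get routed to the special allocation class?"""
    return psize_kib <= special_small_blocks_kib

print(lands_on_svdev(16 * 1024))   # False: a full 16M record stays on the RAIDZ
print(lands_on_svdev(512))         # True: a 512k (compressed) tail goes to the sVDEV
```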

With large video files, I imagine you may have the occasional tail due to the size of the small block cutoff, but little else. Those types of files are typically already compressed to within an inch of their life.

I am not familiar enough with ZFS metadata operations to know how the sVDEV helps re: resilver for ultra-wide pools. I reckon administering such a wide pool (even with large recordsizes) would usually generate a lot of metadata IOPS, which the sVDEV can process very quickly. In a regular HDD NAS, all that metadata would slow things down quite a bit, especially with a COW file system.

So the sVDEV may be covering for something the usual wisdom here would not recommend. This sounds like a pool for a single user with just media files, which is an interesting edge case. I imagine write performance isn't awesome, but if the primary goal is storage efficiency, then that's likely an acceptable tradeoff.


I wonder if the use case is write once, read mostly. If that were the case, then the files would not fragment the free space of the very wide RAID-Z3. Add in a Metadata special vDev, and the Metadata would also not fragment the free space of the very wide RAID-Z3 vDev.

The normal problem with very wide RAID-Zx vDevs, is that the free space becomes fragmented over time. Eventually leading to slower and slower read access.

It is not clear how a resilver could complete in a reasonable amount of time. The only way I can think that could happen is if the files were small enough to not use the full 90-disk width. For example, with a 16MB block size and files that use less than 1/2 of the stripe width, like 672MB after compression, the resilver would not need to read the entire stripe width (minus 2 of the parity).

If files routinely used the full stripe width (87 data columns), then all of those (or 86 + parity) would need to be read in order to create the replacement block for a resilver. I would think that would be extremely slow.


Of course, without complete details, the user could be using 3 RAID-Z3 vDevs of 30 disks each. Much more manageable with only 27 data columns.

Link to the reddit post if someone is interested.

As I understand it, he really is using a 90-wide RAIDZ3, mostly for movies, plus Chia.

I am still struggling with whether that part I wrote is true or not.

Because even ignoring that edge case, if that part were true, it could also have an impact on more “normal” setups.

If I have a dRAID2:20d:2s:24c, that would mean my smallest stripe is 80k of data.
I would add an svdev with 127k special_small_blocks.

So if I then set up a dataset with a 128k recordsize and write a file that is 128k (after compression), there would be an 88k-wide stripe storing the first 80k.

So then we are left with 48k of data that we still need to store. Would that be:

A: The leftover 48k lands on the svdev, because it is smaller than 127k

B: The leftover 48k will not land on the svdev, since ZFS looks at the record size and not at the data size per stripe. So these 48k will need another 80k data stripe that is 88k wide, and in total we need 176k of raw capacity to store 128k?

I can’t imagine B being true; that would make dRAID so hard to use for so many setups. Or am I just clueless?
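
To make the two options concrete in raw numbers (just the arithmetic, not an answer):

```python
import math

SECTOR, DATA, PARITY = 4, 20, 2            # dRAID2:20d:2s:24c, sizes in KiB
stripe_data = DATA * SECTOR                # 80k of data per stripe
stripe_raw = (DATA + PARITY) * SECTOR      # 88k written per stripe (spare share ignored)

record = 128
# Option A: one dRAID stripe plus the 48k leftover on the svdev
a_raw = stripe_raw + (record - stripe_data)              # 88 + 48 = 136k
# Option B: the whole 128k block stays on dRAID, padded to two full stripes
b_raw = math.ceil(record / stripe_data) * stripe_raw     # 2 * 88 = 176k
print(a_raw, b_raw)                                      # 136 176
```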

nobody?
Damn, I would love to have 90 drives to run some tests :grin:
It seems very hard to find real-world examples.

I only found this critical presentation, but unfortunately there is no follow-up.

Having 90 drives running as a single pool/node instead of using them with a distributed storage solution looks like nonsense to me. Or close to nonsense.


You could be right.

But remember when it used to be nonsense for 99% of use cases to use dedup?
And then the svdev changed the dedup game? Sure, it is still nonsense for 98%, but at least there are some new edge cases.

I wonder if the svdev could also change the game for wide RAIDZ.
Imagine something like a Netflix cache server or a Jellyfin server with 1000 users.
He might not get good IOPS for writes, but since he only writes large sequential files, that does not matter. Sequential read speeds are of course excellent.

Never read a bit about Netflix architecture, but I assume that their workload can be considered random reads. And I assume they are wealthy enough to use all NVMe. It may even be somehow related to the rise of enterprise read-intensive QLC drives :thinking:.

The same goes for a 1000-user Jellyfin – 1000 IOPS is a little bit too much for an HDD RAIDZ.

I don’t think it would be random reads (sure, it is random which 7GB file is read, but that is not the same as random 16k reads), and that is why I also don’t think it would be too much for 1000 users streaming. He is apparently getting 18 Gbit/s of sequential read out of that thing.
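
Quick sanity check on the bandwidth side (the per-stream bitrate is just an assumed ballpark, not something from his post):

```python
pool_gbit = 18        # reported sequential read throughput
stream_mbit = 10      # assumed average bitrate per viewer (ballpark for 1080p)
print(pool_gbit * 1000 // stream_mbit, "concurrent streams at that bitrate")   # 1800
# The open question is whether 1000 clients each reading big files at their own
# offsets still looks "sequential enough" to the HDDs, not the raw bandwidth.
```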

I am still wondering how the special vdev decides what lands on it. Does it just look at the size after compression? I think so.
And if yes, does it look at the size of the write itself (I think this is the case), or at how big the stripe for that data would be (a 100k file would still need a 348k stripe)?