I’ve been trying to understand dRAID performance in ZFS for ages, and I think I just cracked it.
For the longest time, the weirdest thing to me was that dRAID was stupidly slow with a single vdev even if it had a number of redundancy groups (what I call partition groups below).
A lot of what I’m going to say is speculation based on my experience.
tl;dr
You need multiple vdevs of dRAID to get good performance. Also, adding partition groups doesn’t scale performance the way adding vdevs does. I knew this already but didn’t understand why until now.
History
I’ve been a user of ZFS since 2008. I believe I first used it in OpenSolaris and later with FreeBSD/FreeNAS. It’s come a long way!
I’ve been running dRAID on 3 zpools for 3 years now. Two are 48 and 60 HDDs and one has 128 SSDs not counting metadata drives.
I always used metadata drives, but only recently did I rebuild my pools and switch to a 64K “Metadata (Special) Small Block Size” (the special_small_blocks property) now that I know how to get a block-size breakdown of my data: zdb -Lbbbs <POOL_NAME>.
If I sent blocks up to 128K to metadata instead, it’d go from 250GiB to 30TiB, and that’s not gonna fit on my metadata drives.
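For reference, this is roughly how I check the block-size breakdown and set the small-block cutoff. The pool and dataset names here are just placeholders, not my actual pools:

```
# Print block statistics for the pool, including a histogram by block size.
# -L skips leak detection so it finishes faster on a big pool.
zdb -Lbbbs tank

# Send blocks 64K and smaller to the special (metadata) vdev.
zfs set special_small_blocks=64K tank/data
```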
Blocks vs Records
The way I understand how ZFS splits up data is based on records and blocks, and this is necessary for understanding dRAID performance.
Record size is the biggest chunk ZFS will store for a file; any file larger than that gets broken up into record-sized chunks.
Block size comes from your ashift value: 9 is 512 bytes and 12 is 4K. It’s the smallest chunk your data can be when it gets written.
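As a quick illustration (pool and dataset names are placeholders), ashift is locked in when the vdev is created, while record size is a per-dataset property you can change any time:

```
# ashift=12 means 4K blocks; it cannot be changed after the vdev exists.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb

# recordsize caps how big a single chunk of a file can get.
zfs set recordsize=1M tank/data
zfs get recordsize tank/data
```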
dRAID minimum size
Most of us have 4K blocks. If you have a dRAID with 8d meaning 8 data drives in each partition group, then each write takes a minimum of 32KiB. If you have 16d, then that’s 64K.
This is where metadata drives come in because the minimum you can store in a mirror is the block size (4K). Since I’m storing 64K and below in my metadata mirrors, I can bypass any issues related to storing small files in dRAID.
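Here’s a minimal sketch of what that kind of layout looks like at pool creation. The draid spec, drive count, and device names are made-up examples, not my actual pools:

```
# draid2:8d:21c:1s = double parity, 8 data drives per group,
# 21 drives in the vdev, 1 distributed spare.
# 21 - 1 spare = 20 drives = two 8+2 groups, and with ashift=12
# the smallest full write is 8 x 4K = 32K.
zpool create -o ashift=12 tank \
    draid2:8d:21c:1s /dev/sd{a..u} \
    special mirror /dev/nvme0n1 /dev/nvme1n1

# Anything 64K and under goes to the special mirror instead of the dRAID vdev.
zfs set special_small_blocks=64K tank
```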
Compression plays a part here too:
- You cannot compress smaller than the block size.
- ZFS doesn’t combine partial blocks, so you’re still limited to the block size.
- ZFS also has no reason to compress below the dRAID minimum stripe (data drives × block size), although I’m uncertain whether it actually does or not.
Compression means you write less to disk, and the larger your records, the more you can compress. Still, 1M is the fastest record size I’ve seen, not 16M. Might have to do with my processor’s L3 cache size.
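If you want to see what compression is actually buying you at a given record size, the properties are easy to check (dataset name is a placeholder):

```
# Larger records give the compressor more data to work with.
zfs set compression=lz4 recordsize=1M tank/data

# compressratio reports the savings on data already written.
zfs get compression,recordsize,compressratio tank/data
```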
Writing data to dRAID vdevs
From what I understand, when ZFS writes data:
- If the file size is below the block size:
  - ZFS writes it to a single vdev.
  - That vdev writes it to a single drive.
  - With dRAID, all other drives in the stripe get 0s.
  - If you’re using special_small_blocks, it writes that file to the metadata vdev instead of the dRAID vdev.
- If the file size is below the record size:
  - ZFS sends that file to a single vdev.
  - The vdev chunks it up into blocks and writes those out to each drive.
  - In the case of dRAID, this might be split between multiple partition groups, and it’ll write to all of them at the same time, but not in the same way it parallelizes across vdevs.
  - Any remaining space is filled with 0s up to the partition group size.
- When the file size exceeds the record size:
  - ZFS splits the file up into record-size chunks.
  - Each vdev, in parallel, gets part of the file up to the record size.
  - Each vdev takes that record and chunks it further into blocks.
  - In the case of dRAID, this might be split between multiple partition groups, and it’ll write to all of them at the same time, but not in the same way it parallelizes across vdevs.
  - Any remaining space is filled with 0s up to the partition group size. It’s possible to have a weird-sized record that results in an uneven number of blocks.
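To put numbers on that model using my usual layout (8 data drives per partition group, 4K blocks), here’s the arithmetic. This is just my mental model, not something pulled out of the ZFS code:

```
# A 1 MiB record split across 8 data drives, 4K at a time:
echo $(( 1024 * 1024 / 8 / 4096 ))   # 32 blocks (128K) per data drive

# A 10K file on the same layout still takes a full minimum stripe:
echo $(( 8 * 4096 ))                 # 32768 bytes (32K), the rest padded with 0s
```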
vdev Parallelization
When a vdev’s done writing data, it asks ZFS for more data. So the more vdevs you have, the faster ZFS can go, since the pool isn’t limited to the speed of a single vdev (which is itself limited by its slowest drive).
If you have both a HDD and SSD vdev in the same zpool, it’s possible the SSD vdev will get written to more often because it’s faster than the HDD vdev. ZFS is intelligent about how it handles this, but faster vdevs get written to more thereby increasing the overall zpool speed.
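An easy way to watch how writes spread across vdevs (and the drives inside them) is per-vdev iostat; “tank” is a placeholder pool name:

```
# Show per-vdev and per-drive bandwidth every second during a big write.
zpool iostat -v tank 1
```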
vdev Limitations
I once had 40 x 2-drive mirrors of SSDs, and performance was abysmal. I thought it was something in my hardware, but after converting to dRAID, I was getting insane numbers like 17GB/s, up from barely 2GB/s.
In my experience, ZFS has a problem scaling up vdevs, and most of your drives will sit idle if you have too many. 8 vdevs was still fast for me in my testing, but there’s definitely a point where it can’t parallelize fast enough to saturate all your drives.
But having more vdevs is faster in general even with dRAID.
Partition group parallelization
This is a problem with dRAID because even though you have a bunch of partition groups, ZFS doesn’t write to them in parallel in the same way it does with vdevs.
Tons of Speculation
This section has a ton of my own speculations from experience.
If you have a 1M record size and get 1M of data thrown into a dRAID vdev, it can split that among all partition groups multiple times, but only for a given slice.
If your dRAID vdev is evenly spaced, you need to wait for all partition groups to finish before writing again.
This is different from how vdevs work: once a vdev is done, it’s ready for the next record, whereas dRAID has to write records in slices and won’t start the next slice until the current one is complete.
Ideally, it’d calculate how much it needs to write and parallelize it until that block is written, but in my experience that’s not the case, which leads to slowdowns.
I believe that if your dRAID slices are unevenly spaced, you can keep writing until the “slice” has evened out again; therefore, it allows for far more parallelization.
Like with vdevs, there’s a maximum number of partition groups you can usefully write to at once because each vdev only gets up to a record’s worth of data at a time.
But let’s say ZFS is smart and runs dRAID operations in parallel. Provided you have at least 2 non-overlapping partition groups, it could start on the next slice in one group while the other is still finishing, which would line up with why 2 groups has worked so well for me.
Also, the smaller your partition groups, the fewer drives are included in each write operation, so you’re less likely to be limited by the slowest drive because it only sits in one of many partition groups.
Conclusion
I know for sure that adding more dRAID partition groups doesn’t act like adding more vdevs. The performance is completely different from having multiple dRAID vdevs.
I’m not 100% sure exactly why, but I think it has to do with how a vdev doesn’t ask for another record until it’s done writing the current one. In that way, the dRAID vdev can’t queue up multiple blocks to keep writes flowing, and writing across its partition groups doesn’t parallelize the way writing across vdevs does.
In general, I’ve found 2 partition groups to be the sweet spot in terms of performance.
Numbers
Write speed = slowest drive speed * vdevs * partition groups
In the past, I had 2 partition groups and 8 vdevs (16). At ~500MB/s write speed of a SATA III SSD, that’s 8GB/s.
I was getting 6-7GB/s write speeds over SMB. Makes sense it’s less because ZFS incurs overhead, and I suspect dRAID does as well.
With my new setup, it’s 3 partition groups and 4 vdevs (12), so it should be 6GB/s which is gonna end up being 4-5GB/s in real life. That lines up with the file-copy test I did which was ~4GB/s using rclone in the CLI.
Then again, I have a HDD zpool with 2 partition groups and 2 vdevs writing at 2GB/s with a zfs send operation. Each HDD does something like 250MB/s max, so by that formula it should max out at around 1GB/s, yet it’s way faster.
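Plugging my pools into that formula, with rough per-drive speeds:

```
# slowest drive speed (MB/s) * vdevs * partition groups
echo $(( 500 * 8 * 2 ))   # old SSD pool: 8000 MB/s theoretical
echo $(( 500 * 4 * 3 ))   # new SSD pool: 6000 MB/s theoretical
echo $(( 250 * 2 * 2 ))   # HDD pool:     1000 MB/s theoretical
```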
Note
I’m talking about write speed. Read speed can parallelize better to the point where I can completely saturate my SAS controllers.