The problem with RAIDZ

iX_Resources · April 16, 2024, 2:21pm

This resource was originally created by user: Jamberry on the TrueNAS Community Forums Archive. Please DM this account or comment in this thread to claim it.

The problem with RAIDZ or why you probably won’t get the storage efficiency you think you will get.
Work in progress, probably contains errors!
Better formatted github version.
As a ZFS rookie, I struggled a fair bit to find out what settings I should use for my Proxmox hypervisor. Hopefully, this could help other ZFS rookies.
This text focuses on Proxmox, but it generally applies to all ZFS systems.
The whole text and tables assume ashift = 12 or 4k, because that is the default for modern drives.

TLDR

RAIDZ is only great for sequential reads and writes of big files. An example of that would be a fileserver that mostly hosts big files.
For VMs or iSCSI, RAIDZ will not get you the storage efficiency you think you will get, and it will also perform badly. Use mirrors instead. This is a pretty long text, but you can jump to the conclusion and the efficiency tables at the end.

Introduction and glossary

Before we start, some ZFS glossary. These are important to understand the examples later on.

sector size:

Older HDDs used a sector size of 512 bytes, while newer HDDs have 4k sectors. SSDs can have even bigger sectors, but their firmware controllers are mostly tuned for 4k sectors. There are still enterprise HDDs sold as 512e, where the “e” stands for emulation. These are 4k drives that only emulate 512-byte sectors. For this whole text, I assume that we have drives with 4k sectors.

ashift:

ashift sets the sector size ZFS should use. It is the exponent of a power of 2, so setting ashift=12 results in 2^12 = 4096 bytes, or 4k. ashift should match your drive’s physical sector size. This will very likely be 12, and it is usually detected automatically.
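
As a quick illustration of the ashift arithmetic, here is a minimal Python sketch. The helper name sector_size is made up for this example and is not a ZFS command or API.

```python
def sector_size(ashift: int) -> int:
    """Sector size in bytes for a given ashift (ashift is the base-2 exponent)."""
    return 1 << ashift  # same as 2 ** ashift


for ashift in (9, 12, 13):
    print(f"ashift={ashift} -> {sector_size(ashift)} bytes")
```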

dataset:

A dataset is inside a pool and is like a file system. There can be multiple datasets in the same pool, and each dataset has its own settings like compression, dedup, quota, and many more. They also can have child datasets that by default inherit the parent’s settings. Datasets are useful to create a network share or create a mount point for local files. In Proxmox, datasets are mostly used locally for ISO images, container templates, and VZdump backup files.

zvol:

zvols or ZFS volumes are also inside a pool. Rather than mounting as a file system, a zvol exposes a block device under /dev/zvol/poolname/dataset. This block device can back the disks of virtual machines or be made available to other network hosts using iSCSI. In Proxmox, zvols are mostly used for disk images and containers.

recordsize:

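Recordsize applies to datasets. By default, ZFS datasets use a recordsize of 128k. It can be set from 512 bytes to 16MB (1MB before OpenZFS v2.2). The recordsize is an upper limit: a file smaller than the recordsize is stored in a single, smaller block.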

volblocksize:

Zvols have a volblocksize property that is analogous to recordsize.
Since OpenZFS v2.2 the default value is 16k. It used to be 8k.

padding:

ZFS allocates space on RAIDZ vdevs in whole multiples of p+1 sectors to prevent unusably small gaps on the disk. p is the number of parity sectors per stripe, so the multiple is 1+1=2 for RAIDZ1, 2+1=3 for RAIDZ2, and 3+1=4 for RAIDZ3.
To avoid these gaps, ZFS pads every allocation up to the next multiple of this p+1 value. Padding sectors are not really written to disk; they are just left unused.
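
The padding rule can be written as a small rounding helper. This is only a sketch of the rule as described above; the function name pad_sectors is hypothetical.

```python
def pad_sectors(sectors: int, parity: int) -> int:
    """Round a sector count up to the next multiple of parity + 1."""
    unit = parity + 1
    return -(-sectors // unit) * unit  # integer ceiling division


# 11 sectors on RAIDZ1 need one padding sector, and so do 8 sectors on RAIDZ2.
print(pad_sectors(11, parity=1))
print(pad_sectors(8, parity=2))
```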

With that technical stuff out of the way, let’s look at real examples

Dataset

In Proxmox, datasets are only used for ISOs, container templates, and VZDump backups, so they are not really affected by the RAIDZ problem. You can skip this chapter, but maybe it helps with understanding.
Let’s look at an example of a dataset with the default recordsize of 128k and how that would work. We assume that we want to store a file 128k in size (after compression).

For a 3-disk wide RAIDZ1, the total stripe width is 3.

One stripe has 2 data sectors and 1 parity sector. Each is 4k in size.
So one stripe has 8k data sectors and a 4k parity sector.
To store a 128k file, we need 128k / 4k = 32 data sectors.
To store 32 data sectors, each stripe has 2 data sectors, so we need 16 stripes in total.
Or you could also say, that to store a 128k file, we need 128k / 8k data sectors = 16 stripes.
Each of these stripes consists of two 4k data sectors and a 4k parity sector.
In total, we store 128k data sectors (16 stripes * 8k data sectors) and 64k parity sectors (16 stripes * 4k parity sectors).
Data sectors + parity sectors = total sectors
128k + 64k = 192k.
That means we write 192k to store 128k data.
192k is 48 sectors (192 / 4).
48 sectors is a multiple of 2, so there is no padding needed.
128k / 192k = 66.66% storage efficiency.

This is a best-case scenario. Just like one would expect from a 3-wide RAID5 or RAIDZ1, you “lose” a third of storage.

Now, what happens if the file is smaller than the recordsize of 128k? Say, a 20k file?

We do the same steps for our 20k file.
To store 20k, we need 20k / 4k = 5 data sectors.
To store 5 data sectors, each stripe has 2 data sectors, so we need 2.5 stripes in total.
Half-data stripes are impossible. That is why we need 3 stripes.
The first stripe has 8k data sectors and a 4k parity sector.
Same for the second stripe, 8k data sectors and a 4k parity sector.
The third stripe is special.
We already saved 16k of data in the first two stripes, so we only need to save another 4k.
That is why the third stripe has a 4k data sector and a 4k parity sector.
In total, we store 20k data sectors (2 times 8k from the first two stripes and 4k from the third stripe) and 12k parity sectors (3 stripes with 4k).
20k + 12k = 32k.
That means we write 32k to store 20k data.
32k is 8 sectors (32 / 4).
8 sectors is a multiple of 2, so there is no padding needed.

The efficiency has changed. If we calculate all together, we wrote 20k data sectors, 12k parity sectors.
We wrote 32k to store a 20k file.
20k / 32k = 62.5% storage efficiency.
This is not what you intuitively would expect. We thought we would get 66.66%!

We do the same steps for a 28k file.
To store 28k, we need 28k / 4k = 7 data sectors.
To store 7 data sectors, each stripe has 2 data sectors, so we need 3.5 stripes in total.
Half-data stripes are impossible. That is why we need 4 stripes.
The first three stripes have 8k data sectors and a 4k parity sector.
The fourth stripe is special.
We already saved 24k of data in the first three stripes, so we only need to save another 4k.
That is why the fourth stripe has a 4k data sector and a 4k parity sector.
In total, we store 28k data sectors (3 times 8k from the first three stripes and 4k from the fourth stripe) and 16k parity sectors (4 stripes with 4k).
28k + 16k = 44k.
That means we write 44k to store 28k data.
44k is 11 sectors (44 / 4).
11 sectors is not a multiple of 2, so there is padding needed.
We need an extra 4k padding sector to get 12 sectors in total.

The efficiency has changed again. If we calculate all together, we wrote 28k data sectors, 16k parity sectors, and one 4k padding sector.
We wrote 48k to store a 28k file.
28k / 48k = 58.33% storage efficiency.
This is not what you intuitively would expect. We thought we would get 66.66%!

What happens if we wanna save a 4k file?

We calculate the same thing for a 4k file.
We simply store a 4k data sector on one disk and one parity sector on another disk. In total, we wrote a 4k data sector and a 4k parity sector.
We wrote 8k in sectors to store a 4k file.
4k / 8k = 50% storage efficiency.

This is the same storage efficiency we would expect from a mirror!

Conclusion for datasets:
If you have a 3-wide RAIDZ1 and only write huge files like pictures, movies, and songs, the efficiency loss becomes negligible. For 4k files, RAIDZ1 only offers the same storage efficiency as a mirror.
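
The four dataset examples can be reproduced with a short Python sketch. It assumes 4k sectors (ashift=12) and follows the counting used in this text: data sectors, one set of parity sectors per stripe, then padding up to a multiple of p+1. The names raidz_allocation and Allocation are hypothetical helpers for illustration, not part of any ZFS tool.

```python
from typing import NamedTuple

SECTOR = 4096  # assumption: ashift=12


class Allocation(NamedTuple):
    data: int     # data sectors
    parity: int   # parity sectors
    padding: int  # padding sectors

    @property
    def efficiency(self) -> float:
        return self.data / (self.data + self.parity + self.padding)


def raidz_allocation(block_bytes: int, width: int, parity: int) -> Allocation:
    data = -(-block_bytes // SECTOR)            # ceil: data sectors
    stripes = -(-data // (width - parity))      # ceil: stripes needed
    parity_sectors = stripes * parity
    unit = parity + 1
    total = data + parity_sectors
    padding = -(-total // unit) * unit - total  # fill up to a multiple of p+1
    return Allocation(data, parity_sectors, padding)


for kib in (128, 20, 28, 4):
    a = raidz_allocation(kib * 1024, width=3, parity=1)
    print(f"{kib}k: {a.data} data + {a.parity} parity + {a.padding} padding "
          f"-> {a.efficiency:.2%}")
```

Changing width and parity in the loop lets you try other layouts, for example width=6, parity=2 for a 6-wide RAIDZ2.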

ZVOL and volblocksize

For Proxmox we mostly don’t use datasets though. We use VMs with RAW disks that are stored on a Zvol.
For Zvols and their fixed volblocksize, it gets more complicated.

In the early days, the default volblocksize was 8k and it was recommended to turn off compression. Until very recently (2024) Proxmox used 8k with compression as default.
Nowadays, it is recommended to enable compression and the current default is 16k since OpenZFS v2.2. Some people in the forum even recommend going as high as 64k on SSDs.

In theory, you want to have writes that exactly match your volblocksize.
For MySQL or MariaDB, this would be 16k. But because you can’t predict compression, and compression works very well for stuff like MySQL, you can’t predict the size of the writes.
A larger volblocksize is good for mostly sequential workloads and can gain compression efficiency.
Smaller volblocksize is good for random workloads, has less IO amplification, and less fragmentation, but will use more metadata and have worse space efficiency.
We look at the different volblocksizes and how they behave on different pools.

volblocksize 16k

This is the default size for OpenZFS since v2.2.

RAIDZ1 with 3 drives

With 3 drives, we get a stripe that is 3 drives wide.
Each stripe has two 4k data sectors (8k) and one 4k parity sector.
For a volblock of 16k, we need two stripes, because one stripe stores 8k data and two stripes will store the needed 16k (16k/8k).
Each stripe has two 4k data sectors, two stripes are in total 16k.
Each stripe has one 4k parity sector, two stripes are in total 8k.
That gets us to 24k in total to store 16k.
24k is 6 sectors and that can be divided by 2 so there is no padding needed.
Storage efficiency is 66.66%, as expected.

RAIDZ1 with 4 drives

With 4 drives, we get a stripe 4 drives wide.
Each stripe has three 4k data sectors (12k) and one 4k parity sector.
For a volblock of 16k, we need 1.33 stripes (16k/12k).
The first stripe has three 4k data sectors, in total 12k.
The first stripe also has one 4k sector for parity.
The second stripe has one 4k data sector.
The second stripe also has one 4k sector for parity.
In total, we have four 4k data sectors and two 4k parity sectors.
That gets us to 24k in total to store 16k.
24k is 6 sectors and that can be divided by 2 so there is no padding needed.
We expected a storage efficiency of 75%, but only got 66.66%!

RAIDZ1 with 5 drives

With 5 drives, we get a stripe 5 drives wide.
Each stripe has four 4k data sectors (16k) and one 4k parity sector.
For a volblock of 16k, we need 1 stripe (16k/16k).
In total, we have four 4k data sectors and one 4k parity sector.
That gets us to 20k in total to store 16k.
20k is 5 sectors and that can’t be divided by 2 so there is an additional padding sector needed.
That gets us to 24k in total to store 16k.
We expected a storage efficiency of 80%, but only got 66.66%!

RAIDZ1 with 10 drives

With 10 drives, we get a stripe 10 drives wide.
That 10 drives wide stripe in theory would get us 9 data sectors and one parity sector.
A single stripe could thus hold 9 * 4k = 36k.
But that is of no use to us, we only need 16k!
ZFS will shorten the stripes.
For a volblock of 16k, we need one stripe with 4 data sectors and one parity sector.
In total, we have four 4k data sectors and one 4k parity sector.
The stripe is only 5 drives wide.
That gets us to 20k in total to store 16k.
20k is 5 sectors and that can’t be divided by 2 so there is an additional padding sector needed.
That gets us to 24k in total to store 16k.
We expected a storage efficiency of 90%, but only got 66.66%!

Notice something? No matter how wide we make the RAIDZ1, there are no efficiency gains beyond 5 drives. A 16k block only fills four data sectors, so the stripe cannot get any wider, no matter how wide we make the RAIDZ1. Because of padding, even going from 4 to 5 drives wide does not help with storage efficiency.
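
To make the stripe shortening visible, this sketch lists how many data and parity sectors each stripe of a single block holds. It assumes 4k sectors and the simplified stripe model of this text; stripe_rows is a hypothetical helper name.

```python
SECTOR = 4096  # assumption: ashift=12


def stripe_rows(block_bytes: int, width: int, parity: int) -> list[tuple[int, int]]:
    """Return (data_sectors, parity_sectors) for each stripe of one block."""
    remaining = -(-block_bytes // SECTOR)  # data sectors still to place
    per_stripe = width - parity            # max data sectors per stripe
    rows = []
    while remaining > 0:
        used = min(per_stripe, remaining)  # the last stripe may be shorter
        rows.append((used, parity))
        remaining -= used
    return rows


for width in (3, 4, 5, 10):
    print(f"RAIDZ1, {width} drives, 16k block: {stripe_rows(16 * 1024, width, 1)}")
```

With 5 and with 10 drives the result is the same single stripe of 4 data sectors and 1 parity sector, which is why the extra drives do not help here.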

RAIDZ2 with 4 drives

With 4 drives, we get a stripe 4 drives wide.
Each stripe has two 4k data sectors and two 4k parity sectors.
For a volblock of 16k, we need two stripes (16k/8k = 2).
That gets us to 32k in total to store 16k.
32k is 8 sectors and that can’t be divided by 3 so there is padding needed.
We need another padding sector to get to 9 sectors total.
9 sectors can be divided by 3.
That gets us to 36k in total to store 16k.
We expected a storage efficiency of 50%, but only got 44.44%!
That is WORSE than mirror!

RAIDZ2 with 5 drives

With 5 drives, we get a stripe 5 drives wide.
Each stripe has three 4k data sectors and two 4k parity sectors.
For a volblock of 16k, we need 1.33 stripes (16k/12k).
The first stripe has three 4k data sectors, in total 12k.
The first stripe also has two 4k sectors for parity.
The second stripe has one 4k data sector.
The second stripe also has two 4k sectors for parity.
In total, we have four 4k data sectors and four 4k parity sectors.
That gets us to 32k in total to store 16k.
32k is 8 sectors and that can’t be divided by 3 so there is padding needed.
We need another padding sector to get to 9 sectors total.
9 sectors can be divided by 3.
That gets us to 36k in total to store 16k.
We expected a storage efficiency of 60%, but only got 44.44%!
That is WORSE than mirror!

RAIDZ2 with 6 drives

With 6 drives, we get a stripe 6 drives wide.
Each stripe has four 4k data sectors and two 4k parity sectors.
For a volblock of 16k, we need one stripe (16k/16k = 1).
That gets us to 24k in total to store 16k.
24k is 6 sectors and that can be divided by 3 so there is no padding needed.
We expected a storage efficiency of 66.66%, and got 66.66%!

RAIDZ2 with 10 drives

With 10 drives, we get a stripe 6 drives wide.
This is because we don’t need 10 drives to store 16k.
It also behaves exactly as with 6 drives.
Each stripe has four 4k data sectors and two 4k parity sectors.
For a volblock of 16k, we need one stripe (16k/16k = 1).
That gets us to 24k in total to store 16k.
We expected a storage efficiency of 80%, but only got 66.66%!
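
The RAIDZ2 results for a 16k volblocksize can be compared against the naive expectation and against a two-way mirror with the following sketch. It assumes 4k sectors; raidz_efficiency is a hypothetical helper that follows the counting used in this text.

```python
SECTOR = 4096  # assumption: ashift=12


def raidz_efficiency(block_bytes: int, width: int, parity: int) -> float:
    """Fraction of the allocated sectors that hold data for one block."""
    data = -(-block_bytes // SECTOR)
    stripes = -(-data // (width - parity))
    total = data + stripes * parity
    unit = parity + 1
    total = -(-total // unit) * unit  # padding to a multiple of p+1
    return data / total


MIRROR = 0.5  # a two-way mirror stores every block twice

for width in (4, 5, 6, 10):
    actual = raidz_efficiency(16 * 1024, width, parity=2)
    expected = (width - 2) / width
    verdict = "worse than mirror" if actual < MIRROR else "not worse than mirror"
    print(f"{width} drives: expected {expected:.2%}, actual {actual:.2%} ({verdict})")
```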

volblocksize 64k

Some users in the forums recommend 64k on SSDs.
I am no VM expert by any means, but there are still a lot of workloads that are smaller than 64k.
I would recommend using volblocksize 64k with caution. You could get huge read-write amplification and fragmentation.

RAIDZ1 with 3 drives

With 3 drives, we get a stripe 3 drives wide.
Each stripe has two 4k data sectors (8k) and one 4k parity sector.
For a volblock of 64k, we need eight stripes (64k/8k = 8).
Each stripe has two 4k data sectors, eight stripes are in total 64k.
Each stripe has one 4k parity sector, eight stripes are in total 32k.
That gets us to 96k in total to store 64k.
96k is 24 sectors and that can be divided by 2 so there is no padding needed.
Storage efficiency is 66.66%, as expected.

RAIDZ1 with 4 drives

With 4 drives, we get a stripe 4 drives wide.
But not all stripes have three 4k data sectors (12k) and one 4k parity sector.
For a volblock of 64k, we need 5.33 stripes (64k/12k).
Five stripes have three 4k data sectors, in total 60k.
Five stripes also have one 4k sector for parity, in total 20k.
The sixth stripe has one 4k data sector.
The sixth stripe also has one 4k sector for parity.
In total, we have sixteen 4k data sectors and six 4k parity sectors.
That gets us to 88k in total to store 64k.
88k is 22 sectors and that can be divided by 2 so there is no padding needed.
We expected a storage efficiency of 75%, but only got 72.72%!

RAIDZ1 with 5 drives

With 5 drives, we get a stripe 5 drives wide.
For a volblock of 64k, we need 4 stripes (64k/16k).
Each stripe has four 4k data sectors and one 4k parity sector.
In total, we have 16 4k data sectors and 4 4k parity sectors.
That gets us to 80k in total to store 64k.
80k is 20 sectors and that can be divided by 2 so there is no padding needed.
Storage efficiency is 80%, as expected.

RAIDZ1 with 6 drives

With 6 drives, we get a stripe 6 drives wide.
But not all stripes have five 4k data sectors (20k) and one 4k parity sector.
For a volblock of 64k, we need 3.2 stripes (64k/20k).
Three stripes have five 4k data sectors, in total 60k.
Three stripes also have one 4k sector for parity, in total 12k.
The fourth stripe has one 4k data sector.
The fourth stripe also has one 4k sector for parity.
In total, we have 16 4k data sectors and four 4k parity sectors.
That gets us to 80k in total to store 64k.
80k is 20 sectors and that can be divided by 2 so there is no padding needed.
We expected a storage efficiency of 83.33%, but only got 80%!
The same problem applies for 7 or 8 drives wide!

RAIDZ1 with 9 drives

With 9 drives, we get a stripe 9 drives wide.
For a volblock of 64k, we need two stripes (64k/32k).
Each stripe has eight 4k data sectors and one 4k parity sector.
In total, we have 16 4k data sectors and two 4k parity sectors.
That gets us to 72k in total to store 64k.
72k is 18 sectors and that can be divided by 2 so there is no padding needed.
Storage efficiency is 88.88%, as expected.

RAIDZ1 with 10 drives

With 10 drives, we get a stripe 10 drives wide.
But not all stripes have nine 4k data sectors (36k) and one 4k parity sector.
For a volblock of 64k, we need 1.77 stripes (64k/36k).
First stripe has nine 4k data sectors, in total 36k.
First stripe also has one 4k sector for parity.
The second stripe has seven 4k data sectors, in total 28k.
The second stripe also has one 4k sector for parity.
In total, we have 16 4k data sectors and two 4k parity sectors.
That gets us to 72k in total to store 64k.
72k is 18 sectors and that can be divided by 2 so there is no padding needed.
We expected a storage efficiency of 90%, but only got 88.88%!
The same problem applies for all RAIDZ1 that are wider than 10 drives!

RAIDZ2 with 6 drives

With 6 drives, we get a stripe 6 drives wide.
Each stripe has four 4k data sectors and two 4k parity sectors.
For a volblock of 64k, we need four stripes (64k/16k = 4).
That gets us to 96k in total to store 64k.
96k is 24 sectors and that can be divided by 3 so there is no padding needed.
Storage efficiency is 66.66%, as expected.

RAIDZ2 with 7 drives

With 7 drives, we get a stripe 7 drives wide.
But not all stripes have five 4k data sectors (20k) and two 4k parity sectors.
For a volblock of 64k, we need 3.2 stripes (64k/20k).
Three stripes have five 4k data sectors, in total 60k.
Three stripes also have two 4k sectors for parity, in total 24k.
The fourth stripe has one 4k data sector.
The fourth stripe also has two 4k sectors for parity.
In total, we have 16 4k data sectors and eight 4k parity sectors.
That gets us to 96k in total to store 64k.
96k is 24 sectors and that can be divided by 3 so there is no padding needed.
We expected a storage efficiency of 71.42%, but only got 66.66%!
Same is true for 8 or 9 wide RAIDZ2!

RAIDZ2 with 10 drives

With 10 drives, we get a stripe 10 drives wide.
Each stripe has eight 4k data sectors and two 4k parity sectors.
For a volblock of 64k, we need two stripes (64k/32k = 2).
That gets us to 80k in total to store 64k.
80k is 20 sectors and that can’t be divided by 3 so there is padding needed.
We add a padding sector to get 21 sectors in total.
That gets us to 84k in total to store 64k.
We expected a storage efficiency of 80%, but only got 76.19%!
The same problem applies to all RAIDZ2 pools between 10 and 17 drives wide!
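
The "same problem for N to M drives" statements can be checked by grouping neighbouring widths that allocate the same number of sectors for one 64k block. This is a sketch under the assumptions of this text (4k sectors, padding to a multiple of p+1); allocated_sectors and width_ranges are hypothetical helper names.

```python
from itertools import groupby

SECTOR = 4096  # assumption: ashift=12


def allocated_sectors(block_bytes, width, parity):
    """Data + parity + padding sectors for one block."""
    data = -(-block_bytes // SECTOR)
    stripes = -(-data // (width - parity))
    total = data + stripes * parity
    unit = parity + 1
    return -(-total // unit) * unit


def width_ranges(block_bytes, parity, max_width):
    """Group consecutive widths that allocate the same number of sectors."""
    widths = range(parity + 2, max_width + 1)
    ranges = []
    for sectors, group in groupby(
        widths, key=lambda w: allocated_sectors(block_bytes, w, parity)
    ):
        members = list(group)
        ranges.append((members[0], members[-1], sectors))
    return ranges


BLOCK = 64 * 1024
DATA = BLOCK // SECTOR  # 16 data sectors per 64k block

for parity in (1, 2):
    for first, last, sectors in width_ranges(BLOCK, parity, max_width=20):
        print(f"RAIDZ{parity}, {first}-{last} drives: {sectors} sectors, "
              f"{DATA / sectors:.2%} efficiency")
```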

Efficiency tables

Efficiency tables for different numbers of drives, with 16k or 64k volblocksize, and what efficiency you would naturally expect to get. When expectations match up, it is formatted in bold.

RAIDZ1

Code:

	3 drives	4 drives	5 drives	6 drives	7 drives	8 drives	9 drives	10 drives	11 drives	12 drives
16k	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%
64k	66.66%	72.72%	80%	80%	80%	80%	88.88%	88.88%	88.88%	88.88%
expected	66.66%	75%	80%	83.33%	85.71%	87.5%	88.88%	90%	90.90%	91.66%

RAIDZ2

Code:

	4 drives	5 drives	6 drives	7 drives	8 drives	9 drives	10 drives	11 drives	12 drives	13 drives	14 drives	15 drives	16 drives	18 drives
16k	44.44%	44.44%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%	66.66%
64k	48.48%	53.33%	66.66%	66.66%	66.66%	66.66%	76.19%	76.19%	76.19%	76.19%	76.19%	76.19%	76.19%	88.88%
expected	50%	60%	66.66%	71.42%	75%	77.77%	80%	81.81%	83.33%	84.61%	85.71%	86.66%	87.5%	88.88%
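
The tables can be regenerated with a short script. This is a sketch that assumes 4k sectors and the allocation rule described in this text; all function names are hypothetical. Percentages are truncated to two decimals, like the values in the tables.

```python
SECTOR = 4096  # assumption: ashift=12


def truncated_pct(numerator: int, denominator: int) -> str:
    """Format numerator/denominator as a percentage, truncated to 2 decimals."""
    hundredths = 10000 * numerator // denominator
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


def actual_pct(block_bytes: int, width: int, parity: int) -> str:
    data = -(-block_bytes // SECTOR)
    stripes = -(-data // (width - parity))
    total = data + stripes * parity
    unit = parity + 1
    total = -(-total // unit) * unit  # padding to a multiple of p+1
    return truncated_pct(data, total)


def expected_pct(width: int, parity: int) -> str:
    return truncated_pct(width - parity, width)


def print_table(parity: int, widths) -> None:
    widths = list(widths)
    print("\t" + "\t".join(f"{w} drives" for w in widths))
    for label, kib in (("16k", 16), ("64k", 64)):
        cells = (actual_pct(kib * 1024, w, parity) for w in widths)
        print(label + "\t" + "\t".join(cells))
    print("expected\t" + "\t".join(expected_pct(w, parity) for w in widths))


print_table(1, range(3, 13))                # RAIDZ1
print()
print_table(2, list(range(4, 17)) + [18])   # RAIDZ2
```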

Conclusion

RAIDZ is different from traditional RAID and often has worse storage efficiency than expected.
Bigger volblocksizes offer better space efficiency and compression gains,
but will also suffer from read and write amplification and create more fragmentation.
Also, keep in mind that all these variants will only write as fast as the slowest disk in the pool.
Mirrors have worse storage efficiency, but a pool of striped mirrors will offer twice the write performance with 4 drives (two mirror vdevs) and 4 times the write performance with 8 drives (four mirror vdevs) compared to a RAIDZ pool.
Use mirrors for zvols and RAIDZ for huge, sequential files.

Sara · April 16, 2024, 2:33pm

Huh, did not know that you will migrate this stuff
Anyway, for an updated version with better formatting, here is the Github Link.

For some reason I don’t understand, Github markdown still looks wrong on discourse.

etorix · April 16, 2024, 3:06pm

Somewhat continuing on the topic of raidz and its hidden drawbacks, this article is well worth a read:
https://louwrentius.com/the-hidden-cost-of-using-zfs-for-your-home-nas.html

nas · May 16, 2024, 4:56am

Expert feedback sought for creating accurate diagrams…

I am trying to visualize the impact of different record size settings and physical HDD sector sizes and I want to make it really obvious what the difference is and how it affects performance.

I have this snippet of a diagram I am creating:

This shows an example of where the dataset/zvol recordsize is smaller than the disk sectorsize.

And here is the opposite view:

I have shown 3 layers in the above diagrams, but I want to make sure I have accounted for all layers of abstraction between the physical disk platter and the filesystem (the files as seen by the user, that is), so the labels on the left probably need tweaking and I may need to add more layers to show the complete picture.

Therefore, can I ask for feedback to help me get this right?

Other questions I have

When I run geom disk list I see each disk has both a Sectorsize and a Stripesize - what is the difference?
Which one represents that real physical sector size (low level format) on the disk platter?
Can I change the stripe size on a disk to match the sector size or vice versa?
For what reason(s) would we ever want to leave them as different sizes? (IOW, why would we not reformat them to be the same size, e.g. both at 4KB?)
Should I perform a low-level format of each physical disk to force both 4KB sector and stripe sizes?

I found this video which seems to do a good job at describing the problems with sector and filesystem cluster misalignment, but I think it also illustrates some of the issues with dissimilar physical sector and logical cluster (ZFS pool record?) sizes:

Also, from FreeBSD Mastery: Advanced ZFS, page 178:

The most important tuning you can perform for a database is the dataset block size, through the recordsize property. The ZFS recordsize for any file that might be overwritten needs to match the block size used by the application. Tuning the block size also avoids write amplification. Write amplification happens when changing a small amount of data requires writing a large amount of data. Suppose you must change 8 KB in the middle of a 128 KB block. ZFS must read the 128 KB, modify 8 KB somewhere in it, calculate a new checksum, and write the new 128 KB block. ZFS is a copy-on-write filesystem, so it would wind up writing a whole new 128 KB block just to change that 8 KB. You don’t want that. Now multiply this by the number of writes your database makes. Write amplification eviscerates performance. While this sort of optimization isn’t necessary for many of us, for a high-performance system it might be invaluable. It can also affect the life of SSDs and other flash-based storage that can handle a limited volume of writes over their lifetime… Before creating a dataset with a small recordsize , be sure you understand the interaction between VDEV type and space utilization. In some situations, disks with the smaller 512-byte sector size can provide better storage efficiency…you may be better off with a separate pool specifically for your database, with the main pool for your other files.

Emphases mine. The above quote is in the context of databases, but the principle applies to other files too, would it not? (Hence my other post asking what the best settings are for specifically photos, or what the best settings are for video files.)
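
The 8 KB in 128 KB example from the quote works out to a factor of 16. A small sketch of that arithmetic, assuming the change is aligned to record boundaries and ignoring metadata, compression, and parity (write_amplification is a hypothetical helper name):

```python
def write_amplification(change_bytes: int, recordsize_bytes: int) -> float:
    """Bytes rewritten per byte changed, for a change aligned to record boundaries."""
    records = -(-change_bytes // recordsize_bytes)  # records touched (ceil)
    return records * recordsize_bytes / change_bytes


KIB = 1024
print(write_amplification(8 * KIB, 128 * KIB))  # 8 KB change in a 128 KB record
print(write_amplification(8 * KIB, 8 * KIB))    # recordsize matched to the application
```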

Sara · May 20, 2024, 6:17pm

Not sure if I understand everything, and no expert on everything, so take my post with a huge grain of salt. But I think you have some basic misunderstandings. I don’t know where, so I hope it becomes more clear after my answers.

I don’t understand your visuals.

That would be a total no go!

One is the sectorsize of the disk, and the other is the stripesize of the stripe.

sectorsize

They are not related the way you think they are.
I want 4k as sectorsize, because that is what my drive uses internally.
I want a stripesize of 128k, because I store only large files. Or I want 16k because I have many small reads and writes.

If the drive is not very old or some special 512e drive, it will 99% of the time be 4k. So you probably should use 4k. And you should not need to force this either, because the OS should detect it.

No longer relevant, unless you are running some very old software.

Not sure what your question is but yes, if you use block storage, it works best if the workload matches your volblocksize.

Too small and you create some overhead that has a small performance impact and you get worse compression.

Too big and you get all the problems described in your quoted text.

winnielinnie · May 20, 2024, 6:35pm

You can start with this post, to remove some of the confusions.

There are no “abstraction layers”, per se.

A file is data of a specific size, with said bytes in a specific order.

It physically exists on the platters of an HDD (or cells of an SSD), regardless of filesystem. EXT4, XFS, BTRFS, ZFS, doesn’t matter.

However, for ZFS, this data is divided into blocks that follow certain rules, including the recordsize policy.

To read a file, it doesn’t simply grab all the bytes (that comprise the file) into RAM. It grabs all the blocks-on-disk.

Since these blocks-on-disk can exist as compressed and/or encrypted, they’re not necessarily bit-for-bit identical when held in RAM (decompressed and/or decrypted).

The smallest writable (unit) to construct a block-on-disk is the ashift value, which defaults to 4 KiB with TrueNAS. (This also has implications for compression efficiency.)

Whether your HDD is 512e or 4Kn, ZFS is only issuing 4 KiB writes/reads to/from the drives in the pool. So even if you don’t low-level format your HDD to pure 4Kn (i.e, no more “512-byte emulation”), you really won’t face any performance impact.

nas · May 28, 2024, 12:21am

Yep, I agree…what I am trying to do is create a visual graphic that anyone can look at and intuitively understand the reason why this is no go.

This doesn’t really make it any clearer, I’m sorry.
What is the difference between a stripe and a sector?
I understand in terms of RAID that a stripe spans multiple physical disks and is essentially the combination of sectors across the disks that are members of the stripe. Is that what you mean?
Does ZFS write the stripesize value to each disk within the stripe when it adds those disks to said stripe, such that each disk ‘knows’ something about the ‘greater storage reality’ they are a part of?

(Having said that, I can’t recall if I noticed the stripesize before or after I created the pool…)

Stux · May 28, 2024, 2:21am

Nothing is ever smaller than the disk’s sector size :-\

Even if a block ends up compressed to smaller than a sector size it will be padded out, but the record size is never configured smaller than the sector size (i believe)

Sara · May 28, 2024, 5:06am

These are two completely different things. You can’t really ask for differences, since they have nothing in common. Read up on what they are. When you understand what they are, you understand what the difference is.

In short:
sectorsize describes the (physical) size of sectors of the disk. This can’t really be changed (it could software wise, but then performance would tank).

The volblocksize describes how TrueNAS or ZFS presents blockstorage (not datasets) to the user.

The stripesize describes how big the stripe is. It behaves differently on datasets than on block storage, and it can be calculated for both. Take a RAIDZ1 with three drives as an example. If the volblocksize is 16k and the sectorsize is 4k, then for TrueNAS to present this block to the user, the system needs 4 times 4k of data (which equals the needed 16k).
The first stripe would be 2 times 4k data and 1 time 4k parity. Same for the second stripe. The stripesize for each would be 12k. For a 4k file that is written to a dataset, the stripe would be 4k data and 4k parity for a total stripesize of 8k. But if we put an 8k file on the same dataset, the stripesize would be 12k.

For a start, I think it is best to ignore ashift = 12 (sectorsize 4k) and just accept that it is 4k for any modern drive. That makes it easier to understand.

nas · May 28, 2024, 11:58am

I’ll try to better explain what I mean:

is like saying

instead of something like

Does this explain my confusion over that statement a little better?

This, I do understand (have done for years).

I thought that doing a low-level disk format is exactly that - physically reformatting the disk surface with different sector sizes. Am I wrong?

I think this is part of the confusion: from FreeBSD Mastery: Storage Essentials, Page 160-161:

This, to me, is a little too vague and it doesn’t obviously match the typical meaning of ‘stripe’ in the context of RAID, or even the context of creating a ‘striped’ pool in ZFS (which is essentially the same as a RAID stripe).

Continuing the quote:

The above quote, however, mentions naught about parity sectors/disks.

So…it reads as though my understanding of RAID stripes is the same. I.e when you say stripe, you mean a stripe across multiple disks in a striped RAID. ZFS is, after all, a form of software RAID.

This reads as though the stripe size changes according to the size of the file being stored…I don’t think this is really what you mean, though.

I suspect what you actually mean, @Sara, is that the size of the file being stored determines how many stripes are required based on how many data sectors are required to fit the file and how many data sectors exist per stripe.

I am simple-minded, so I find an analogy helpful: let’s talk in terms of buildings, floors and rooms.
I have 3 buildings: A, B and C, each with 10 floors (0-9), each floor with 10 rooms (0-9).
All 3 buildings together are a RAIDZ VDEV.
Each building is a hard disk.
Buildings A and B store objects, building C [magically] stores the parity of the objects.
Each floor in each building is equivalent to a head or disk surface.
Each room on each floor is equivalent to a physical sector and can contain a single object that is no bigger than 512 cubic feet.

Let’s pretend 1 cu. ft. = 1KB for the sake of this analogy.

If you want to store an object (a [compressed] file) that is small enough to store in one room, you store it in room0 on floor0 in Building A (bAf0r0), and you store the [magical] parity version of the object in room0 on floor0 in Building C (bCf0r0). Since bBf0r0 is empty, bCf0r0 would contain an exact copy of the object in bAf0r0.

If you need to store an object that occupies 2 rooms, you break the object into 2 halves, you store the first half in (say) bAf0r1, the second half in bBf0r1, and the [magical] parity object - that is the parity of both halves - in bCf0r1.

Ergo, objects (files) that occupy 1-2 rooms/sectors in A & B always consume a 3rd room in C for the object/file parity.

If you have an object that is greater than 1024 cu.ft. but less than or equal to 1536 cu.ft. you split the object into thirds (I know it would be split according to room/sector size - this is for the sake of argument), you store the 1st third in bAf0r2, the 2nd third in bBf0r2, the parity of the 1st and 2nd thirds in bCf0r2, the 3rd third in bAf0r3, then finally the parity of bAf0r3 and bBf0r3 in bCf0r3 (which, since bBf0r3 is empty, is a copy of the third in bAf0r3).

Objects that are bigger than this would be broken into chunks that are 512 cu. ft. in volume, stored in equivalently numbered rooms across A & B, and for each A/B room pair, would also have a parity object chunk stored in the same numbered room in C, until the whole object has been stored. And if the last object chunk is 12 cu. ft. and is an odd numbered chunk, ordinally speaking, it would be stored in bAfNrN, and a copy of it stored in bCfNrN (parity of bAfNrN, and bBfNrN which is empty).

The stripesize = 3 rooms (across 3 buildings), or 1536 cu.ft. (512 x 3).

Combining what I already know with all of the original post, I deduce that:

1 or more physical disk sectors comprise a stripe
based on the type of RAIDZ chosen and the number of disks in the VDEV, there will be 2 or more data sectors within the stripe and 1 or more parity sectors within the stripe (all sectors are writable data sectors in a stripe with no parity)
the overall stripe size is a multiple of the underlying physical sector size, e.g. if the physical disk sector size is 1KB and we have a 4-disk VDEV, the stripe size is 4KB
IF the VDEV uses parity (RAIDZx) then the number of available data sectors for storing files is the disk-number minus parity-disk-number (e.g. 5-disk VDEV configured in RAIDZ2 = 3 data sectors per stripe available for file write).
files are allocated (written) to ‘stripes’, not ‘sectors’, so the number of stripes occupied by any given file is as follows:
ROUNDUP ( FILESIZE / DATA_SECTORS_PER_STRIPE ) = STRIPES USED

Have I understood this correctly?

(I think the main thing I am learning in this thread is that files are written to a multiple of stripes, rather than sectors. They are of course written to physical sectors, but indirectly by being allocated to stripes).

Blocks are still ultimately stored on physical sectors, so I am not sure exactly how they differ or what you really mean by this, unless I have basically misunderstood stripes in this context and my building/floor/room analogy is not correct when talking ZFS and a ‘stripe’ in ZFS is fundamentally different from a ‘stripe’ in any other RAID technology (which I don’t think it is).

Sorry, we are getting into the weeds, a bit, here

ericloewe · May 28, 2024, 3:14pm

Quite correct. Although it is conceptually possible, nobody in their right mind would try to cram multiple filesystem blocks into a disk sector. It would be absolutely terrible.

As far as ZFS is concerned, the minimum ashift (and thus minimum practical recordsize) is bounded by the disk geometry, such that, for instance, a native 4k disk requires an ashift of 12 or greater and blocks smaller than 4k will be padded out to 4k.

HoneyBadger · May 28, 2024, 3:52pm

Disk devices will refuse to accept writes smaller than their native sector size - try to write a 512b sector to a 4Kn disk and it will throw I/O ERROR back at you.

Regarding the “stripe vs sector vs padding” argument, let me pull up an old post.

When you store data in a parity RAIDZ, you expect or hope that you receive storage efficiency in line with your vdev configuration - eg, if you’ve created a 6-wide RAIDZ2 (4+2) vdev, the hope is that you receive 4 drives of usable space out of those six. But for smaller blocks, this is less likely, because of how parity, record/volblocksize, ashift (sector size) and compression interact.

Let’s say you create a 6-wide Z2, using recommended ashift=12 yielding 4K as a minimum allocation unit, and you create a ZVOL with a 16KB volblocksize to hold your VMs.

If you happen to write 16K, and it doesn’t compress at all, then it will spread across all of the vdev members nicely - 16K/4K = 4 drives, and add two for parity. 66% space efficiency from raw, you’re happy.

But now you write 16K, and it compresses small enough to fit into 12K. You end up writing to 3 drives of data and 2 drives of parity - but hold on, RAIDZ also requires each allocation to be in multiples of P+1 (P being the number of parity drives, in your case 2), so you also end up with one drive holding padding data to make the total number of sectors fit the allocation rule of “must be a multiple of 3.” 50% space efficiency. So you’ve ended up with those six drives only yielding 3 drives of “usable space” - the same cost as a 3x2-way mirror vdev, but with significantly worse random I/O.

If it compresses even better - say, 8K - you’re going to write two sectors of data, and two of parity. But to meet the “must be multiple of 3” rule - add two sectors of padding. 33% efficiency. Now you’re using six spindles and getting 2 drives of “usable space” - the same as 3-way mirrors but with worse IOPS and worse redundancy.

Short version is that RAIDZ allocation rules and the resulting padding tend to eat up any space you think you were going to save from going RAIDZ (if you go too wide relative to your recordsize/volblocksize).

Compare the logical space used vs. the physical space allocated. You might find that it’s consuming the same as a mirror, and mirrors will definitely outstrip it from a performance perspective.

Note that the example there is a 6wZ2 - narrower RAIDZ and Z1 setups suffer less from this, as the P+1 rule for Z1 is “multiples of two”, which is easier to fit data into, and oftentimes you can actually realize the expected space efficiency in things like 5wZ1 when you use volblocksize=32K.
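The allocation rule described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, the block size passed in is the size after compression, and metadata and other pool overhead are ignored.

```python
import math


def raidz_allocation(block_bytes: int, width: int, parity: int, ashift: int = 12):
    """Return (data, parity, padding) sector counts for one block on RAIDZ.

    block_bytes is the block size after compression, width is the number of
    drives in the vdev, parity is 1, 2 or 3 for RAIDZ1/2/3.
    """
    sector = 1 << ashift                      # ashift=12 -> 4096-byte sectors
    data = math.ceil(block_bytes / sector)    # sectors holding actual data
    rows = math.ceil(data / (width - parity)) # stripes (rows) the data spans
    parity_sectors = rows * parity            # each row carries `parity` sectors
    used = data + parity_sectors
    # The total allocation must be a multiple of (parity + 1) sectors.
    total = math.ceil(used / (parity + 1)) * (parity + 1)
    return data, parity_sectors, total - used


def efficiency(block_bytes: int, width: int, parity: int, ashift: int = 12) -> float:
    """Fraction of the allocated sectors that hold data."""
    data, par, pad = raidz_allocation(block_bytes, width, parity, ashift)
    return data / (data + par + pad)


KIB = 1024
for size in (16 * KIB, 12 * KIB, 8 * KIB):
    d, p, pad = raidz_allocation(size, width=6, parity=2)
    print(f"6wZ2 {size // KIB:>2}K: data={d} parity={p} padding={pad} "
          f"efficiency={efficiency(size, 6, 2):.0%}")
```

For the 6-wide RAIDZ2 this reproduces the 66%, 50% and 33% figures above, and the same function gives 80% for a 5-wide RAIDZ1 with a 32K block.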

Sara · May 28, 2024, 9:13pm

It does. See, I think you have some misconceptions. Maybe I have some misconceptions, and what I wrote is completely wrong. That is why I posted it on github and in the forums, so that other smart people would call me out.

I am not sure what these misconceptions are, so I will try to explain it and maybe you can find out. That is why I think you should forget everything that you learned about ZFS and RAID! Read my GitHub post (on GitHub, not here! Here it is outdated and awfully formatted) with a fresh, open mind. Start by understanding datasets, then move to blockstorage. Just accept as fact that HDDs are 4k (ashift=12).

There are some strange 512e options, but in 99.99% of cases it will be 4k. Even for SSDs that may come with 8k, the controller is tuned for 4k, so you better leave it at 4k (last time I checked, 2023).

That quote talks about datasets, and files bigger than the max size (unlike zvol, this is not a fixed number but a max number). This has nothing to do with the underlying RAIDZ or mirror! This is just about how datasets behave. I think you should leave out this example and try to understand it after you have understood the things below that.

I would read that summary after you read the rest of the post.

Summary

If you set the max size (recordsize) of your dataset to 128k, you need four 128k stripes worth of data to get to your 512k file.

If you set the dataset recordsize to 64k, you would need 8 stripes.

If you set the dataset recordsize to 512k (possible since ZFS 2) you need one stripe.

Now, how these stripes are saved to the disk is what I describe in my GitHub post.
But for simplicity, let us use the default ashift of 12 (4k) and assume we have recordsize set to 128k.

To get to 512k we need four 128k stripes.
Each stripe should store 128k data.

If our pool is just one disk, a single stripe would be 32 data blocks (each 4k) and 0 parity blocks. 32 times 4k = 128k.
We write that stripe four times, and get our 512k file.

Now let’s assume our pool is three disks and a RAIDZ1.
A single stripe would still be 128k (because that is the max we set with recordsize).
To get to that 128k per stripe, we would still need 32 data blocks (each 4k).
Drive 1 gets a data block, drive 2 gets a data block, drive 3 gets a parity block (I think it is more complicated in reality, but just for simplicity, we will work with that).
By that logic, 16 data blocks will go on drive1, 16 data blocks will go on drive2, and 16 parity blocks will go on drive3. So we have a 48-block (192k) wide stripe to store a 128k data stripe. We do that four times and get our 512k file written.
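The arithmetic of this summary can be checked with a short Python sketch. It follows the simplified model used here (no compression, a file split into full records, 4k blocks) and the helper names are invented for illustration; what the summary calls a 128k "stripe" is called a record in the code.

```python
import math

BLOCK = 4 * 1024  # ashift=12 -> 4k blocks
KIB = 1024


def records_needed(file_bytes: int, recordsize: int) -> int:
    """How many records (the 'stripes' of the summary) a file is split into."""
    return math.ceil(file_bytes / recordsize)


def blocks_per_record(recordsize: int, width: int, parity: int):
    """Return (data_blocks, total_blocks) written for one full record."""
    data = math.ceil(recordsize / BLOCK)
    par = parity * math.ceil(data / (width - parity))
    total = data + par
    total = math.ceil(total / (parity + 1)) * (parity + 1)  # RAIDZ padding rule
    return data, total


file_size = 512 * KIB
for recordsize in (64 * KIB, 128 * KIB, 512 * KIB):
    print(f"recordsize {recordsize // KIB}k -> "
          f"{records_needed(file_size, recordsize)} record(s)")

# Single disk (no parity) versus a 3-disk RAIDZ1, both with recordsize=128k.
single = blocks_per_record(128 * KIB, width=1, parity=0)
raidz1 = blocks_per_record(128 * KIB, width=3, parity=1)
on_disk = records_needed(file_size, 128 * KIB) * raidz1[1] * BLOCK
print(f"single disk: {single}, 3-disk RAIDZ1: {raidz1}, "
      f"512k file occupies {on_disk // KIB}k on the RAIDZ1")
```

With these assumptions the 512k file takes four 48-block writes on the 3-disk RAIDZ1, so 768k of raw space for 512k of data, which is the 66% one would expect.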

Actually, this is exactly what I mean. Stripe size is always the same for blockstorage but not for datasets.

I think the naming here is causing some confusion. One could use stripe size for the actual stripe over the disks, one could use it for the actual data in a stripe, and a third one for how many stripes were needed to store a file of a certain size.

I love them too

How many floors there are is the TB size of the disk, so not really important for our analogy. We leave that out and assume the sky is the limit. To make it even simpler, every floor only has one room. So rooms are not important, ditch them too.

Please no. 512 is very niche, and makes it unnecessarily difficult to understand.
Let us use 4 freedom units instead (4 freedom units = 4k).

The size examples you are using are complicated, because they are special edge cases. I would suggest you start with these simpler examples.

No.
Much of it depends on what type of guy your landlord is.

Let us pretend our landlord is a dataset guy: We put the object in a room in building A and a parity in a room in building B. That way building A can catch on fire and we don’t care. Downside is we have to rent two floors to store one object. We only get 50% instead of the 66% we naturally assume to get with a 3 wide RAIDZ1.

Let us pretend our landlord is a zvol guy: He only rents out floors bundled in a 16 freedom units floor contract (or 4 floors). That is because 16 freedom units is the new default of ZFS zvols volblocksize. So we rent floor1 in A, floor1 in B, floor1 in C and floor2 in A. We pay four floors.

We put the object1 in floor1 in A and a parity in floor1 in B. He does not care if we find another object to store, he still will charge us four floors!

But we are lucky, the OS gives us another 4 freedom units object called object2. We put that object in floor1 in C and floor2 in A.

But there is a catch. If we wanna change any object, our landlord asks us to completely cancel our contract! We have to move everything out of the houses! So we move all our stuff out of the house, sign the contract again, then move everything back in. (I think for ZFS it actually only asks us to rewrite the parity, since that is the only thing that changed besides the object we changed itself, but I am not sure). That is read write amplification!
This happens when our objects are smaller than your contract!

You mean something that is 8 freedom units big

Let us pretend our landlord is a dataset guy: He does not care; as long as we rent no more than 128 freedom units (a default that can be changed), he will rent us a single contract. If we rent more than 128 freedom units but leave the default at 128k, he will split it up into two contracts. Not the file object, just the contract! Because he never rents out more than 32 floors (each 4 in size) in one single contract.

So we split our 8 freedom units big object into two. Now we have two 4 freedom unit objects.

We put object1 in a floor1 in building A, object2 in a floor1 in B. A third magic parity we put in floor1 in building C.

That is better, since we store 8 freedom units objects and pay 12 freedom units for that. 66% instead of the 50% from the example before. Hurray! Finally something that RAIDZ does better than mirror!

Let us pretend our landlord is a zvol guy: He only rents out floors bundled in a 16 freedom units floor contract (or 4 floors). That is because 16 freedom units is the new default of ZFS zvols volblocksize. So we rent floor1 in A, floor1 in B, floor1 in C and floor2 in A. We pay four floors.

We use only 3. Object1 in floor1 building A, object2 in floor1 in building B and parity in floor1 in building C. We have one floor left. Floor2 in building A is empty. Maybe the OS will find us another 4 freedom unit object so we can fill this block and recalculate the parity. Maybe not.

Protopia · March 26, 2025, 10:26pm

I think it would be useful to define what we mean by “big files” - do we mean (say) 256KB or more (which are pretty small files in my book compared to 1GB+ media streaming files) or do we really mean BIG, ENORMOUS, GIGANTIC files of (say) 128MB or more?

Yes - but in hardware RAID5 the storage efficiency would be 33% because you would have to use a whole stripe of 12KB for storing 4KB. With RAIDZ you may get worse efficiency than expected because you only have 1 data block and 1 parity block, but it will never drop below 50% and the unused block will be used by another short record.

I don’t understand this dividing by two and rounding up. The examples have already taken into account the stripe size (8KB = 2 blocks) - so I am unclear where this divisor by two comes with padding to then round up to a whole number.

Nevertheless, these examples all show that RAIDZ has at least as good storage efficiency as mirrors and often better. So IF the issue for VMs/zVols/iSCSI/databases was actually storage efficiency, then wouldn’t RAIDZ win out every time?

But instead, as I understand it, the real case for mirrors is VMs/zVols/iSCSI and databases, where the software has its own file system that ZFS knows nothing about. These do multiple reads and writes of only 4KB, and they are random, so ZFS cannot predict where they will be on the disk and cannot do anything to optimize them. These workloads also tend to have more intense IO, need high IOPS, and tend to need synchronous writes too. On top of this, you can also get read and write amplification: e.g. a database has a page size of 4KB but the zVol block size is 16KB, so to read a 4KB page you have to read a 16KB record, and to write a random 4KB you need to read 16KB, replace the 4KB and then write 16KB. These are the reasons to use mirrors for these workloads - it is the lower IOPS and the mismatch between large stripe sizes and the smaller zVol blocksize that make RAIDZ a bad choice for these workloads.
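The read and write amplification in the 4KB-page-on-16KB-block example can be put into numbers with a small Python sketch. It is a simplification under stated assumptions: the page is aligned inside one zvol block, nothing is served from ARC, compression is off, and parity and padding are ignored. The function name is made up for illustration.

```python
def amplification(page_bytes: int, volblock_bytes: int) -> dict:
    """Bytes moved per useful byte when one aligned page is read or updated."""
    if page_bytes > volblock_bytes:
        raise ValueError("this sketch only covers pages no larger than a block")
    partial = page_bytes < volblock_bytes

    bytes_read_for_read = volblock_bytes                      # whole block is read
    bytes_read_for_write = volblock_bytes if partial else 0   # read-modify-write
    bytes_written = volblock_bytes                            # whole block is rewritten

    return {
        "read_amp": bytes_read_for_read / page_bytes,
        "write_amp": bytes_written / page_bytes,
        "rmw_total_amp": (bytes_read_for_write + bytes_written) / page_bytes,
    }


KIB = 1024
print(amplification(4 * KIB, 16 * KIB))  # the 4KB page / 16KB volblocksize case
print(amplification(4 * KIB, 4 * KIB))   # matching sizes: no amplification
```

In this model a random 4KB update on a 16KB block moves 32KB in total (16KB read plus 16KB written), eight times the data that actually changed.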

However, leaving aside mirrors vs. RAIDZ, these frequent small random accesses will cause a lot of seeks on HDDs - so putting them on SSDs avoids the performance hit from these seeks.

So now it seems we should be planning SSD mirrors for these zVol etc. workloads.

In addition these workloads also typically need synchronous writes for data integrity - and if they are on HDDs you pretty much have to have an SSD SLOG to get performance that is at all acceptable (because every ZIL write will have a seek) - but perhaps you can live with synchronous performance without a separate SLOG if the data is on SSD, especially if the data is on NVMe or Optane SSD.

So it seems to me (and I could be quite wrong here), that the rules of thumb are:

NVMe (or at least SSD) mirrors for your zVols and databases, perhaps with SLOG if you have even faster technology; and for these…
Keep the zVols to a minimum and put your sequential files on normal datasets accessed over NFS (probably still on SSDs) so that you can avoid the performance hit of synchronous writes and benefit from e.g. ZFS sequential prefetch;
Put your inactive sequential data, mostly at-rest data on HDD and make sure that most of the metadata accesses are satisfied from ARC, or a special allocation (metadata) vDev.

Obviously for enterprise usage, your active data and zVols and database files can still be bigger than SSDs can handle, and HDD mirrors are then the fallback, and then you absolutely need to invest in fast SLOGs.

Sara · March 27, 2025, 7:14am

Will edit that.

Fair point. I am not really concerned about RAID5, since it is IMHO dead, and more about how it compares to a ZFS mirror. But I will think about how I could integrate that 33% info.

The padding is used to avoid creating an “unfillable hole” later.
ZFS allocates space on RAIDZ vdevs in even multiples of p+1 sectors to prevent unusably small gaps on the disk. p is the number of parity drives, so for the RAIDZ1 3 drives wide from the example you quoted, this would be 1+1=2.

So as I understand it, an allocation on that RAIDZ1 can only be 2, 4, 6, 8, 10, 12… blocks, or 8k, 16k, 24k…
My assumptions are based on OpenZFS Capacity Calculator explanation.

On a technical nitpick, this is not true. A RAIDZ2 with 4 drives and a volblocksize of 16k will offer worse efficiency than a mirror. But sure, that is a strange edge case.

Reading on about your performance stuff, I think there is some misunderstanding of what I am trying to say.
I am 100% of the opinion that you should only use mirrors for blockstorage!
So I totally agree with:

But people seem to still ignore this recommendation out of “storage greed”.
The Proxmox forum is full of them. They all think since they only use SSDs, they no longer suffer from the fragmentation problem (which is kinda true) so they can use RAIDZ.

So this github post isn’t about how much better mirror is. This github post is that even if you decide against mirror and go with RAIDZ because of your storage greed, you will probably fall flat on your face!

Imagine a John with 10 disks.
For some obscure reasons I don’t have to understand, he wants to use a Windows Server with a huge virtual disk.
From that Windows Server VM he wants to share files for his small company. He creates a RAIDZ2 over these 10 disks, thinking that he will get 80% storage efficiency.
He knows that performance will be poor, but the storage greed is stronger.
What he does not understand is that he will not get 80% but only 66%.
That is pretty damn near the 50% a mirror offers.
If he had known beforehand, he might have decided to go with mirrors instead, having realized that there is not that much of a storage gain by going down the RAIDZ route.
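John's numbers can be reproduced with a short Python sketch. It assumes ashift=12, the default volblocksize of 16k, incompressible data and no other pool overhead; the function name is made up for illustration.

```python
import math


def zvol_efficiency(volblock_bytes: int, width: int, parity: int,
                    ashift: int = 12) -> float:
    """Data sectors divided by allocated sectors for one zvol block on RAIDZ."""
    sector = 1 << ashift
    data = math.ceil(volblock_bytes / sector)
    par = parity * math.ceil(data / (width - parity))
    total = math.ceil((data + par) / (parity + 1)) * (parity + 1)
    return data / total


KIB = 1024
WIDTH, PARITY = 10, 2

naive = (WIDTH - PARITY) / WIDTH  # what John expects from 10 disks in RAIDZ2
print(f"expected: {naive:.0%}, mirror: 50%")
for volblock in (16 * KIB, 32 * KIB, 64 * KIB, 128 * KIB):
    eff = zvol_efficiency(volblock, WIDTH, PARITY)
    print(f"volblocksize {volblock // KIB:>3}k: {eff:.1%}")

# The edge case mentioned earlier in this post: 4-wide RAIDZ2 at 16k.
print(f"4-wide RAIDZ2, 16k: {zvol_efficiency(16 * KIB, 4, 2):.1%}")
```

With the default 16k volblocksize the 10-disk RAIDZ2 lands at about 66%. In this model a larger volblocksize such as 64k gets closer to, but still does not reach, the 80% John was hoping for, and larger blocks come with their own read and write amplification cost for small I/O.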

Great rules, I will try to implement them into the conclusion.

