I’ve been doing an in-place rebalance of a mirror pool consisting of 4 vdevs and…I’m basically through (about 500 G to go), but I just don’t see the expected result.
Here is what this pool’s occupancy currently looks like:
As you can see, pool occupancy is about 25%, and I expected all vdevs to be at around that same CAP. That didn’t happen, and I’m curious whether I did something wrong or whether we’re looking at an anomaly here.
My understanding is that steady-state pool IOPS falls sharply from 10% occupancy onward (link here, p.6).
I would have expected ZFS to try to maintain per-vdev occupancy parity at all times so that IOPS stays relatively consistent across the pool, but that doesn’t seem to be the case. I also don’t think that having different-sized vdevs should somehow skew this strategy.
The only explanation I have for the observed behaviour is that ZFS doesn’t target equal CAP, but rather equal ALLOC…but that makes no sense to me.
What don’t I know here?
EDIT: I removed the snapshots from the pool prior to the rebalance.
My understanding of the ZFS vdev strategy is that it is based on neither CAP nor ALLOC, but rather on vdev latency. Let me explain it as best I understand it.
Say you have a bunch of data you want to write. ZFS starts breaking it up into allocation units and, in your case of 4 vdevs, sends 1 allocation unit to each vdev and then waits on the drives. The first vdev that comes back and says it’s done gets the next allocation unit, and so on until all the data has been written.
At first glance this seems like it’s allocating data evenly to all the vdevs, and initially it will, but as the vdevs start to fill up the story changes. Take your system: as mirror-0 and mirror-1 start to get more full (they’re smaller, after all), their seek times will naturally start to increase. They’ll start to take longer to complete their writes than the other 2 vdevs, so writes will become increasingly biased toward the larger vdevs. Overall, per-vdev latency will remain similar.
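To make that model concrete, here’s a toy simulation of what I’m describing (entirely my own sketch, not ZFS code: the vdev sizes match yours, but the latency curve and all the numbers are invented):

```c
#include <stdio.h>

#define NVDEV 4

int
main(void)
{
	double size_tb[NVDEV]  = { 10, 10, 14, 20 };  /* per-vdev capacity */
	double alloc_tb[NVDEV] = { 0, 0, 0, 0 };
	double ready_at[NVDEV] = { 0, 0, 0, 0 };      /* fake clock */

	/*
	 * Write 30 TB in 0.01 TB units; each unit goes to whichever
	 * vdev becomes ready first.
	 */
	for (int n = 0; n < 3000; n++) {
		int best = 0;
		for (int i = 1; i < NVDEV; i++)
			if (ready_at[i] < ready_at[best])
				best = i;
		alloc_tb[best] += 0.01;
		/*
		 * Invented latency model: service time grows as the
		 * vdev fills and seeks get longer.
		 */
		double cap = alloc_tb[best] / size_tb[best];
		ready_at[best] += 1.0 + 10.0 * cap * cap;
	}
	for (int i = 0; i < NVDEV; i++)
		printf("vdev %d: %5.1f / %2.0f TB (%2.0f%% CAP)\n", i,
		    alloc_tb[i], size_tb[i], 100.0 * alloc_tb[i] / size_tb[i]);
	return (0);
}
```

With these made-up numbers the smaller mirrors settle at a higher CAP than the larger ones, since every vdev keeps receiving data whenever it reports ready, and fuller drives simply report ready less often.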
At least that’s my understanding of the system; someone more knowledgeable, please correct me if I got that wrong.
I don’t think so? I’d like to look at the code but the way I’d naively do this is by keeping the CAP across all pool members equal.
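For what it’s worth, the naive scheme I have in mind would look roughly like this (purely my own illustration of the idea; as the replies here make clear, it is not what ZFS actually does):

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct vdev {
	uint64_t space;	/* total usable bytes */
	uint64_t alloc;	/* bytes already allocated */
};

/*
 * Pick the vdev with the lowest CAP (alloc/space), i.e. always write
 * wherever occupancy is currently lowest.
 */
static size_t
pick_vdev(const struct vdev *v, size_t n)
{
	size_t best = 0;
	double best_cap = (double)v[0].alloc / (double)v[0].space;

	for (size_t i = 1; i < n; i++) {
		double cap = (double)v[i].alloc / (double)v[i].space;
		if (cap < best_cap) {
			best = i;
			best_cap = cap;
		}
	}
	return (best);
}

int
main(void)
{
	/* Made-up pool: three mirrors at 25% CAP, one at 20%. */
	struct vdev pool[4] = {
		{ 10000, 2500 }, { 10000, 2500 },
		{ 14000, 3500 }, { 20000, 4000 },
	};

	printf("next write goes to vdev %zu\n", pick_vdev(pool, 4));
	return (0);
}
```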
Interesting - so ZFS actually keeps track of write op timings per VDEV and then decides where to actually write the data? Quite sophisticated, I did not expect this.
So in the example from the OP, ZFS preferentially writes to the 10 TB drives over the 14 and 20 TB ones? I’d expect newer and bigger drives to perform “better”…as in, have lower write seek times, all other things being equal?
Not quite that sophisticated; you might be overthinking it. It prefers whatever vdev is “ready”. Since all your drives are relatively large and still mostly empty, the allocation of data has remained roughly even.
If you were to, say, triple the amount of data you’re storing, the allocation pattern would change. The 10T vdevs would be nearly full and their writes would become slow, so those drives would be ready for more data less often than the larger vdevs.
If my [AI-assisted] interpretation of the OpenZFS code is correct, vdev write queues are populated on a least-busy basis with a bias toward the least full. The more imbalanced they become (space-wise), the stronger this bias becomes. So there’s likely to be some space imbalance, but there’s a limit to how bad it’ll get (in theory).
- mc_free is the free space across all vdevs in the metaslab class
- mg_bias is the bias value that will influence allocation decisions
The mg_bias value is calculated within the metaslab allocation code, specifically in the section that handles vdev selection for writes. This calculation occurs after a successful allocation but before determining whether to rotate to the next vdev:
- vs_free: the free space on the current vdev, calculated as vs->vs_space - vs->vs_alloc
- vs_space: the total usable space on the vdev
- vs_alloc: the space already allocated on the vdev
- mc->mc_alloc_groups: the number of metaslab groups (top-level vdevs) in the pool
- mc_free: the total free space across all vdevs in the metaslab class, calculated as mc->mc_space - mc->mc_alloc
- mg->mg_aliquot: the base allocation granularity value (the metaslab_aliquot tunable) for the current metaslab group
The calculation happens in two steps:
First, it calculates a ratio that compares this vdev’s free space to the average free space across all vdevs (a worked example follows the list):
- If ratio = 100, this vdev has exactly the average amount of free space
- If ratio > 100, this vdev has more free space than average
- If ratio < 100, this vdev has less free space than average
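To put hypothetical numbers on that ratio: take a pool shaped like the OP’s at 25% CAP everywhere, i.e. free space of roughly 7.5T + 7.5T + 10.5T + 15T = 40.5T across 4 vdevs. Since the ratio is this vdev’s free space over the pool-wide average (40.5T / 4 ≈ 10.1T), a 10T mirror lands at 7.5 / 10.1 × 100 ≈ 74, while the 20T mirror lands at 15 / 10.1 × 100 ≈ 148. Note that equal CAP does not mean a ratio of 100: the comparison is in absolute free space, which is exactly where differently sized vdevs diverge.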
Then it calculates the actual bias (a code sketch follows the list) by:
- Subtracting 100 from the ratio (yielding a positive or negative value)
- Multiplying by the aliquot value (scaling the bias by the base allocation size)
- Dividing by 100 (converting the percentage back to a proportion)
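Putting the two steps together, here’s a minimal standalone sketch of the computation as I read it in OpenZFS’s metaslab.c (the structure fields are inlined as plain parameters, and the pool numbers below are invented):

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Sketch of the bias computation described above: ratio compares this
 * vdev's free space to the pool-wide average, then the bias scales the
 * aliquot by how far the ratio sits from 100.
 */
static int64_t
mg_bias_sketch(int64_t vs_free, int64_t mc_free,
    int64_t mc_alloc_groups, int64_t mg_aliquot)
{
	/* The +1 guards against dividing by zero (as I read the original). */
	int64_t ratio = (vs_free * mc_alloc_groups * 100) / (mc_free + 1);

	return (((ratio - 100) * mg_aliquot) / 100);
}

int
main(void)
{
	/*
	 * Hypothetical 4-vdev pool at ~25% CAP: free space of 7.5T,
	 * 7.5T, 10.5T and 15T, expressed in GiB (the ratio is unitless,
	 * so the unit only has to be consistent). Aliquot of 1M.
	 */
	int64_t free_gib[4] = { 7680, 7680, 10752, 15360 };
	int64_t mc_free = 7680 + 7680 + 10752 + 15360;
	int64_t aliquot = 1024 * 1024;

	for (int i = 0; i < 4; i++) {
		int64_t bias = mg_bias_sketch(free_gib[i], mc_free, 4, aliquot);
		printf("vdev %d: mg_bias = %+lld bytes\n", i, (long long)bias);
	}
	return (0);
}
```

With these numbers, the two vdevs with the most absolute free space come out with a positive bias and the 10T mirrors with a negative one, matching the interpretation below.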
The resulting mg_bias value directly affects when ZFS rotates to the next vdev:
```c
if (atomic_add_64_nv(&mca->mca_aliquot, asize) >= mg->mg_aliquot + mg->mg_bias)
```
- For vdevs with more free space than average: mg_bias is positive, allowing the vdev to receive more data before rotation
- For vdevs with less free space than average: mg_bias is negative, causing the vdev to receive less data before rotation
- For vdevs with average free space: mg_bias is zero, and standard rotation occurs
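In concrete terms, using the hypothetical numbers from above and assuming a 1M aliquot: the vdev at ratio 148 gets about 1M + 0.48M ≈ 1.5M per trip around the rotor, while one at ratio 74 gets about 1M - 0.26M ≈ 0.74M, i.e. roughly a 2:1 bias in favour of the vdev with more absolute free space. That nudges free space toward parity over time, but it is a nudge rather than a hard target, which would fit the residual imbalance in the OP.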
Regarding your imbalance… what is the typical recordsize in your pool? Recordsizes in excess of metaslab_aliquot appear to reduce the effectiveness of the balancing mechanism…
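If you want to check both sides of that comparison, `zfs get recordsize <pool/dataset>` shows the dataset side, and on Linux the module parameter should be readable via `cat /sys/module/zfs/parameters/metaslab_aliquot` (that path assumes the in-kernel zfs module; adjust for your platform).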