24.10 RC2 Raidz expansion caused miscalculated available storage

I know. To “take something out of the equation” is a fancy way of saying that. :wink: Not sure what the equivalent figure of speech is over there in France.

EDIT: To show the figure of speech in a non-technical example:

Assistant: “This was before he found out what they did to his son!”

Detective: “That takes revenge completely out of the equation.”

(Detective is basically saying that revenge has nothing to do with the murder case.)

Sorry, I never thought it was part of the equation. :slight_smile:

Meanwhile, now with an 8-wide RaidZ1:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  7.98T  5.92M  7.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  5.19T   128K  /mnt/test_pool

Again, 0.66T increase… instead of 1T

5.19T when we expect about 7T.

Now 9-wide:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  8.98T  6.17M  8.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  4.11M  5.86T   128K  /mnt/test_pool

This can’t be attributed solely to the GUI, either, since the zfs / zpool commands display the “wrong math”.

To re-use the meme: “How much space do I really have on my pool?”

What is a user to believe? Can we really blame them for being confused or frustrated?

And finally, 10-wide:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  9.98T  6.04M  9.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  6.52T   128K  /mnt/test_pool

Which means it appears that 3.5T of parity overhead is present, when in reality it’s only 1T. In other words, the parity overhead appears to be about 3.5x higher than it should be.

1 Like

Essentially, the original parity ratio is being used to calculate the space.

It would be even worse if I had skipped the GUI and started 2-wide. Then I would expect to see only 0.5T increases per extension.
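
For what it’s worth, the reported AVAIL figures line up almost exactly with that theory. A rough sanity check, assuming the vdev started out 3-wide (which the ~0.66T-per-disk increments point to, since a 3-wide RaidZ1 at ashift=12 gets a deflation ratio of 341/512 ≈ 0.666) and that ZFS is holding back roughly 128 GiB of slop space:

# reported AVAIL ≈ zpool SIZE × original deflation ratio − slop (both assumed values)
awk 'BEGIN {
    split("7.98 8.98 9.98", size)        # zpool SIZE at 8-, 9- and 10-wide
    for (i = 1; i <= 3; i++)
        printf "%.2fT raw -> %.2fT avail\n", size[i], size[i] * 341/512 - 0.125
}'
# prints ~5.19T, ~5.86T and ~6.52T -- matching the zfs list outputs above

A 2-wide RaidZ1 would get a ratio of exactly 256/512 = 0.5, which is where the 0.5T-per-disk prediction above comes from.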

2 Likes

Congratulations on your newly expanded, empty, 10-wide RAIDZ1 vdev, constructed with 1-TiB drives! :partying_face:

Enjoy your 9-TiB capacity pool…

NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  6.52T   128K  /mnt/test_pool

…oh wait. :neutral_face:

1 Like

Now I’m curious: what will happen if you dd a 6.5-TiB file (from /dev/urandom)? Will the pool claim you’re at 99% capacity? Will it “change” the pool’s actual capacity when it realizes what a liar it is, and then “correct” itself to claim you’re at 75% capacity?
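
Something like this, presumably (the path and byte count are only illustrative; /dev/urandom just keeps compression from skewing the result):

# hypothetical fill test: ~6.5 TiB of incompressible data written straight to the dataset
dd if=/dev/urandom of=/mnt/test_pool/filltest bs=1M count=6815744 status=progress
zfs list test_pool && zpool list test_pool    # see which capacity figure "wins"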

Well. The actual host pool only has 4T available[1]… heh, so it would be bad :wink:

Meanwhile, as I mentioned before, this is the relevant quote from the relevant bug:

/*
 * Compute the raidz-deflation ratio.  Note, we hard-code 128k (1 << 17)
 * because it is the "typical" blocksize.  Even though SPA_MAXBLOCKSIZE
 * changed, this algorithm can not change, otherwise it would inconsistently
 * account for existing bp's.  We also hard-code txg 0 for the same reason
 * since expanded RAIDZ vdevs can use a different asize for different birth
 * txg's.
 */
static void
vdev_set_deflate_ratio(vdev_t *vd)
{
        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
                vd->vdev_deflate_ratio = (1 << 17) /
                    (vdev_psize_to_asize_txg(vd, 1 << 17, 0) >>
                    SPA_MINBLOCKSHIFT);
        }
}
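
To put numbers on that hard-coded formula: for a single-parity vdev at ashift=12, a 128K block needs 32 data sectors plus ceil(32 / (width - 1)) parity sectors, rounded up to a multiple of nparity + 1. A back-of-the-envelope version of what vdev_set_deflate_ratio() ends up with (my own simplification, not the actual ZFS code; a ratio of 512 corresponds to 1.0):

# approximate deflation ratio vdev_set_deflate_ratio() computes for raidz1 at ashift=12
awk 'BEGIN {
    for (w = 2; w <= 10; w++) {
        data = 32                                  # 128K / 4K sectors
        parity = int((data + w - 2) / (w - 1))     # ceil(data / (w - 1))
        tot = data + parity; tot += tot % 2        # round up to multiple of nparity + 1
        ratio = int(131072 / (tot * 4096 / 512))
        printf "%2d-wide: %d/512 = %.3f\n", w, ratio, ratio / 512
    }
}'
# 3-wide gives 341/512 = 0.666, while 10-wide would give 455/512 = 0.889 -- but after
# expansion the vdev keeps the ratio it was created with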

So, if the code can’t be changed because it’s used for BP calculations, it seems like NEW code needs to be added for the ALLOC stats, and the BP calculations need to keep using the old code.

As mentioned before, I don’t know all the dependencies and requirements at this level of the ZFS code… but this situation is not good for end users.


  1. which would be why I’m currently replacing 8T disks with 22T disks… I can run that test when it finishes… assuming that the expansion works and that isn’t buggy too. ↩︎

3 Likes

Interesting link

https://openzfs.topicbox.com/groups/developer/Tf89af487ee658da3/raidz-overhead-with-ashift12

From when the “Theory of RaidZ Space Accounting” was merged into Illumos.

(and this thread is already getting search hits for the subject above)

… 90 posts, quite a lot.
Is this something that would be fixed in RC.3, or is it just the way it is?

A bit like “healing receive”.

2 Likes

Additionally… how does this impact the 80% warning that TrueNAS likes to use to encourage you to extend your RaidZ pool… and the 90%/95% cliff?

2 Likes

I doubt that a change to RaidZ Space Accounting will be made between now and the Electric Eel release.

1 Like

From the original commit:

When the expansion completes, the additional space is available for use, and is reflected in the available zfs property (as seen in zfs list, df, etc).

It is not fully reflected.

RAIDZ vdev’s “assumed parity ratio” does not change, so slightly less space than is expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.

And the above does not seem to fully capture the issue, except that it does state that the “assumed parity ratio” does not change, and that is what is used to determine how much space is available from the set of disks… apparently.

3 Likes

I think that adding drives while keeping the files on the original drives is the problem.
After expansion, new files are distributed among all the drives, but the old files remain on the prior (original) drives.
I think that if the data were reshuffled among the drives after the new setup, even though slow, it would solve the problem.

If they don’t fix it to our expectations, I’ll become a dev myself. Give me some time and I’ll fix it. Say… ’bout 10 years? Yeah. :baby:

If it wasn’t clear, I did my tests with an empty pool.

The issue is (I believe) that the capacity is determined by multiplying the size of a vdev member by the number of members, and then by the original parity ratio (simplifying).
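
Something roughly equivalent to my test can be reproduced from the CLI with sparse files (illustrative paths, and not the exact steps I used, since I went through the GUI):

# hypothetical empty-pool reproduction with sparse file vdevs
truncate -s 1T /tmp/d1 /tmp/d2 /tmp/d3
zpool create test_pool raidz1 /tmp/d1 /tmp/d2 /tmp/d3
for i in $(seq 4 10); do
    truncate -s 1T /tmp/d$i
    zpool attach -w test_pool raidz1-0 /tmp/d$i    # raidz expansion, one disk at a time
    zfs list test_pool                             # AVAIL only grows by ~0.66T per added disk
done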

2 Likes

True. Hmm… hope they do something about it. All this is so above me… :man_shrugging:

If it were strange calculations by the middleware, it could be fixed at any time.
But this is deep in the ZFS code. However faulty it might be (hard-coded assumptions :scream:, including quite possibly a wrong one about block size?), it will take LOTS of time and effort to fix. Possibly on the same order of magnitude as raidz expansion itself.

3 Likes

Do the ZFS devs even use ZFS regularly? How could they give a nod to this new feature (for OpenZFS 2.3)[1] and think, “Yeah, this is fine”?

Did no one during development say, “Hey, guys. This doesn’t make sense and it might confuse the users”?

It’s such a glaring issue with pragmatic, real-world impact.

Love him or hate him, but having a “Steve Jobs” type of person at the helm bridges the gap between developers and end-users.


  1. Sure, OpenZFS 2.3 is still in its RC stage, as of this post, but it’s unlikely that a release-candidate will see such a correction before its stable release. ↩︎

1 Like

That’s more or less what Matthew Ahrens had to say (though he was talking about a slightly different part of this issue, I think):

…which is a project of similar scale to RAIDZ Expansion