24.10 RC2 Raidz expansion caused miscalculated available storage

I know. To “take something out of the equation” is a fancy way of saying that. :wink: Not sure what the equivalent figure of speech is over there in France.

EDIT: To show the figure of speech in a non-technical example:

Assistant: “This was before he found out what they did to his son!”

Detective: “That takes revenge completely out of the equation.”

(Detective is basically saying that revenge has nothing to do with the murder case.)

Sorry, I never thought it was part of the equation. :slight_smile:

Meanwhile, now with an 8-wide RaidZ1:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  7.98T  5.92M  7.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  5.19T   128K  /mnt/test_pool

Again, 0.66T increase… instead of 1T

5.19T when we expect about 7T.

Now 9-wide:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  8.98T  6.17M  8.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  4.11M  5.86T   128K  /mnt/test_pool

This can’t be attributed solely to the GUI, either, since the zfs / zpool commands display the “wrong math”.

To re-use the meme: “How much space do I really have on my pool?”

What is a user to believe? Can we really blame them for being confused or frustrated?

And finally, 10-wide:

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  9.98T  6.04M  9.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  6.52T   128K  /mnt/test_pool

Which means it appears that 3.5T of parity overhead is present, when in reality it’s only 1T. In other words, the parity overhead appears to be about 3.5x higher than it should be.

1 Like

Essentially, the original parity ratio is being used to calculate the space.

It would be even worse if I had skipped the GUI and started 2-wide. Then I would expect to see only 0.5T increases per extension.
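
For what it’s worth, the reported AVAIL figures line up almost exactly with that theory. A rough sanity check, assuming the vdev started out 3-wide (which the ~0.66T-per-disk increments point to, since a 3-wide RaidZ1 at ashift=12 gets a deflation ratio of 341/512 ≈ 0.666) and that ZFS is holding back roughly 128 GiB of slop space:

# reported AVAIL ≈ zpool SIZE × original deflation ratio − slop (both assumed values)
awk 'BEGIN {
    split("7.98 8.98 9.98", size)        # zpool SIZE at 8-, 9- and 10-wide
    for (i = 1; i <= 3; i++)
        printf "%.2fT raw -> %.2fT avail\n", size[i], size[i] * 341/512 - 0.125
}'
# prints ~5.19T, ~5.86T and ~6.52T -- matching the zfs list outputs above

A 2-wide RaidZ1 would get a ratio of exactly 256/512 = 0.5, which is where the 0.5T-per-disk prediction above comes from.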

2 Likes

Congratulations on your newly expanded, empty, 10-wide RAIDZ1 vdev, constructed with 1-TiB drives! :partying_face:

Enjoy your 9-TiB capacity pool…

NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.93M  6.52T   128K  /mnt/test_pool

…oh wait. :neutral_face:

1 Like

Now I’m curious: what will happen if you dd a 6.5-TiB file (from /dev/urandom)? Will the pool claim you’re at 99% capacity? Will it “change” the pool’s actual capacity when it realizes what a liar it is, and then “correct” itself to claim you’re at 75% capacity?
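
Something like this, presumably (the path and byte count are only illustrative; /dev/urandom just keeps compression from skewing the result):

# hypothetical fill test: ~6.5 TiB of incompressible data written straight to the dataset
dd if=/dev/urandom of=/mnt/test_pool/filltest bs=1M count=6815744 status=progress
zfs list test_pool && zpool list test_pool    # see which capacity figure "wins"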

Well. The actual host pool only has 4T available[1]… heh, so it would be bad :wink:

Meanwhile, as I mentioned before, this is the relevant quote from the relevant bug:

/*
 * Compute the raidz-deflation ratio.  Note, we hard-code 128k (1 << 17)
 * because it is the "typical" blocksize.  Even though SPA_MAXBLOCKSIZE
 * changed, this algorithm can not change, otherwise it would inconsistently
 * account for existing bp's.  We also hard-code txg 0 for the same reason
 * since expanded RAIDZ vdevs can use a different asize for different birth
 * txg's.
 */
static void
vdev_set_deflate_ratio(vdev_t *vd)
{
        if (vd == vd->vdev_top && !vd->vdev_ishole && vd->vdev_ashift != 0) {
                vd->vdev_deflate_ratio = (1 << 17) /
                    (vdev_psize_to_asize_txg(vd, 1 << 17, 0) >>
                    SPA_MINBLOCKSHIFT);
        }
}
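
To put numbers on that hard-coded formula: for a single-parity vdev at ashift=12, a 128K block needs 32 data sectors plus ceil(32 / (width - 1)) parity sectors, rounded up to a multiple of nparity + 1. A back-of-the-envelope version of what vdev_set_deflate_ratio() ends up with (my own simplification, not the actual ZFS code; a ratio of 512 corresponds to 1.0):

# approximate deflation ratio vdev_set_deflate_ratio() computes for raidz1 at ashift=12
awk 'BEGIN {
    for (w = 2; w <= 10; w++) {
        data = 32                                  # 128K / 4K sectors
        parity = int((data + w - 2) / (w - 1))     # ceil(data / (w - 1))
        tot = data + parity; tot += tot % 2        # round up to multiple of nparity + 1
        ratio = int(131072 / (tot * 4096 / 512))
        printf "%2d-wide: %d/512 = %.3f\n", w, ratio, ratio / 512
    }
}'
# 3-wide gives 341/512 = 0.666, while 10-wide would give 455/512 = 0.889 -- but after
# expansion the vdev keeps the ratio it was created with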

So, if the code can’t be changed because it’s used for BP calculations, it seems like NEW code needs to be added for the ALLOC stats, and the BP calculations need to keep using the old code.

As mentioned before, I don’t know all the dependencies and requirements at this level of the ZFS code… but this situation is not good for end users.


  1. which would be why I’m currently replacing 8T disks with 22T disks… I can run that test when it finishes… assuming that the expansion works and that isn’t buggy too. ↩︎

3 Likes

Interesting link

https://openzfs.topicbox.com/groups/developer/Tf89af487ee658da3/raidz-overhead-with-ashift12

From when the “Theory of RaidZ Space Accounting” was merged into Illumos.

(and this thread is already getting search hits for the subject above)

… 90 posts, quite a lot.
Is this something that would be fixed in RC.3, or is it just the way it is?

A bit like “healing receive”.

2 Likes

Additionally… how does this impact the 80% warning that TrueNAS likes to use to encourage you to extend your RaidZ pool… and the 90%/95% cliff?

2 Likes

I doubt that a change to RaidZ Space Accounting will be made between now and the Electric Eel release.

1 Like

From the original commit:

When the expansion completes, the additional space is available for use, and is reflected in the available zfs property (as seen in zfs list, df, etc).

It is not fully reflected.

RAIDZ vdev’s “assumed parity ratio” does not change, so slightly less space than is expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.

And the above does not seem to fully capture the issue, except that it does state that the “assumed parity ratio” does not change, and that is what is used to determine how much space is available from the set of disks… apparently.

3 Likes

I think that adding drives while keeping the files on the original drives is the problem.
After expansion, new files are distributed among all the drives, but the old files remain on the prior (original) drives.
I think that if the data were reshuffled among the drives after the new setup, even though slow, it would solve the problem.

If they don’t fix it to our expectations, I’ll become a dev myself. Give me some time and I’ll fix it. Say… ’bout 10 years? Yeah. :baby:

If it wasn’t clear, I did my tests with an empty pool.

The issue is (I believe) that the capacity is determined by multiplying the size of a vdev member by the number of members, and then by the original parity ratio (simplifying).
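
Something roughly equivalent to my test can be reproduced from the CLI with sparse files (illustrative paths, and not the exact steps I used, since I went through the GUI):

# hypothetical empty-pool reproduction with sparse file vdevs
truncate -s 1T /tmp/d1 /tmp/d2 /tmp/d3
zpool create test_pool raidz1 /tmp/d1 /tmp/d2 /tmp/d3
for i in $(seq 4 10); do
    truncate -s 1T /tmp/d$i
    zpool attach -w test_pool raidz1-0 /tmp/d$i    # raidz expansion, one disk at a time
    zfs list test_pool                             # AVAIL only grows by ~0.66T per added disk
done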

2 Likes

True. Hmm… hope they do something about it. All this is so above me… :man_shrugging:

If it were strange calculations by the middleware, it could be fixed at any time.
But this is deep in the ZFS code. However faulty it might be (hard-coded assumptions :scream:, including quite possibly a wrong one about block size?), it will take LOTS of time and effort to fix. Possibly on the same order of magnitude as raidz expansion itself.

3 Likes

Do the ZFS devs even use ZFS regularly? How could they give a nod to this new feature (for OpenZFS 2.3)[1] and think, “Yeah, this is fine”?

Did no one during development say, “Hey, guys. This doesn’t make sense and it might confuse the users”?

It’s such a glaring issue with pragmatic, real-world impact.

Love him or hate him, but having a “Steve Jobs” type of person at the helm bridges the gap between developers and end-users.


  1. Sure, OpenZFS 2.3 is still in its RC stage, as of this post, but it’s unlikely that a release-candidate will see such a correction before its stable release. ↩︎

1 Like

That’s more or less what Matthew Ahrens had to say (though he was talking about a slightly different part of this issue, I think):

…which is a project of similar scale to RAIDZ Expansion