24.10 RC2 RAIDZ expansion caused miscalculated available storage

Isn’t ZFS wonderful? :innocent:

So if my pool shows “73 GiB available space remaining”, I could still save a 103-GiB video file, if I wanted to.

Don’t ask how! Something to do with parity, calculations, block-striping, vdev width, new vs old math, “compression” (for a video file, of all things), and… good ol’ ZFS magic!

It’s actually very intuitive.

2 Likes

Video files do not compress like that. It’s not possible.

2 Likes

This is all because ZFS does not want to recalculate parity data (which I forced it to do with the funny script anyhow), so ZFS has to use old values when calculating available storage, leaving my pool in a zombie state where who knows how much space anything actually takes up.
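For anyone who hasn’t seen it, the “funny script” approach is basically just rewriting every file in place, which forces ZFS to re-allocate the blocks at the new data-to-parity ratio. A minimal bash sketch of the idea; the path is a placeholder, and the real community scripts add safety checks (checksum comparison, hardlink/snapshot handling) that this leaves out:

# rewrite each file in place so its blocks are re-allocated at the new stripe width
find /mnt/tank/mydata -type f -print0 | while IFS= read -r -d '' f; do
    cp -a "$f" "$f.rebalance" && mv "$f.rebalance" "$f"
done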

2 Likes

Then it sounds like “RAIDZ expansion” isn’t a fleshed-out feature. It does what it’s meant to do… and that’s it. It stops there. From a developer’s and engineer’s perspective, the work is finished. No follow-up on real-world usage.

From an end-user’s perspective, it’s important (and expected) to be able to intuitively and pragmatically understand how much space is being consumed by a file, how large the pool’s total capacity is, and how much free space is available.

A 30% discrepancy is not acceptable.
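To be fair, all three of those numbers can at least be pulled straight from the shell, which makes the discrepancy easy to demonstrate; pool and dataset names below are placeholders:

zpool list tank                                 # raw capacity, parity included
zfs list -o space tank                          # usable space as ZFS accounts for it
zfs get used,available,referenced tank/mydata   # per-dataset view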

4 Likes

Right, but as I said elsewhere, this is the first release with this feature, and I personally wouldn’t use it until a few ZFS versions later to let the bugs be ironed out. Same reason I won’t install Eel 24.10.0. It is not surprising at all to me that there may be issues, possibly even data loss. Just my opinion.

1 Like

Yep, exactly: I made a ZFS target with cheap SAS drives, replicated over just the data I actually care about, destroyed the 5-wide and made an 8-wide, then replicated the other way.

Between replication and burn in it took about 2 weeks.

The RAIDZ expansion route looks more comfortable, and doesn’t need a second system to replicate to. If/when the display oddities are fixed, it should be solid. Another 3 years? :sweat_smile:

1 Like

That’s the equivalent of one week in “GIMP 3 development time”.

What’s going to arrive first? The “Year of the Linux Desktop” or the “Year of GIMP 3’s Release”?

3 Likes

If there were data loss with basic usage of the system at this point in the development, I would be very concerned. For me it’s more of an annoyance that storage usage and availability aren’t reported correctly. As long as 24.10.0, or an iteration or two later, fixes the issue, the tradeoff is worth it for me to get the extra space from RAIDZ expansion. Yes, it is annoying that it appears there’s less space available, but unless it’s a production system I’m not worried about it.

If I had the drive space to do this with ~15 TB, I would have done it this way. That being said, I’m not buying extra drives just to temporarily copy a bunch of data to them before recreating the new pool. This becomes even tougher as data usage increases.
Also, please let it be MUCH less than 3 years 😅

I don’t work in tech, but if I reported that something was using X amount of a resource when I knew it was actually using Y amount, and it caused a negative outcome, I’d expect to get fired.

OTOH, the reality is that no boss of mine would ever figure out what had gone wrong, so :person_shrugging:

2 Likes

So… what ya do, is you work out how many raidz stripes are filled… then you multiply that by the original raidz width… and now you know the apparent size.

Or something dumb like that.

Does ZFS not track the parity ratio of every file written since the expansion finished, well enough to account for those files at the new ratio?

Shouldn’t the new parity ratio be used when working out the available size?
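For illustration, here’s roughly what the old-versus-new ratio difference looks like for a single chunk of data; toy numbers only, and just my understanding of the accounting:

# On the original 3-wide RAIDZ1 (2 data + 1 parity), 1 GiB of data occupies
# about 1.5 GiB of raw space; blocks written before the expansion keep that layout.
echo "scale=2; 1 * 3/2" | bc   # 1.50 GiB raw at the old 2+1 ratio
echo "scale=2; 1 * 5/4" | bc   # 1.25 GiB raw if the same data were rewritten at 4+1 (5-wide)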

1 Like

It’s actually not that complicated. There’s a quick formula you can use to demystify how much total capacity you actually have and how much space is available.

Try a 2wZ1 to 10wZ1 expansion…

2 Likes

The correct implementation in TrueNAS should have been to expand the vdev, run the scrub, and then immediately run the rebalance script (not as a random user, but as part of the expansion process), and then force the recalculation of pool size.
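In shell terms, the sequence being suggested is roughly the following; pool and device names are placeholders, and the rebalance step is the community rewrite-in-place script rather than anything iX ships today:

zpool attach tank raidz1-0 /dev/disk/by-partuuid/NEW-DISK   # kick off the RAIDZ expansion
zpool status tank   # wait until the "expand:" line reports the reflow as finished
zpool scrub tank    # then scrub to verify the pool post-expansion
# ...and only then rewrite the existing data (the "rebalance" step) so old blocks
# pick up the new data:parity ratio before the pool is handed back to the user.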

Now someone will come along and say “but we couldn’t do that”. This might be a ZFS expansion issue, but it’s solvable for the appliance OS that iX is building.

And if the rebalance script (or an implementation created by iX) still has this capacity-accounting issue, then I think iX has a big problem with a major part of this SCALE release from a marketing perspective.

1 Like

Regarding “allocation size” in Samba: this is what is happening. ZFS gets a count of 512-byte blocks for the specified object and returns it as st_blocks in the stat(2) output. Samba and other applications multiply this by 512 to determine the “allocation size”, as opposed to st_size.

So no Samba bug here.

That said, ZFS space accounting is complicated.
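For a concrete check, both numbers are visible from the shell with GNU stat; the path is just an example:

stat -c 'st_size=%s  st_blocks=%b  block_size=%B' /mnt/tank/video.mkv
# "allocation size" as reported over SMB = st_blocks * 512; on RAIDZ that figure
# includes parity and padding overhead, so it can sit well above st_size.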

1 Like

Understatement of the century.

2 Likes

I’ve just created a VM with EE RC2 installed… and ten 1 TiB virtual disks.

Created a 3wZ1 pool, then extended twice…

root@truenas[/home/truenas_admin]# zpool status test_pool
  pool: test_pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:00 with 0 errors on Thu Oct 17 20:27:26 2024
expand: expanded raidz1-0 copied 6.02M in 00:00:01, on Thu Oct 17 20:27:26 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        test_pool                                 ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            115cd1b7-a793-414e-a129-45e2b957b1bd  ONLINE       0     0     0
            3e4a8cf3-9b3e-4e62-86e8-130ef583a51c  ONLINE       0     0     0
            c68390ef-3625-42ba-a584-aa9d6f6b3516  ONLINE       0     0     0
            d3cc02b6-de80-4c5e-a53a-1ff18ddb6e58  ONLINE       0     0     0
            e224f009-c449-4ca9-852c-c792f6e86a0e  ONLINE       0     0     0

errors: No known data errors

Okay, that looks fine.

root@truenas[/home/truenas_admin]# zpool list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  4.98T  5.02M  4.98T        -         -     0%     0%  1.00x    ONLINE  /mnt

Okay, 5w so 5T “free”

root@truenas[/home/truenas_admin]# zfs list test_pool
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.32M  3.22T   128K  /mnt/test_pool

But you’d expect to see 4T available…
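For what it’s worth, that 3.22T lines up almost exactly with the old 3-wide ratio being applied to the new raw size, rather than the 4/5 you’d intuitively expect; back-of-the-envelope numbers, not an authoritative explanation:

echo "scale=2; 4.98 * 2/3" | bc   # 3.32T at the old 2-of-3 data fraction; minus ~3% slop ~= 3.2T (what zfs list shows)
echo "scale=2; 4.98 * 4/5" | bc   # 3.98T at the new 4-of-5 data fraction (what you'd expect)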

2 Likes

Extended to 6w

root@truenas[/home/truenas_admin]# zpool list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  5.98T  5.41M  5.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
root@truenas[/home/truenas_admin]# zfs list test_pool
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.51M  3.86T   128K  /mnt/test_pool

Adds another two-thirds of a TiB, which matches the original 3-wide vdev’s data fraction (2/3).

Add another disk… now 7w

root@truenas[/home/truenas_admin]# zpool list test_pool && zfs list test_pool
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
test_pool  6.98T  6.30M  6.98T        -         -     0%     0%  1.00x    ONLINE  /mnt
NAME        USED  AVAIL  REFER  MOUNTPOINT
test_pool  3.82M  4.53T   128K  /mnt/test_pool

Again, only ~0.67 TiB gained (two-thirds of a TiB)… it should be closer to 1 TiB.
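Same pattern on each step: every added 1 TiB disk is still counted at the original two-thirds data fraction rather than at the vdev’s actual width; again, just my arithmetic:

echo "scale=2; 1 * 2/3" | bc   # ~0.67 TiB gained per added disk at the old ratio (what zfs list shows)
# at the real width, nearly the full 1 TiB of each added disk should become usable space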

1 Like

Which doesn’t even make sense, since you didn’t write much (if anything) to the pool, even before expanding the RAIDZ1 vdev.

That takes “rebalancing” completely out of the equation.

So is ZFS “lying”? Can you, as the user, take the values and capacities it reports at face value?

Rebalancing has nothing to do with it.

1 Like