24.10 RC2 Raidz expansion caused miscalculated available storage

winnielinnie · October 18, 2024, 3:28pm

That’s a different issue Matthew Ahrens is referring to.

There are two things going on, one of which affects end-users in a more obvious manner.

The “more efficient” storage of new blocks written after RAIDZ expansion, of which existing blocks do not benefit. (This is what Matthew is referring to, which involves the discussion around “rebalancing” and “in-place rewriting of existing data”.)
ZFS displays wholly inaccurate (and honestly, comical) space and capacity info about a pool. (This was demonstrated by @Stux earlier in the thread.)

It’s point #2 that I view as a glaring issue, and it’s not a good look for ZFS.

ericloewe · October 18, 2024, 3:29pm

What am I missing here?

The one thing ZFS does know exactly, in terms of space, is how much space is used.
Knowing how much space is available should just use the same approximations and the same math as it has for a long time, but with the updated stripe width rather than the old one (old data is written anyway and doesn’t factor in, new data takes advantage of the new ratio).

So this really should not be happening, according to my understanding.

winnielinnie · October 18, 2024, 3:32pm

Branching off from here.

@Roveer, can you share this information?

The following commands will help shed some light:

zfs --version

zpool list TRUENASPOOL

zfs list -o space TRUENASPOOL

Furthermore, was this pool created “all-at-once”, or were vdevs added later?

DjP-iX · October 18, 2024, 3:37pm

That’s what I was glossing over by calling it a different part of the issue, but yeah it seems like there are two different issues that would require a significant amount of development to fix

winnielinnie · October 18, 2024, 3:42pm

@Hittsy

Can you remove the “Solution” from this thread?

It’s not really a “solution”, per se, since it basically acknowledges “yes, this is an issue, no longer a mystery”.

The thread has evolved into more of a discussion and exploration about this phenomenon with OpenZFS 2.3 (and maybe RAIDZ in general).

Talk about making quite the splash with your very first post on the forums!

Hittsy · October 18, 2024, 3:45pm

Whynot; I still can’t trust any numbers beyond the used capacity % on the storage dashboard.

etorix · October 18, 2024, 4:14pm

Not being familiar with ZFS code, I can only offer speculation, but merely reporting used/available space, on existing data, should be entirely different from rewriting data in-place.

Why is ZFS sticking to the parity:data ratio the vdev was created with in the first place?
One would think that following expansion the current value would be used for calculations, in particular calculations involving free space.

sfatula · October 18, 2024, 5:47pm

Wouldn’t it still be wrong if it used the current parity:data ratio as all of the old data would be at the old ratio unless all rebalanced? I would presume on average, pools expanded are pretty full so the effect is lessened on mostly full pools.

Stux · October 18, 2024, 6:19pm

Capacity vs Allocated.

Free space should be capacity - allocated

But it has to approach zero as it fills. So the math gets tricky.

It seems to me that there needs to be an update to the capacity logic. OR a new type of capacity “expandcapacity” etc.

PhilD13 · October 18, 2024, 6:26pm

I would agree. The old data unless rebalanced would remain calculated as was using the old parity:data ratio. The new data written would be based on the new parity:data ratio of the expansion. The reporting should reflect that in making an entry in reporting that reflects both old parity:data ratio and new data:parity ratio.

Krill · October 19, 2024, 10:40am

Oh this is exciting

Stux, do you have any plans to test the actual capacity limit of an expanded dvdev? I reckon that iX are likely testing these limits right now, wouldn’t do for an enterprise customer to lose production data so it has to be done before release…but I doubt that all of these details would be made public.

On a more pertinent note, this is likely going to be acknowledged as a bit of a debacle in the months to come, with no quick fix, so a question for those of us who have already used the expansion feature: are we stuck with the old choice of back up/destroy/recreate pool answer to get data back to a safe place? (I believe that we cannot assume data on an expanded vdev is safe at this point in time, but it would be really nice if I was wrong on that.

ericloewe · October 19, 2024, 1:51pm

I have seen no indications whatsoever that the feature is unsafe (I’m personally hesitant on the risk/benefit at this point, but that’s not based on anything other than it being new).

The reality, without wanting to excuse potential bugs in either the logic or implementation of the post-expansion space accounting, is that ZFS’ estimate of available space has always been a rough indication, because it is literally impossible to predict how much we will be able to fit. But that has never been a reflection of possible dangers to data.

winnielinnie · October 19, 2024, 2:36pm

Agreed. This is not “unsafe” on the surface. But for a filesystem that boasts resiliency, stability, and flexibility, it’s just silly to not even display a pool capacity that reflects a fairly accurate size. (Because it’s using a “pre-expansion” calculation, it’s okay for it to lie about how big your pool really is? Sure, maybe for a new filesystem being developed, or some beta that is experimenting with a new feature, but for a stable release of a “mature” filesystem?)

And while it’s not outright “unsafe”, it affects how the user stores and manages their data. Imagine that you believe there is only 2 TiB available on your pool (when in reality you actually have 7 TiB usable space remaining.) Would that not play a factor in your purchasing decisions to expand your pool further? In your pruning decisions? In your snapshot management?

This issue is more than just a poor estimate.

I can understand imperfect displays of “total pool capacity” and “available space”, due to padding, reserved metadata, checksums, predictable parity (unavailable for user data, and hence not “usable”), inline compression, and etc. Basically, things that can amount to “rounding errors”.

But this issue, as demonstrated by @Hittsy and @Stux is just outright wrong, on the scale of almost an entire 18 TiB’s worth of usable storage space. (Depending on the underlying member drive capacities.) There’s really no excuse for it. We can’t expect the user to “calculate the real numbers” in their head.

When it comes to things like inline compression (which can indeed offer “more space”, depending on what types of files you are saving), the total pool capacity is a known amount, and remains as such. Even the available remaining space is fairly accurate. (You might have 2 TiB of “available space”, and decide to save 1 TiB of highly-compressible files. These files end up only consuming 500 GiB on the pool. Predictably, you still have 1.5 TiB of available space, and your pool’s total capacity remains consistent. Any “discrepancy” can amount to a rounding error, which is acceptable for most users.)

While RAIDZ space calculation isn’t a precise estimate, you at least get an idea of “This is how big my pool is.” However, after you expand your RAIDZ vdev you’re expected to carry a calculator with you to understand how big your pool really is, let alone how much available space remains.

Just like “corrective receives” introduced with OpenZFS 2.2, this new feature introduced with 2.3 seems unfinished but “good enough”, where the developers just figured “It does what it needs to do.” It carries a degree of “hobbyist” project.

It doesn’t come off as professional quality.

raidz-survival-pack

winnielinnie · October 19, 2024, 4:58pm

I should also add that this has implications for how much a file supposedly consumes on a pool.

The readout, whether from a file browser or from ZFS itself, is outright misleading.

See posts 43, 46, 52, and 60.

argumentum · October 19, 2024, 5:01pm

pic. of 4x4tb-z2-no compression

pic. of 8x4tb-z2-no compression, expanded

pic. of 8x4tb-z2- ReDo

Did the footwork without compression to see if calculated expansion properly and, nothing. ( all this in a VM with dynamic disks don’t take much space so anyone with a hypervisor can test these out )

To me the calculations should be with a compression of 1.0 in mind for the total available, and subtract the physical space ( that is known given that the system tells the compression ratio ).

I don’t see a reason for the miscalculation hence I’ll take this new feature as immature for now, no matter what the integrators ( iX ) may say.

I don’t know how to make memes but picture the “change my mind” one.

winnielinnie · October 19, 2024, 5:09pm

Isn’t that hilarious?

With an empty pool, you either have a total usable capacity of 15.73 TiB or 22.61 TiB.

It’s Schrödinger’s ZFS!

User: “Does my 8-wide RAIDZ2, comprised of 4-TiB drives, yield approximately 15 TiB or 22 TiB pool capacity?”

ZFS: “Yes.”

buedi · October 19, 2024, 7:32pm

I find this thread very interesting and… concerning, so I had to do some testing too before buying the drivers for my Terramaster F8 SSD Plus that just came in today. Since I do not have the final drives yet and I do not have the big disks / space as you people seem to have, I installed XCP-ng on it and TrueNAS in a VM.

First, I wanted to see if all those numbers change with the size of the Disks. I made tests with 1 GiB virtual Disks and 1 TiB virtual Disks and the numbers are exactly the same, it is just the suffix that changes. With that out of the way, I did my tests on 1 GiB virtual disks. Tests are quicker this way and less demanding on the old NVMe I put into the NAS for now.

I wrote data with “dd -if=/dev/urandom -of=/mnt/pool1/test” right after creating / expanding the pool. When expanding, I let a scrub run first.

The tests below were to see how much space is available in these conditions:

Mounting just a single 1 GiB Disk as a “stripe”. Does this yield 1 GiB or not?
Creating a 3-wide RaidZ1 volume (3x1 GiB)
Expanding the 3-wide RaidZ1 with another 1 GiB disk
Creating a 4-wide RaidZ1 volume (4x1 GiB)

Setup	zfs list avail before	dd	ls -lsa	ls -lsa	du	du -h
Single 1GiB	828 M	827 MiB	867435008	827 MiB	847741	828 M
3wz1	1.70 G	1.7 GiB	1828454912	1.70 GiB	1785621	1.8 G
3wz1exp+1	2.37 G	2.6 GiB	2775581184	2.58 GiB	2484021	2.4 G
4wz1	2.59 G	2.6 GiB	2788426240	2.59 GiB	2722384	2.6 G

It seems no matter what zlist / dd shows, I was able to write the whole 2.6 GiB. Why I do not get 3 GiB in a 4-wide RaidZ1 with 4x1 GiB drives or the full 1 GiB in a 1-wide stripe is something I do not understand yet, but I am new to ZFS and assume it is just like that because of internal shenanigans, reserved space for metadata etc.
But I was able to see that no matter if I expanded the pool or created in the right width from the beginning, I was able to write the same amount of data. Since this is /dev/urandom I am not able to check if the data is correctly written, but until proven otherwise, I suppose it is

An integral part for me for a file system is also to know how much space files take and how much space is available. I hope this is something that can be fixed in the near future. Currently I am reconsidering if I want to really start with 3x4 TiB NVMes in my new NAS and expand it later or just put in 4 and have the space correctly calculated. When that space is nearing full, and prices are down, get another set of 4 and fill the NAS up with another Pool which shows the correct numbers.

winnielinnie · October 19, 2024, 7:43pm

According to the third row, if you are to believe the numbers are true, you were supposedly able to:

Write a 2.6-GiB file (urandom) onto a pool with 2.4 GiB capacity… and then “doh”, this non-compressible 2.6-GiB file magically only consumes 2.4 GiB of storage space.^[1]

This same phenomenon was shown by @Hittsy with video files:

If you repeat the test, but have the urandom file set to a static size, such as 1 GiB, you’ll clearly see this in action. Rows 1, 2, and 4 will reflect its storage usage correctly as 1 GiB. While row 3 will incorrectly claim its storage usage is notably smaller.

dd if=/dev/urandom of=/mnt/pool1/test bs=1M count=1024

It’s why @Hittsy’s 102-GiB UHD 4K video file supposedly consumes only 72 GiB of storage space.

Spoiler alert

It does not. ZFS is presenting patently wrong information.

Since you were only testing with 1-GiB virtual disks, imagine the difference when you’re dealing with 18-TiB HDDs. ↩︎

Stux · October 21, 2024, 5:34am

Here is the relevant explanation from Mathew Ahrens explaining the issue, the problem is that when old blocks are freed they need to use the old deflation ratio to prevent space accounting over-accounting for freed space.

This means that it would be possible to solve using a “time dependant deflate ratio”, which was not implemented due to lack of time.

In the future, “RAIDZ Expansion improved space accounting” could be implemented, which would resolve the issue, and he thinks “that it shouldn’t be much more work to do this as an extension”

github.com/openzfs/zfs

Comment by ahrens - RAIDZ Expansion feature

openzfs:master ← ahrens:raidz-expand

@zack2491 thanks for giving this a try! Unfortunately we can't just change the …deflation ratio, because the space accounting already takes into account the old deflation ratio. So for example, when a block is freed, if the deflation ratio is different from when it was allocated, we would decrement the wrong amount of space from the dnode, dataset, and dsl_dir's counts of deflated space used. It might be possible to extend the RAIDZ "time dependent geometry" concept to also do "time dependent deflate ratio", so we could know based on each block's birth time whether it was accounted under the old vs new deflation ratio. There are a few other places that use the deflation ratio that would also need to be investigated. Unfortunately I don't have much time left to spend on this project so I would have to leave this for a follow-on project. I think that it shouldn't be much more work to do this as an extension, where we have RAIDZ Expansion for a while and then we add "RAIDZ Expansion improved space accounting", and blocks written after that point would have the improved accounting (that is, assuming it's possible at all). I can definitely update the manpage to be a little more strongly worded in terms of the potential space accounting surprise.

Captain_Morgan · October 21, 2024, 9:51am

It’s been an interesting discussion on what happens immediately after a RAID-Z expansion. That problem is well understood now.

Can anyone verify what happens after the expanded pool starts to fill up. Does the space estimation improve or get worse?

Can anyone write a script with an algorithm that provides a better approximate estimate of the current space available? That would be useful and might resolve the issue faster.