RAIDZ2 Expansion: Unexplained 10TB Space Loss & 2x Overhead

Hi everyone,

I’ve encountered a massive, unexplained loss of available space immediately after expanding a RAIDZ2 vdev from 4 to 5 disks. The physical space allocated (ALLOC) is now almost exactly double the logical data size (USED), and I’ve worked through and ruled out all of the usual culprits.

The key detail is that the pool’s capacity reporting was perfectly normal and correct before the expansion began.


The Setup

  • Pool: A single raidz2-0 vdev with a special vdev for metadata.
  • Initial State: raidz2-0 with 4 x 20TB HDDs.
  • Goal: Expand the vdev to 8 disks by adding the 4 new disks one at a time using zpool attach.
  • Data: 10.3 TB, mostly large media files.

The Problem: Before vs. After

1. Before Expansion (4-Disk RAIDZ2 - All Normal)

With 4 disks, the math was correct.

  • Total Usable Capacity: (4 disks - 2 parity) * 18.2 TiB/disk = 36.4 TiB
  • Used Space: 10.3 TiB
  • Available Space: 36.4 TiB - 10.3 TiB = ~26.1 TiB. The output of zfs list correctly reflected this available space. Everything was as expected.

2. After Adding the 5th Disk (The Problem Appears)

I initiated the expansion with sudo zpool attach mediapool raidz2-0 /path/to/5th-disk. After the resilver finished, the numbers were drastically wrong.

  • Expected Usable Capacity: (5 disks - 2 parity) * 18.2 TiB/disk = 54.6 TiB
  • Expected Available Space: 54.6 TiB - 10.3 TiB = ~44.3 TiB
  • Actual Available Space: The zfs list command now shows only 33.9 TiB available—a loss of over 10 TiB.
~ $ zfs list
NAME              USED  AVAIL  REFER  MOUNTPOINT
mediapool        10.3T  33.9T   118K  /mediapool
mediapool/media  10.3T  33.9T  10.0T  /mediapool/media
~ $ sudo zpool list -v mediapool
[sudo] password for utk:
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mediapool                                  92.8T  20.7T  72.1T        -         -     0%    22%  1.00x    ONLINE  -
  raidz2-0                                 90.9T  20.7T  70.3T        -         -     0%  22.7%      -    ONLINE
    ata-ST20000NE000-3G5101_WVT            18.2T      -      -        -         -      -      -      -    ONLINE
    ata-ST20000NE000-3G5101_WVT            18.2T      -      -        -         -      -      -      -    ONLINE
    ata-ST20000NE000-3G5101_WVT            18.2T      -      -        -         -      -      -      -    ONLINE
    ata-ST20000NE000-3G5101_WVT            18.2T      -      -        -         -      -      -      -    ONLINE
    ata-ST20000NE000-3G5101_WV             18.2T      -      -        -         -      -      -      -    ONLINE
special                                         -      -      -        -         -      -      -      -         -
  mirror-1                                  1.81T  7.66G  1.81T        -         -     0%  0.41%      -    ONLINE
    ata-Samsung_SSD_870_EVO_2TB_S           1.82T      -      -        -         -      -      -      -    ONLINE
    ata-Samsung_SSD_870_EVO_2TB_S           1.82T      -      -        -         -      -      -      -    ONLINE

The Investigation & What’s Been Ruled Out

The zpool list -v output above shows that the raidz2-0 vdev has 20.7 TiB physically allocated for only 10.3 TiB of logical data. This 2x overhead appeared during the expansion.

Here is everything we have checked and ruled out:

  • ashift Mismatch: Confirmed correct.
~ $ zdb -C | grep ashift
      ashift: 12
      ashift: 12
  • Small recordsize: Confirmed it’s the default of 128K.
~ $ zfs get recordsize mediapool/media
NAME             PROPERTY    VALUE    SOURCE
mediapool/media  recordsize  128K     default
  • copies=2: Confirmed it is the default of 1.
~ $ zfs get copies mediapool/media
NAME             PROPERTY  VALUE    SOURCE
mediapool/media  copies    1        default
  • Logical vs Used Space: Confirmed there are no discrepancies from compression or copies.
~ $ zfs get logicalused mediapool/media
NAME             PROPERTY     VALUE   SOURCE
mediapool/media  logicalused  10.3T   -
  • special vdev: Confirmed it’s barely used and not the cause.
~ $ zpool list -v mediapool
NAME                                          SIZE  ALLOC   FREE 
...
  special
    mirror-1                                 1.81T  7.66G  1.81T 

The Question

Has anyone ever seen behavior where a raidz2 expansion (zpool attach) triggers a massive, uniform 2x storage overhead? The fact that everything was normal before the expansion makes me think this could be a bug or a very obscure calculation related to the resilvering/rebalancing process itself.

Any ideas or insights would be hugely appreciated. Thank you!

Did you read through this post?

TLDR:
There’s a bug with space reporting


Hey LarsR,
Thanks for the quick response. I’m reading through the post now.

To clarify, is this bug related to ZFS?

Additionally, is this just a reporting bug, or should I avoid VDEV expansion altogether until a fix is available? That is, is the problem much deeper than the space reporting alone? :thinking:

As far as I can remember, it’s a consequence of how raidz expansion currently works, but I don’t know the exact technical details. I don’t plan on using raidz expansion anytime soon, so I just skimmed over the explanations… And as far as I can remember, the ZFS devs are working on an improvement to the raidz expansion implementation, but I can’t say when it will land, or whether it will remove the problem entirely or just mitigate it to some degree…

I gathered that it is a reporting issue, so the entire storage is usable and accessible, but we lose any reliable way to see how much space is actually free.

Basically, for all practical purposes, VDEV expansion is broken.

I now have to recreate the zpool from scratch!


What happens if you use zfs rewrite?
Does that “fix” the issue?

It allows you to recover some capacity by rewriting the data with the new parity-to-data ratio, but it will keep reporting the wrong used/free space.
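
For reference, the rewrite invocation being discussed would look something like this. This is only a sketch, assuming an OpenZFS build new enough to ship the zfs rewrite subcommand and that it accepts a recursive -r flag, so check your local man page first:

~ $ sudo zfs rewrite -r /mediapool/media   # rewrite existing blocks in place so they use the new, wider stripe layout
~ $ zpool list -v mediapool                # ALLOC on raidz2-0 should drop afterwards, but AVAIL is still derived from the old ratio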


The extra space is available, so “for practical purposes” raidz expansion does (about) what it is intended to do; only the reporting is borked. Do not expect a fix.
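
You can roughly sanity-check that against the zpool list -v numbers above. A back-of-envelope sketch, assuming the free-space estimate is still being deflated with the original 4-wide ratio (2 data disks out of 4) rather than the new 5-wide ratio (3 out of 5):

~ $ echo '70.3 * 2/4' | bc -l   # old ratio: ~35.2 TiB usable out of the 70.3T raw FREE on raidz2-0
~ $ echo '70.3 * 3/5' | bc -l   # new ratio: ~42.2 TiB, much closer to what was expected

Knock roughly 3% off the first figure for the pool’s internal slop reservation and you land right around the 33.9T AVAIL that zfs list shows, so the raw space is all there; it is just being converted with the pre-expansion ratio.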

With “only” half a drive’s worth of data, the good old way of “backup-destroy-restore” is your best bet. Take this as an opportunity to rethink whether the special vdev is useful, and whether it might be replaced by a persistent L2ARC for metadata (which speeds up reads, not writes, BUT is not critical for the pool).
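
If the pool does get rebuilt and you want to try the metadata-only L2ARC route instead of a special vdev, the setup is just a cache device plus one property. A sketch with a placeholder device path; persistent L2ARC has been the default since OpenZFS 2.0, so the cache survives reboots:

~ $ sudo zpool add mediapool cache /dev/disk/by-id/ata-Samsung_SSD_870_EVO_2TB_XXXXXXXX   # placeholder device id
~ $ sudo zfs set secondarycache=metadata mediapool                                        # cache only metadata in L2ARC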


I only just added a fifth 10TB drive to my Z2 vdev and noticed that I was shortchanged about 5TB in the process. What I did witness is that when I rewrote about 300GB of data from one folder within the dataset, the usable capacity went up, and then the moment I deleted the original folder, the usable capacity went back down again!

I’m waiting for Goldeye to rebalance the rest of it, but it is interesting how complex a problem this is. In any case, the fact that I saw the usable capacity increase after throwing in more data gives me comfort that the alerting won’t get annoying when more data is added to these drives in the short term. :slight_smile:
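
For anyone following along, the per-folder rebalance described here is essentially the old copy-then-delete trick, which forces the data to be rewritten with the new, wider stripe layout. A rough sketch with made-up folder names; verify the copy before deleting anything:

~ $ cp -a /mediapool/media/some-folder /mediapool/media/some-folder.new   # the copy gets written with the new layout
~ $ rm -rf /mediapool/media/some-folder                                   # only after verifying the copy
~ $ mv /mediapool/media/some-folder.new /mediapool/media/some-folder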