[Accepted] Create 2 GiB buffer space when adding a disk

Problem/Justification

When replacing a drive in a vdev, the new drive, even one from the same manufacturer and capacity class, may be just a hair too small to fit the vdev.

Examples of users impacted by the recent change in SCALE 24.04.1 to remove the default 2 GiB swap partition, which acted as this buffer:

Impact

With a default buffer of 2 GiB, a slightly-too-small replacement drive can still be added; it simply ends up with a smaller buffer, or none at all, on that drive.

FreeNAS, TrueNAS CORE, and TrueNAS SCALE before 24.04.1 all did this, and it is a genuinely great quality-of-life feature for users. Please bring it back.

2 GiB was chosen because CORE created buffer partitions of exactly 2 GiB; SCALE before 24.04.1 created buffer partitions of 2 GiB + 512 bytes; and users report size variation of between 100 MiB and 1.1 GiB on “same capacity class” drives. 2 GiB is a safe buffer that continues a good tradition.

Edit:

After discussion, a good way to implement this may be to make the partitions on drives added to a new vdev 2 GiB smaller than the maximum the drive allows, without a dedicated “buffer partition”.

When replacing or adding to an existing vdev, use the same “2 GiB smaller than max” logic, and if that result is smaller than the smallest member in the vdev, make the partition as large as that member instead, so the replacement can still succeed.

No additional UI element is needed. Replacement would still show an error when the drive is genuinely too small, while the slight variations in drive size between revisions or manufacturers are absorbed by the 2 GiB buffer.

Gradually increasing the size of a vdev by replacing members with larger ones also still works as before.
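
A minimal sketch of this sizing rule in Python (hypothetical names and plain byte math, not the actual middleware code):

```python
GIB = 1024 ** 3
BUFFER = 2 * GIB  # proposed default buffer space


def data_partition_size(drive_size, smallest_member_partition=None):
    """Size of the ZFS data partition to create on a drive, in bytes.

    smallest_member_partition is the smallest existing data partition
    in the vdev, or None when creating a brand-new vdev.
    """
    target = drive_size - BUFFER  # default: leave 2 GiB unused

    if smallest_member_partition is None:
        return target  # new vdev: every member gets the full buffer

    if target >= smallest_member_partition:
        return target  # large enough to match the vdev and keep the full buffer

    if drive_size >= smallest_member_partition:
        # Slightly-too-small drive: shrink (or drop) the buffer so the
        # replacement can still succeed.
        return smallest_member_partition

    # Genuinely too small: same error the UI shows today.
    raise ValueError("drive is too small for this vdev")
```

This single rule covers all of the user stories below without any new UI.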

Graphics illustrating this mechanism: [Accepted] Create 2 GiB buffer space when adding a disk - #16 by winnielinnie

User Story

User has an existing vdev with 2 GiB buffer partitions (“swap” partitions) and wants to add or replace a drive: This method works. The new drive doesn’t have a buffer partition, but it has buffer space. Same difference as far as ZFS is concerned.

User has an existing vdev without any buffer space (post removal of this default, vdev created in SCALE 24.04.1 or later) and wants to add or replace a drive: No change to current functionality.

  • A drive that’s exactly the same size will work, and won’t have buffer space.
  • A drive that’s just a hair larger will work, and will have buffer space.
  • A drive that’s a hair too small continues to be too small. “The UX damage created by removal of default buffer space can’t just be un-created.”

User creates a new vdev: All drives have buffer space, making future replacement or addition easy. Some thought should be given to how the UI determines the partition size when the drives are all the same “capacity class” but of slightly varying size; however, the UI already handles that case today, and the same logic could continue to be used, just with a 2 GiB buffer in place.

User has an existing vdev and wants to add or replace a drive with a bigger one: This works; the new drive is added with 2 GiB of buffer space. Gradual replacement and expansion of a vdev through larger drives continues to work.

All stories I can think of for any vdev ever created in FreeNAS or TrueNAS are either improved or, where this is not possible (existing vdev without buffer and slightly smaller replacement drive), not worsened compared to the status quo.

10 Likes

It’s kind of weird having to make a feature request to bring back something sane and sensible. Yet here we are.

Here is an example of a user affected by this change in SCALE.

3 Likes

2GB or 2GiB? Don’t be shorting us those bits.

2 Likes

Imma drop this here.

https://ixsystems.atlassian.net/browse/NAS-128988

2 Likes

So don’t make them swap partitions. They could be type da, non-filesystem data.

Or make them swap partitions but don’t add them to fstab.

The point is having a buffer for drive replacement. Not having swap.

2 Likes

@Stux is pointing out when this change was made in SCALE. Prior to this, users of SCALE benefitted from the same buffer partitions as did Core users.

2 Likes

Ah got it. Well, throwing out buffer partitions because they happened to be swap partitions seems a little hasty :sweat_smile:. Maybe keep them and just don’t make them swap. Their point never really was to be swap.

2 Likes

Agreed.

1 Like

Agreed as well, and I voted. It doesn’t need to be a swap partition, or any partition at all; just create the drive’s ZFS partition approximately 2 GB smaller than a full single partition would be. I’m not even sure 2 GB is needed, but it seems a safe bet since it worked for over 10 years without issue.

2 Likes

I agree it’s a good idea, and I voted for it, but there were occasional complaints back in the day about that “wasted space.” Everything’s a trade-off.

1 Like

“Wasted space”, yes, I recall those days, and they are still happening. It took a while for me to see the light as well, but 2 GB from a 2 TB (the original drives on my FreeNAS machine) or larger drive is a small price to ensure the replacement would be accepted into the vdev.

Nowadays for me it is about the cost of storage, not the unused space. It is a mindset people have to overcome.

2 Likes

I like this. On replacement, make the partition at least as large as the partition on the smallest member of the vdev, and beyond that as large as possible while retaining a 2 GiB buffer.

That way “12 TB for 12 TB” works even if the replacement drive is a little smaller, and “24 TB for 12 TB” works and will, after all members have been replaced, increase the overall size of the vdev.

And it’d all work without needing an additional UI element for it.

3 Likes

For added context, here is something shared by @Whattteva.

From the FreeBSD Handbook, Chapter 20, “ZFS”.

The above applies to ZFS generally, rather than FreeBSD specifically.

4 Likes

I like this idea the best.

Now the question is, what about those on SCALE who already created their pools after this regression was introduced? I guess they’ll just have to deal with returning HDDs and trying to purchase from different vendors and brands to hopefully find a replacement drive that will meet the minimum required capacity?

EDIT: I suppose for such users, when the time comes to “replace” a failed drive, they might as well spend a little extra money to purchase a larger capacity drive. Replacing a 10-TiB drive? Purchase a 12-TiB drive, since you don’t currently have the assurance of a buffer partition.

3 Likes

Nothing TrueNAS can do here. This just shows how much the choice to throw out the buffer entirely, because swap got in the way, impacts the user experience.

If the newly created pool isn’t very full yet, a user can decide to move the data off, and recreate it with a buffer.

Beyond that, as you say, 12 TB gets replaced with 14 TB, if anyone at the org still remembers that’s prudent 6-10 years down the line when a disk fails :sweat:

1 Like

Would you say these graphics explain the process that TrueNAS should follow?

* Not drawn to scale


Legend:


New disks are purchased. They are all “12-TiB”.


When a pool/vdev is created, TrueNAS makes a single partition that is the total capacity of the drive minus 2 GiB. Since this is a brand new vdev, the buffer space is maxed out to 2 GiB.


After some time, one of these drives fails.


A new “12-TiB” drive is purchased and installed in the server. Even though it is a “12-TiB” drive, its capacity is slightly smaller than the existing drives. :warning:


It can still be used as a replacement drive, since there is enough buffer on the current drives. TrueNAS creates a partition on the replacement drive that is equal in size to the existing partitions. The leftover space is used as a buffer.


We throw a party to celebrate?
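
To put made-up numbers on the pictures above (purely illustrative, not real drive geometry): suppose each original “12-TiB” drive exposes exactly 12,000,000,000,000 bytes. At vdev creation, each data partition is made 11,997,852,516,352 bytes (12,000,000,000,000 − 2,147,483,648), leaving 2 GiB of buffer. If the replacement “12-TiB” drive turns out to be 500 MiB smaller, about 11,999,475,712,000 bytes, the existing 11,997,852,516,352-byte partition still fits on it with roughly 1.5 GiB to spare, so the replacement succeeds.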

4 Likes

14 votes as of this posting! Let’s see if a few more will be donated today.

@winnielinnie I like the graphic representation you put together and it is spot on! Is that a 3-drive RAIDZ2? :slight_smile:

Are you the one in the center of the photo? Ha Ha.

2 Likes

Three-way mirror.

I don’t deal with all the nonsense of RAIDZ and striped parity goofiness.

My PR on the OpenZFS GitHub, to disable the creation of non-mirror vdevs, keeps getting closed. They’ve already banned me and I’m running out of sockpuppet accounts. I must remain persistent for the sake of all ZFS users around the world.


Would have been nice if this miracle happened.

1 Like

Awesome graphic!

I’ll point out that you can’t buy 12 TiB drives. You can buy 12 TB drives, which is roughly 10.9 TiB, before ZFS overhead drops it a little further.

TB counts in powers of 10, because marketing and bigger number wins.

TiB counts in powers of 2, because the nerds eventually gave up the fight and said “ok FINE you can have your Terabyte as power of 10, we’ll just use Tebibyte from now on. So THERE”
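
A quick sanity check of that conversion (plain Python, nothing TrueNAS-specific):

```python
# A "12 TB" drive as marketed: 12 × 10^12 bytes.
marketing_bytes = 12 * 10**12

# One TiB is 2^40 bytes.
tib = marketing_bytes / 2**40
print(f"{tib:.2f} TiB")  # prints 10.91 TiB
```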

2 Likes

The marketers steer the storage manufacturers.

Imagine if RAM was sold as “GB” instead of “GiB”.

2 Likes