LARGE Pool Data Balancing After Expansion

Yes.

Yes.

Default. 128k

CAP: 89% FRAG: 0%

This is part of your write performance issue. As a zpool fills, it takes more and more work (and time) to find a place to put new writes (since every write is a new write due to the copy-on-write design). Somewhere in the 70% to 90% full (CAPacity) range you will start seeing write performance degradation. The exact threshold depends on a number of factors, such as your typical write size.
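
If you want to keep an eye on this while you add space, zpool list shows the CAP and FRAG figures quoted above, and the -v flag breaks the allocation down per top-level vdev (the pool name below is just a placeholder):

    # Pool-level and per-vdev allocation, capacity and fragmentation
    # ("tank" is a placeholder pool name)
    zpool list -v tank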

You may also be able to improve write performance by tuning the recordsize to better match your workload. But remember that recordsize is not like a conventional filesystem block size. ZFS is a variable block size filesystem. recordsize sets the upper limit for block size and the dirty buffer threshold: when the amount of writes queued in the ARC equals the recordsize, or after 5 seconds, they are written out to disk all at once. With HDDs this made the I/O to disk much faster, as a HDD is much better at writing a 128KiB block of data than a bunch of smaller blocks. SSDs vary in terms of how much faster they are for large sequential writes compared to smaller writes.

Also note that a change to recordsize (which you can make on the fly) will only affect data written from that point in time forward. Data written prior will still be in blocks no bigger than 128KiB. With your workload, writes of 1GiB or more, I would certainly try tuning recordsize up to the maximum of 1024KiB (1MiB). The downside is that you will see slightly less free space, as some padding to recordsize does occur.
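
For what it's worth, the change itself is a one-liner from the shell; the dataset name below is only an example, and the TrueNAS GUI exposes the same property in the dataset settings:

    # "tank/media" is an example dataset name - adjust to your layout
    zfs get recordsize tank/media     # show the current value (128K by default)
    zfs set recordsize=1M tank/media  # only blocks written after this are affected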

But … as soon as you add the new storage and your CAP drops from 89% you should see a very large improvement in write performance.

After you asked about it, I was reading about recordsize, and what you said mirrors what I read. So I will probably turn it up to 1MiB, since 95% of what is on the pool is large files.

Thank you everyone for the input. I think, just for my confidence, I am going to offload the data onto a second array, expand the pool, and then load the data back.

I have 100% confidence this will go off without a hitch, which I can't say for the script.

You don’t need to move the data to another pool, moving it to another dataset is sufficient.

Doing that on the existing pool will not work as well as offloading it. His existing vdev is already at 90% capacity. Adding the second vdev (0% capacity) and moving the data to the new dataset will just mean ~10% of each file goes to the existing vdev and ~90% goes to the new vdev (rough estimates for illustrative purposes).

Moving all data completely off the pool and back, he’ll get a sweet 50/50 distribution between vdevs and optimal performance. A bit of a pain, but I think worth it in the long run if you have the storage to offload to.

I keep hoping that one day ZFS will be able to redistribute in place when adding new vdevs… It seems like, when adding an empty new vdev, it would be (relatively) easy to move some existing data from the existing vdev(s) to the new one. Adding a vdev with a different number of disks/size of disks/raid type would complicate things, but that’s not best practice anyway.

Does it base the split on available percentage?

A thought I had: say I have about 10 TiB left on the existing drive set, and I create a new dataset AFTER I add the expansion. If I copy 20 TiB from the old dataset onto the new dataset, would it in theory split 10 and 10 between the old and new drive sets? Or would it be more like 18 and 2, if it is based on percent available?

I don’t know the exact calculations done behind-the-scenes, but yes ZFS does try to fill disks at the same rate. I believe it’s based on percentage of capacity instead of physical capacity, which is why it’s not recommended to mix vdevs of different capacity.

New dataset vs. old dataset doesn’t affect how the data is distributed; it still won’t be 50/50, it’ll be split based on the available capacity of each vdev. The reason ‘new dataset’ was recommended for rebalancing in place is that if you move data around within the same dataset, only the index is updated and the physical data isn’t actually rewritten. However, if you move a file from one dataset to another, the data is physically rewritten – similar to moving data between different volumes/disks. This is why cut/paste or moving a file within the same dataset is ‘instant’, and it’s slower to move from one dataset to another.
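
In practice, the in-place version of that is just a copy into a sibling dataset, something along these lines (dataset names are placeholders, and rsync is only one way to do the copy):

    # Sketch only - "tank/media" and "tank/media_new" are placeholder names
    zfs create tank/media_new
    rsync -a /mnt/tank/media/ /mnt/tank/media_new/   # crossing datasets forces a physical rewrite
    # once verified, drop the old dataset and rename the new one:
    #   zfs destroy -r tank/media
    #   zfs rename tank/media_new tank/media

The rewritten blocks still land on the vdevs in proportion to free space, so as noted above this won’t give you the clean 50/50 split that a full offload and restore will.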

Ah, ok.

Yeah, I will proceed with doing a full data dump. I am trying to figure out the fastest way of doing it.

I would like to have a stripe of disks, but I am not sure what the best way to move the data is. TrueNAS does not have a native file explorer, if I’m correct, and SMB performance from share to share, even on the same machine, is weird and inconsistent.

If you have SSH on the other side, you could rsync over SSH instead of using SMB.

If the backup disks are mounted on the system, you could rsync locally.
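
Either way the command is roughly the same; host and path names here are just examples:

    # Local copy, if the backup pool is imported on the same box:
    rsync -avh --progress /mnt/tank/media/ /mnt/backup/media/
    # Or over SSH to another machine:
    rsync -avh --progress /mnt/tank/media/ backupbox:/mnt/backup/media/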

One of those cats with a million ways to skin it.

ZFS replication (zfs send | zfs recv under the covers) is generally the fastest way to move data from one TN to another.
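
If you want to drive it by hand rather than through a replication task in the GUI, the rough shape is as follows (snapshot, pool, and host names are examples):

    # Manual replication sketch - "tank/media", "backup/media" and "backupbox" are placeholders
    zfs snapshot -r tank/media@migrate
    zfs send -R tank/media@migrate | zfs recv -F backup/media        # target pool on the same box
    # or, between two machines:
    # zfs send -R tank/media@migrate | ssh backupbox zfs recv -F backup/media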

You already have a backup copy of the data, right? Just use that. Stop user access, let a final (for now) backup complete, then use the backup as the source and copy it back to the production TN after adding the storage and recreating the zpool (faster and cleaner than deleting data).

I don’t understand the use case for the additional copy, or was the additional stripe zpool going to be on the same TN as production? You can still use replication to copy data between zpools / datasets on the same TN.

The additional copy was an OH SHIT backup. My intent was to leave the full backup offline IN CASE I messed something up. I am still a rather new transplant to TrueNAS, so it’s just another redundancy. Plus I figure the onboard stripe is going to be faster than pulling a copy through the network, even at 10Gb.

I was discussing this with a coworker last night and he suggested the following if you want to improve performance.

  1. build your temp stripe (or even better yet, make it a huge raidz1 just in case)
  2. copy your production data over
  3. destroy your existing production zpool
  4. physically add the new hardware
  5. rebuild the production zpool but instead of 12 x 8-way RAIDz2 use 24 x 4-way RAIDz1
  6. set recordsize to 1m
  7. add at least one hot spare to the zpool, 2 is probably a better number given the number of drives involved
  8. copy your production data to the new production zpool

Write performance is directly proportional to the number of top-level vdevs in the zpool. By going to 4-way RAIDz1 vdevs you double the number of top-level vdevs, and you really don’t lose much redundancy, especially if you add hot spares.
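
For illustration only, the suggested layout would be built along these lines from the shell (device names are placeholders, and on TrueNAS you would normally create the pool through the GUI instead):

    # Sketch: 24 x 4-wide RAIDz1 plus 2 hot spares, recordsize=1M on the root dataset.
    # da0 ... da97 are placeholder device names.
    spec=""
    for i in $(seq 0 4 92); do
      spec="$spec raidz1 da$i da$((i+1)) da$((i+2)) da$((i+3))"
    done
    # $spec is deliberately left unquoted so it expands into separate arguments
    zpool create -O recordsize=1M tank $spec spare da96 da97

That folds step 6 of the list above into the pool creation, and the two spares cover step 7.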

Very interesting!

The only concern I have, which is why I picked RAIDz2, was pool failure during a resilver in case of a 2nd drive failure. In the end, yes, I have a full copy, but it would be kind of a PIA to have to dump a full set of data back if the pool fails.

That being said, the write performance is what I was looking for.

Since the vdevs are smaller, I am assuming IF there was a resilver, the process would be quicker?

I had a very lopsided pool from years of filling a vdev, adding a new one, filling a vdev, adding a new one, etc. I shifted all the data to another pool, rebuilt a new pool from scratch and moved the data back.

Before I did that, it was taking 4-5 days to scrub. Now, with the data balanced, it only takes two days despite there being much more data (~500TB).

However, as far as resilvers go, I’m not entirely sure. I think it only needs to trawl the data on the affected vdev and not the entire pool, so performance may be similar. Someone smarter will correct me. Resilvers on smaller vdevs are probably faster, but only because the capacity is probably smaller. :)

It’s worth the time to move the data off and back if performance is your goal, IMO.

Typically RAIDz vdevs resilver at the speed of one drive. So the resilver time will be proportional to the amount of data it needs to scan. See ZFS Resilver Observations – PK1048 for what I measured over 10 years ago (FreeBSD ZFS).

And I agree that the added belt and suspenders is worth something. But adding hot spares removes the time to discover and attach a replacement drive from the time to recover. As soon as a drive fails the hot spare starts replacing the bad drive. I see that as a reasonable compromise, in your case between performance and reliability. But everyone has their own risk tolerance and needs to make their own decisions.
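
Adding spares to an existing pool is also a one-liner if you go that route (pool and device names are examples):

    # Attach two hot spares to the pool ("tank", da96/da97 are example names)
    zpool add tank spare da96 da97
    zpool status tank    # the spares show up in their own "spares" section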
