LARGE Pool Data Balancing After Expansion

VersionZero · April 30, 2025, 2:07pm

My main system is a new Truenas build, 24.10.

My main storage is aprox 120Tib over 48 ssd’s. We are currently using over 105 of that.

I am going to expand the pool by doubling the storage. Should I worry about data balancing? If so what would be the best and or most efficient way to balance over 100tib of data?

Current Storage
3 Case Nas:
Norco RPC-4308 “Controller”
Gigabyte B550M DS3H
AMD 5700G
64gb Corsair Vengence
10Gtek XL710-QDA2 40gbe
LSI 9305-16E Sas Controller
Adaptec 82885T Expander
OS - 240gb ssd x2 Mirrored
Array1 - 4Tb hdd x8 RZ2

Norco RPC-4224 “Storage 1”
Adaptec 82885T
Array 2 - 4Tb ssd x24, 3X VDEV 8 SSD’s RZ2

Norco RPC-4224 “Storage 2”
Adaptec 82885T Expander
Array 3 - 4Tb ssd x24, 3X VDEV 8 SSD’s RZ2

New Proposed Storage

Norco RPC-4224 “Storage 3”
Adaptec 82885T
Array 4 - 4Tb ssd x24, 3X VDEV 8 SSD’s RZ2

Norco RPC-4224 “Storage 4”
Adaptec 82885T Expander
Array 5 - 4Tb ssd x24, 3X VDEV 8 SSD’s RZ2

Opinons are much appreciated!

Jorsher · April 30, 2025, 4:30pm

The best way is to move the data completely off the storage then move it back…

There are scripts that kinda-sorta do in-place balancing, but I’ve never used one.

It would be really nice if ZFS re-balanced storage when you added a vdev, I don’t see why it would be difficult or impossible, but there are probably higher priorities.

Krill · April 30, 2025, 5:09pm

Isn’t that HBA bottlenecking the system?

https://github.com/markusressel/zfs-inplace-rebalancing/blob/master/zfs-inplace-rebalancing.sh is the link to the in place rebalancing script. Easy to use and will not take upore time than a manual rebalance.

awalkerix · April 30, 2025, 5:36pm

Or rather than trying to rebalance you can just keep using the array as-is and avoid changing things unless you encounter a definitive issue. These “rebalancing scripts” have their own negative impacts, especially if you have snapshots and replication.

VersionZero · April 30, 2025, 5:55pm

UGH… I have HDD backups, but moving 100+Tib off of platters, thats gonna take a min…

Surprisingly no. I have the arrays in a fan out design. The expanders act like network switches, so I get an aggrigate of all the drives hooked to them. Each link to the expanders supports upto 48gb/s. The 9305-16e has 4 links so 192gb/s capacity. The only bottle neck which sucks is the PCIE slot which for a 3.0 x8 is 8gb/s. But I am running a 40g qsfp+ cards which maxes out at about 5gb/s so I think that is the real bottleneck, but thats more then plenty for what it is used at. Now that is for reads, writes on the other hand are slower because of the parity, but I think that will be helped since I am expanding the arrays. For testing purposes, I had a second machine with another 40g interface, with a set of NVME’s in stripe and I was able to get a little over 4gb/s on large files which make up the majority of the storage, and slower on all the little ones, but not that big of a deal. I am pumping all that data through a cisco nexus switch which has 48 10g ports and 4 40g ports.

I like this, I am going to look into this!

Wont I get more IOPS though if the files are spaned over 12 vdevs instead of just 6?

etorix · April 30, 2025, 6:00pm

Are you limited by IOPS with your all SSD array?

VersionZero · April 30, 2025, 6:02pm

Write seem to run at about 15-20% of the speed of reads, even though the throughput on each ssd is pretty close to symetrical.

etorix · April 30, 2025, 6:08pm

And what makes you think this is IOPS rather than overfilling the SLC cache of the drives (if any)?

VersionZero · April 30, 2025, 6:13pm

I did a test with sets of the same ssd’s in stripe vs in rz2, and the write performace was much better. If the drives in stripe performed the same, I would of chalked it up to cache or buffer, but that was not the case. So being that the sets are in raid, I assumed it was the parity which was slowing the writes down, is this not correct?

My assumption if that was the case, is if I make more vdevs which allow more parallel IOPS, both my read/write speed would increase. The ratio would stay the same meaning my reads would still be 5-6 times the speed, but since I am capped by my NIC, the writes would be closer to the max of that connection since the data is spanned over 12 vdevs instead of 6.

I know that IOPS are different the throughput, but does that not factor into parity raid since there is a lot more going on then just writing data to a drive?

PK1048 · April 30, 2025, 10:27pm

Since you are doubling the storage (I assume doubling the number of top level vdevs), then I would not worry about or bother rebalancing.

ZFS will preferentially write to the less full vdevs and over time, unless your workload is write once, the data written with naturally balance. Since you will be adding as many vdevs as you have now, your performance will be about the same even if ZFS is just writing to the new vdevs (which it won’t be, but the majority of writing will be to the new vdevs).

The above assumes this is all one zpool.

VersionZero · April 30, 2025, 10:51pm

Most of the Data is Writen Once. They are large video files. We produce educational videos, so the raw files are quite large. I need fast access so the staff can edit, transcode, etc.

The video files are eventually put into cold storage after 2 years post production, untill that though, they have to be kept on hot storage.

My throughts were if my first VDEVs are full then, Truenas will only put data onto the new VDEVs, which performance wise would pretty much be the same as I have now. BUT, one of the selling points for the SSD arrays is to maximize performance. Therefore, as previously stated, would it not behoove of me to have the files spanning 12 vdevs instead of just 6? Seeing what you are saying, naturally if the files are constantly being modified, or moved, added/deleted, it would naturally balance itself, but in this case, I do not think that would happen as fast. Most of the files are 30gb+, and they generally only once put into the raw directories, and stay there untill put into cold storage and removed.

Yes it is all one large pool.

PK1048 · May 1, 2025, 9:22pm

Since most of the data is write once, then you probably do want to rebalance.

A full rebalance would require moving the data out of one zpool into another and then moving it back (so that when it comes back all the top level vdevs have about the same amount of free space). This is probably impractical

If you were to move the data within the zpool you would (slowly) get to the point where free space was equal among all the top level vdevs, and from that point forward you would be spreading the load among all the top level vdevs. You can see how full each vdev is by looking at zpool iostat -v <zpool name>. To gain the benefit for all the data some of it would have to move at least twice (the data moved before the vdevs were equalized), but the holes left as that early data was moved would be unequal. Perhaps the most optimal order to move the data would be to start by moving both the oldest and newest data , the oldest would be written to just the oldest vdevs and the newest to the newest vdevs. Still not an optimal situation…

Having no better ideas, I come back to not worrying about it and letting it rebalance organically. Will you really get enough better performance to counterbalance the time spent trying to manually rebalance the data ?

VersionZero · May 1, 2025, 9:38pm

From my calulations, I think so, but I have not got a definitive answer.

My reads are already pretty much maxing out the 40g NIC, but my writes are a lot slower, I was hoping that doubling the parallel vdevs and balancing the data to leave enough room open across all 12 that my write performance will drastically improve. I am hoping to get it north of 3gb/s which is about double what I am currently getting.

I am currently looking at the script mentioned above if it will accomplish everything, OR am I just going to spin up some platters, and dump the whole pool and re-copy it… IDK

awalkerix · May 1, 2025, 10:19pm

You will also want to verify that the rebalance script isn’t going to explode your space utilization (for example if snapshots are a part of your backup strategy).

VersionZero · May 1, 2025, 11:19pm

Can you articulate further please?

Currently, I do NOT have snapshots. My current back system is a secondary server, that runs hdd’s which is setup as rsync one way from MAIN to BACKUP. It is physically isolated in a different building, and is FIBER and POWER isolated as well.

If the script is dangerous or not viable, I will probably just setup another pool with an 8 way stripe of 16tb drives. That would act as my temp storage to move all the data off of the pool, and since it a stripe, I should get good transfer speeds. Once the data is copied and verified to the temp pool. I will nuke the existing main pool, add the new vdevs, and start the data dump on to the new pool.

The backup server will get a fresh update just before this process starts, and I will take it offline just to be safe. Being that the strip has no redundency, better safe then sorry…

awalkerix · May 2, 2025, 12:25am

The rebalance script is a “copy then delete original” script. This means if you have snapshots you can potentially end up increasing the amount of space used on the server. C.f. ZFS documentation.

You have to weigh what the actual benefits of rebalancing are against the cost in terms of time, space utilization, and CPU / IOPs while the operation is happening. Note that it’s also a good idea to make sure you’re not doing IO to the files from other applications while you’re rebalancing them. For most users the juice isn’t worth the squeeze.

VersionZero · May 2, 2025, 1:16pm

Very well said!

I figured as much. When I planned this was going to be a long weekend or something. Systems will be completly down and nothing eles running. Since I am the one who pays for the oranges, the juice is worth the wait of the squeeze!

PK1048 · May 2, 2025, 1:35pm

What are you getting for write performance (I assume via SMB)?

Looking back on your original post with config … You list 5 “Array’s” are these separate zpools? You sat 120TB over 48 drives, but each Array 24 drives … and your first array is HDD not SSD.

Are your SSD’s enterprise with high random write performance or consumer?

Just trying to understand your configuration and how to get you the write performance you want.

VersionZero · May 2, 2025, 1:59pm

Yes, SMB. Large file transfers 1.5gb/s+/-

Ignore array 1, that is for a different purpose as a different pool.

Array2,3 are the main pool currently.

Each case has 24 bays, filled with prosumer grade ssd’s with very good random performance. Each array is split into 8 ssd’s in a rz2 configuration. So for the 48 total between the existing cases, I have 6 VDEVS of 8 in RZ2.

Each case has its own link via 8644 cable to the HBA.

Since I am using 9305-16e I have 4 links which is why I am thinking of expanding with 2 more cases of the above configuration “array4,5”.

I am also considering adding another 9305-16e and using it for the new cases. I am limited by the PCIE slot, but that is still more then the NIC card.

PK1048 · May 2, 2025, 2:30pm

OK, so Array 2 and Array 3 are not really logically different, just separate chassis?

Your zpool has 6 top level vdevs, each a RAIDz2 of 8 drives.

You are planning to add another 6 top level vdevs of 8 RAIDz2.

What do you have the recordsize set to for the dataset in use?

What does a zpool list show for CAP and FRAG for the large pool?