Metadata vdev, data vdev expansion and Electric Eel query

Hi Everyone,

I’m writing this post to ask for some advice regarding a specific situation that is about to become…possible: what is the best way to add disks to a data vdev (once Electric Eel is released), along with a metadata vdev, and then rebalance to ensure the most effective use of the available resources?

The context is this:

I have a 6-wide raidz2 data vdev (Iron Wolf Pro 16TB drives, ST16000NT001). I have four additional drives (same make and model), plus two Samsung 990 PRO 1TB drives and one WD Blue SN570 1TB drive (WDS100T3BOC). The pool is used for storing a combination of large media files (roughly 300MB to 20+GB each) and smaller media files ranging from 5MB to 120MB. The pool in its entirety is write once, read many; essentially a typical media pool.

However, there are generally many additional files present, all of which are less than 1MB and are accessed regularly. I understand that these can be stored on a metadata vdev with the correct settings. There is a slight hiccup in identifying how many such files there are (as per the guide from Constantin), because the record size for all datasets is currently set to 128K (I’m a noob, but I was more of a noob when I set this server up last year). That said, there will not be more than 50,000 of these files smaller than 1MB, and I would expect the number to increase to around 120,000 as the pool fills.
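
For a rough count I was planning to simply walk the filesystem with find, something like this (the path is just a placeholder for my media dataset):

    # Count files under 1 MiB in a dataset (path is a placeholder)
    find /mnt/tank/media -type f -size -1048576c | wc -l

    # Rough split around the 1 MiB cutoff, to sanity-check the numbers above
    find /mnt/tank/media -type f -printf '%s\n' | \
        awk '{ if ($1 < 1048576) small++; else large++ } END { print small+0, "files < 1MiB;", large+0, "files >= 1MiB" }'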

My thought from reading the guides is to add a metadata vdev as a triple mirror (possibly a quad mirror, given the current concerns about the Samsung 990 PRO drives, even though I haven’t seen those issues myself; use two WD Black 1TB drives and keep the Blue as a spare), set special_small_blocks=1MB, and set the record size to 2MB. Whilst this is not space efficient for the NVMe drives, a significant number of the files are over 512KB in size, and the space/drives exist, so I might as well use them (unless that is dangerous, etc.).
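
If I’ve understood the guides correctly, that would boil down to something like the following (pool/dataset names are placeholders, and as I understand it these properties only affect newly written blocks, which is where the rebalance comes in):

    # Placeholder pool/dataset names; recordsize above 1M may need a reasonably recent OpenZFS
    zfs set recordsize=2M tank/media
    zfs set special_small_blocks=1M tank/media

    # Verify the settings
    zfs get recordsize,special_small_blocks tank/media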

However, to save writing the data twice (as I am in no rush), would there be an issue with adding a metadata vdev, such as a quad mirror, at the same time as adding the 4 additional disks, and then running the rebalance script (is it the markusressel one, zfs-inplace-rebalancing)?
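
My recollection of how that script is used (going from memory, so check the repository README for the current options and caveats):

    # Sketch only; the repository and basic usage, as I remember them
    git clone https://github.com/markusressel/zfs-inplace-rebalancing.git
    cd zfs-inplace-rebalancing
    ./zfs-inplace-rebalancing.sh /mnt/tank/media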

This raises the question of whether the other possible methods are better: back up, destroy the storage pool, recreate it from scratch as a 10-wide raidz2 and reload the data (does not need Electric Eel); or even create a new 4-wide raidz2 with the metadata vdev and transfer the files over in situ, before decommissioning the 6-wide pool and then adding the 6 disks (requires Electric Eel).

Ultimately, I think this is a relatively minor issue for me, as this is more of a “because I can” scenario rather than a need, but I think it highlights a couple of scenarios that iX might want to consider if there is any form of GUI in place for vdev disk addition and rebalancing.

Any advice or comments are greatly appreciated.

Not sure what you have in mind here. If you want to add a second raidz2 vdev, you can do it now—but while 6-wide raidz2 + 4-wide raidz2 is possible, it would be best to make it 2 x 6-wide raidz2. If you want to widen the single raidz2 vdev to 10, you indeed have to wait for Electric Eel, but you should rather start from the current 6-wide vdev and add 4 drives than create a 4-wide vdev anew and add 6 drives.

Since this is all about reads, consider using a persistent metadata L2ARC rather than a special vdev. You can do this at any time and redundancy is not required since it is NOT a critical vdev.
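
Something along these lines, though on TrueNAS you would normally add the cache device through the UI rather than the shell; the pool and device names below are placeholders:

    # Add an NVMe device as L2ARC (device name is a placeholder)
    zpool add tank cache nvme0n1

    # Restrict the L2ARC to metadata only
    zfs set secondarycache=metadata tank

    # Persistent L2ARC is rebuilt after reboot when this module parameter is 1
    # (the default on recent OpenZFS)
    cat /sys/module/zfs/parameters/l2arc_rebuild_enabled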


I’m sorry, I may not have explained the query well enough. One aspect of the question is the order of operations.

There are four proposed operations intended:

  • Change record size
  • Extend the vdev (I understand this cannot be done until Electric Eel is released)
  • Add a metadata vdev (and the change to special_small_blocks)
  • Rebalance the current data in situ to take advantage of the above changes in pool structure

Does the order of these operations/changes/tasks matter? Should a dataset be rebalanced after each change?

I currently have 256GB of ECC RAM in place. I do not intend to create a second vdev (inefficient use of resources, as noted). I did not know that it was possible to use L2ARC for persistent metadata/small files, which would make most of this post redundant, so thanks!

Destroying the pool and creating it again. Maybe not the most convenient, but without doubt the best.

This is indeed the rebalancing script that’s been recommended on these forums for a while; I have successfully used it myself.

Nope, you cannot use L2ARC for storing small files; you can set your L2ARC as persistent, meaning it won’t be empty upon reboot (it might however take a while to populate, depending on its size). Doing so will allow it to hold the most frequently read files in cache.

If you want better performance reading small files 100% of the time, even the infrequently hit ones, you will have to go for the metadata vdev… but unless your use case is very specific and requires very strict performance boundaries, I too would suggest going the L2ARC route: the gains of a metadata vdev are not worth the compromises most of the time.

The last thing you want to do is rebalance; the second-to-last is adding the metadata vdev. There is no need to rebalance after each step.

Changing record sizes and then rebalancing is something you can do now, if you like. Especially if you add a persistent, metadata-only L2ARC ahead of time, you’ll get most of the benefit of an sVDEV for a WORM archive like you’re describing. Where the sVDEV really sings is a pool that has to host both small and big content, where the small content is changing / getting accessed a lot.

For example, in the past, people would set up a high-capacity HDD Z-whatever pool for bulk file storage and then they would also set up a smaller mirrored SSD pool for virtual machines (VMs) and databases (DBs) since those kinds of applications call for a lot of file access, a lot of small files getting accessed, and said small files are also changing a lot. Large record sizes and slow HDDs make a really bad combination for such work loads, so creating a separate pool with SSDs to handle those work loads made a lot of sense.

What fusion pools (sVDEV + VDEV) allow is a rationalization / repurposing of drives such that all types of data can be accommodated in one pool by virtue of planning in advance and carefully tuning each dataset to its intended use - that is, by record size and small file cutoff. The bonus is that all small files benefit, not just the ones that used to reside on the SSD pool, ditto all metadata. Usually this comes at the cost of adding one or two more SSDs to the NAS than you used to have (a 4-wide mirror in my case).
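
As a rough illustration of that per-dataset tuning (dataset names and values here are just made up for the example):

    # Large sequential media files: big records, nothing forced onto the special vdev
    zfs set recordsize=1M tank/media
    zfs set special_small_blocks=0 tank/media

    # App/DB-style data: small records, and with the cutoff equal to the recordsize
    # every block of this dataset lands on the special vdev
    zfs set recordsize=64K tank/appdata
    zfs set special_small_blocks=64K tank/appdata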

Another thing to consider doing in a WORM context is consolidating data. I had a bunch of backups of older drives and their system folders that involved tremendous quantities of small files that really challenged rsync and TrueNAS performance. I didn’t access them often, so I consolidated them into flexible sparsebundle archives on the Mac. Millions of small files consolidated into thousands of larger ‘bands’ - significantly speeding up rsync, directory browsing and reducing metadata by leaps and bounds.
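
The same idea works with plain tar archives on the NAS side; a minimal sketch with made-up paths:

    # Roll a tree of small files into one archive, then check that it reads back
    tar -cf /mnt/tank/archives/old-drive-2019.tar -C /mnt/tank/backups old-drive-2019
    tar -tf /mnt/tank/archives/old-drive-2019.tar > /dev/null && echo "archive reads back OK"
    # only then remove the original tree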

Anyhow… as long as your machine has the RAM to support an L2ARC (64GB+), I’d start with that since the L2ARC is not essential / redundant. sVDEV is something you can do later, ditto adding a 6-drive Z2 VDEV (I would always add similarly-composed VDEVs). In the meantime, you can consolidate files into archives, adjust record sizes, and rebalance extant files to reduce metadata needs and improve pool performance (on top of the L2ARC, which will also help a lot with browsing, rsync, and like tasks).


Thanks everyone. I did a bit of reading and I’ve enabled L2ARC as metadata only, I’ll see how it feels going forward.

I’m not 100% certain but I think you may also be able to “rebalance” by performing a local replication of a dataset, one dataset at a time. Or recursive, as long as you have enough free space.
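
Roughly like this, though I haven’t tested it recently, so treat it as a sketch (dataset names are placeholders, and you need free space for a full second copy):

    # Replicate the dataset; the received copy is written with the current pool layout
    zfs snapshot -r tank/media@rebalance
    zfs send -R tank/media@rebalance | zfs receive -u tank/media_new

    # After verifying the copy, swap the names and destroy the old dataset
    zfs rename tank/media tank/media_old
    zfs rename tank/media_new tank/media
    # zfs destroy -r tank/media_old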

I also think the block-cloning feature may break the rebalance script, which essentially does a cp in place. I think.
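
If that’s a concern, you can at least check whether the feature is active on the pool before relying on cp to rewrite data (pool name is a placeholder, and the module parameter may not exist on older OpenZFS builds):

    # Is block cloning enabled/active on this pool?
    zpool get feature@block_cloning tank

    # On OpenZFS 2.2.x, whether cp-style copies may use block cloning (0 = off)
    cat /sys/module/zfs/parameters/zfs_bclone_enabled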

bump

I upgraded to the Beta today, as Passbolt had been approved for migration, which was the final app I was waiting for. The upgrade went smoothly, every app migrated, it was great. I upgraded the storage pool to the new ZFS version for vdev expansion, and started the expansion with the first 16TB disk (Seagate Iron Wolf Pro ST16000NT001, fwiw).

Which leads me to another query: how long should it take to add each disk? Whilst everything on the server is accessible and it responds, it’s been sitting at 25% (pool.attach) for half an hour. I wonder if this is something that needs to be run at night…

Adding drives should be essentially instant, so you have a problem…

Yeah, I figured that. I restarted the server and the vdev now says it is 7-wide raidz2, but there is no change in the capacity of the pool. Probably time for a bug report. When I try to add the next disk it gives error 2098, which it says is not a valid error code (I imagine the error code recognition has not been updated, given it’s a beta).

Now running a scrub to see what that says. The storage pool is fine otherwise; apps are still connected to it, and the NFS and SMB shares are still accessible.

Look at zpool status in the shell.

It can take many hours/days to finish an extension on large disks which are full.
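
You can keep an eye on it from the shell with something like this (pool name is a placeholder); the status output includes the expansion progress while it runs:

    # Re-check the pool status every minute
    watch -n 60 zpool status -v tank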

Thanks Stux, that’s the solution.

Yes, it looks like about 60 hours per 16TB disk, but it’s still moving forward.
