Metadata vdev drive replacement + size increase?

I’d like to confirm my plan to improve the resiliency of my main storage pool by replacing the existing metadata vdev drives with PLP drives, which should also give me a larger metadata vdev once the replacement is done.

The pool’s metadata is backed by a 2-way mirrored special vdev of Samsung 850 PRO 128 GB drives. These drives do not have PLP, so I found a pair of Dell LB206S SLC drives that should have PLP and that I will replace the two Samsungs with. The Dell drives are also 200 GB each, so after the replacement is done, I believe I should get the larger metadata vdev as well.

Here is how I plan to do it:

  • go to Storage → Pools → Pool Status
  • replace one Samsung drive in the Special vdev
  • remove the correct (!) disk from the storage server
  • insert the first Dell drive (it should register in the TrueNAS UI after insertion)
  • extend the Special vdev to that new disk
  • replace the second Samsung drive
  • insert the second Dell drive
  • extend the Special vdev to the second new disk

That’s it - metadata vdev should now be 200 GB and should be mirrored.
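
For reference, this is how I understand the rough CLI equivalent of those steps; the pool name (tank), device names (sde, sdf) and GUID placeholders below are made up for illustration, not my actual layout:

  zpool status tank                        # note the GUID / serial of each sVDEV member before touching anything
  zpool set autoexpand=on tank             # so the mirror can grow from 128 GB to 200 GB after both swaps
  zpool offline tank <samsung1-guid>       # then pull that disk and insert the first Dell in its place
  zpool replace tank <samsung1-guid> sde   # resilver onto the Dell; wait for zpool status to report completion
  zpool offline tank <samsung2-guid>       # repeat for the second disk once the first resilver is done
  zpool replace tank <samsung2-guid> sdf
  zpool online -e tank sde                 # expand in place if autoexpand was off during the swaps
  zpool online -e tank sdf
  zpool list -v tank                       # the special mirror should now report ~200 GB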

Are my expectations correct here?

The sVDEV allocates 25% of the sVDEV capacity to metadata and 75% to small files by default. You can change that ratio. Have you looked into how much room metadata / small files are actually consuming?

As for your proposed disk-swap approach, it seems workable though I would verify in advance which S/N goes with which drive. I would also consider a 3-way sVDEV mirror, but that’s just me…
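
If it helps, this is roughly how I’d check the actual consumption (the pool name “tank” is a placeholder):

  zpool list -v tank    # per-vdev ALLOC/FREE, including the special mirror
  zdb -Lbbb tank        # block statistics broken down by category (may need -U <path to zpool.cache> on TrueNAS)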

Thanks, I thought of 3-way, but since I am doing a daily replication to a separate machine in a separate location, I am currently comfortable with the risk.

Hm, I thought the default was to place no small files on the metadata sVDEV?
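
(A quick check on my side, if I understand the property correctly - “tank” is a placeholder pool name:)

  zfs get -r special_small_blocks tank   # defaults to 0, i.e. no small file blocks land on the sVDEV unless raised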

I am reading around 3 GB of metadata on the pool right now, so capacity should not be an issue. The pool is mostly empty for now and its size is 54 TB - the unofficial rule of thumb is to expect 0.3% of storage space dedicated to metadata, which is 162 GB (hence the choice of 200 GB drives).

I am mostly doing the swap for the peace of mind of having PLP on the Dell enterprise drives.

Per @HoneyBadger, the sVDEV reservation for small files vs. metadata can be adjusted via the zfs_special_class_metadata_reserve_pct parameter. I have never used it.
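
A sketch of how it should be adjustable on a Linux-based system (again, I have not used it myself, so treat the exact path as an assumption):

  # Read the current value (default 25: small data blocks stop going to the sVDEV
  # once its free space drops to 25%, keeping the remainder for metadata)
  cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct
  # Change it at runtime; reverts on reboot unless made persistent
  echo 15 > /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct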

I’d look into adjusting your dataset recordsize to minimize metadata needs. Per @winnielinnie, there is very little loss re: larger record sizes if zstd, lz4, etc. compression is enabled. It really helps squash metadata needs down. 1M is good for images, archives, videos, and other “large” files; smaller recordsizes (including the 128K default) are better suited for VMs, databases, and similar workloads.
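
For example (dataset names below are placeholders, and a new recordsize only applies to blocks written after the change):

  zfs set recordsize=1M tank/media     # images, archives, videos
  zfs set compression=zstd tank/media
  zfs set recordsize=128K tank/vms     # default-sized records for VM / database style workloads
  zfs get recordsize,compression tank/media tank/vms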

By bundling many little files into archives (e.g. sparse bundles on the Mac), increasing the recordsize, enabling zstd compression, etc., I was able to reduce my pool metadata needs to about 0.08%. See the sVDEV resource I helped co-author for more details.

I thought I had the whole thing squared away but you’ve given me quite a bit to think about. Let me think this through, do some SSD qualifications and report back on what I have.

One suggestion I have is to consider adding the sVDEV “expansion” procedure I am embarking upon myself to the sVDEV resource. I know this has been asked and answered in the past on the old forums yet it is not easy to find. I am also sure it will be asked in the future. Might be a good place to land this info.

In the meantime, I’ve noticed something curious. If you look at my pool, it currently sits at 2.9 T, metadata is 5.23 G (not 3 G, as I claimed, but ok).

Now, if I look at the L1 total output from the zdb -LbbbA.... command, I am seeing it is over-reporting the size by a couple of hundred MB. Any thoughts on why this might be?

[screenshots: pool usage and zdb output]

Great suggestion - can I wait until you have confirmed that the sVDEV replace + upgrade procedure is no different than a standard drive swap in a VDEV?

As for the difference, that is a great question and I do not have a good answer. I don’t think those numbers are necessarily related, as my output charts are not in sync either.

I’d keep a much closer watch on how full the sVDEV is relative to pool capacity (i.e. your first screenshot), making sure that your pool is more than 3% filled with content by now. Otherwise, you might run out of metadata room.

If I’m reading your second screenshot correctly, it suggests 0.13% metadata relative to the pool size, which is still roughly half the 0.3% suggested by rules of thumb on this and other forums.

You have a pool with two Samsung drives as a mirrored special vdev?

Is there a reason you don’t just put the Dell drive in and replace one of the Samsungs (i.e. in the UI)?

When that finishes, take that Samsung out. Repeat with the other Dell.

I.e., use online replacement.
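
Roughly, the command-line version of that would be something like this (device names are placeholders; as far as I know the UI replace does the same thing underneath):

  zpool replace tank <samsung1> <dell1>   # the old disk stays attached until the resilver finishes
  zpool status -v tank                    # wait for the resilver to complete
  zpool replace tank <samsung2> <dell2>   # then repeat for the second pair
  zpool status -v tank
  zpool online -e tank <dell1>            # pull the Samsungs afterwards, then expand to the new 200 GB size
  zpool online -e tank <dell2>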

This is a better plan, IF you have a spare drive slot. :smiley:

Are you expecting lots of sync writes you want to speed up? Otherwise, I don’t think PLP is that important for a metadata vdev; it matters more for a SLOG.

This is something I always wondered.
Can anyone tell me if this is really true?
I have heard this from an IT friend:

Assuming you have Sam1 and Sam2 as a mirror.
You want to replace Sam1 with Intel1 and Sam2 with Intel2.

There are two options.

  • You could click replace on Sam1 to replace it with Intel1
  • you could create a 3-way mirror and then downgrade it later on

The latter option is safer. Why? According to him, as soon as you click to replace Sam1, ZFS will no longer use Sam1. So if there is a read error on Sam2, ZFS will not fall back to Sam1 in this scenario, while it would with the second option.
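
A sketch of what the second option would look like on the command line (Sam1/Intel1 etc. stand in for the real device names):

  zpool attach tank Sam1 Intel1   # makes a temporary 3-way mirror: Sam1 + Sam2 + Intel1
  zpool status tank               # wait for the resilver onto Intel1 to finish
  zpool detach tank Sam1          # drop back to a 2-way mirror (Sam2 + Intel1)
  zpool attach tank Sam2 Intel2   # repeat for the other half
  zpool status tank
  zpool detach tank Sam2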

The first option is effectively an automated version of the second option.

I believe (but may be wrong) that zfs will use a drive being replaced as a source, if it is online.

Otherwise it would be impossible to replace a stripe drive :wink:

This is exactly what I had done. Inserted a pair of new SSDs, did the replace one by one. Worked like a charm and exactly as I had expected.

Thank you all for your help!

EDIT: Wanted to address this interesting point by @Sara

I don’t think PLP is that important for a metadata vdev; it matters more for a SLOG

Is PLP actually relevant for “ordinary” (non-sync-write) pools? I would imagine it is, as you could have a (really) unlucky loss of power during a metadata write, have that write stuck in the DRAM buffer of one SSD mirror member, and not give ZFS time to write it to the other mirror member (assuming a 2-way mirror).

Correct me if I am wrong, but this would not be great for that pool?

I assume this behaves the same as for non-special vdevs:
your described scenario is what sync writes are for.
Unless you get some awful Phison E18 controller with a wonky ADATA or Gigabyte firmware, in general even consumer SSDs don’t lie about writes having been committed to NAND.
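
(If that scenario is a real concern for a particular dataset, sync behaviour can be forced per dataset; the dataset name below is a placeholder:)

  zfs set sync=always tank/critical   # every write is committed to stable storage before returning
  zfs get sync tank/critical          # default is "standard", honouring only explicit application sync requests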

Yes, my understanding is that this is a general assumption for any vdev benefiting from PLP - I’d rather err on the side of caution when dealing with a critical sVDEV.

The Pliant LB206S drives I’ve done the replacement with are indeed from a golden era of SSDs - these are enterprise, DRAM-less, true SLC drives rated for > 100k P/E cycles per NAND cell. They aren’t speed demons (especially by today’s standards), but they also don’t have PLP as they don’t need it. I hope they’ll do nicely for the task at hand.
