Metadata VDEV impact is "not noticeable" / "does not perform as expected"

My TrueNAS SCALE system has the following storage set-up:

  • One 16 TB HDD as archive
  • One 16 TB HDD with a 1 TB NVMe metadata vdev added
  • One 4 TB NVMe SSD for VMs etc.

Since I received some comments questioning the usefulness of the metadata vdev, I decided to do some tests.

Note that I configured the metadata SSD so that it stores:

  • all metadata and all relatively small files (up to 500K? strangely, the menus do not show what exactly has been configured; see the command sketch below)
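
As an aside, the configured cutoff can at least be checked from the TrueNAS shell; this is only a sketch and the pool name below is a placeholder:

  sudo zfs get -r special_small_blocks poolname    # small-file cutoff per dataset
  sudo zpool status -v poolname                    # shows the special (metadata) vdev and its members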

The test setup is as follows:

  • 64-bit Windows 11 PC with an NVMe SSD (Defender active)
  • The NAS in network segment 1 (VLAN-1) and the PC in segment 2 (VLAN-2)
  • VLANs interconnected via a pfSense firewall.
  • Network is 10G

So my first idea was to use CrystalDiskMark to measure the performance of my PC-local drives and of the different NAS drives. However, to my surprise, the results of those CrystalDiskMark tests were absolute nonsense / impossible! Really!

So I decided to create a folder containing 45 GB of data in 322 files: just data grabbed from my disk, a mix of small and big files.

Then, on each of the TrueNAS pools (HDD-only, HDD + metadata vdev and pure SSD), I created two network drives:

  • one SMB and
  • one iSCSI

Then I took a stopwatch, copied the test folder to each of those six drives and back, and wrote down the copy times.

The first thing to note: the NVMe-only drive was, as expected, by far the fastest (about 5 to 6 times faster, and probably partly limited by the 10G LAN).

But then the surprise:

  • the HDD without the NVMe metadata SSD and the HDD with the metadata SSD performed more or less equally. And that was not what I had expected.

Assuming that an iSCSI volume is treated as one big file, this is logical; however, that does not hold in the case of SMB.

So I created a new test folder, this time with lots of small files, and then redid the test.

And even with lots of small files the impact of the metadata vdev was negligible. I checked whether the network or the firewall was perhaps the limiting factor, but the network was not, and the firewall load was low during the transfer.

I did all tests using the default settings for iSCSI, SMB and CrystalDiskMark

So the unexpected conclusions for the moment are:

  • do not use CrystalDiskMark for disk performance testing
  • I am missing an option to view what has been configured for a vdev
  • the metadata vdev unexpectedly does not seem to raise performance, and I am really wondering why! Perhaps it should be possible to raise the maximum file size stored on the vdev (see the sketch after this list), or perhaps something else. I did of course hope to get SSD performance for metadata and small files
  • perhaps the metadata SSD helps for local processing … perhaps
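
If the goal is to push more small files onto the special vdev, the cutoff is a dataset property that can be raised from the shell. This is only a sketch: the dataset name is a placeholder, the change only applies to data written afterwards (existing files have to be rewritten or re-copied to move), and setting it at or above the recordsize would send effectively all new data for that dataset to the SSD:

  sudo zfs set special_small_blocks=1M poolname/dataset
  sudo zfs get special_small_blocks,recordsize poolname/dataset    # verify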

I did some additional tests, which might explain the results:

  1. ZFS has an ARC cache which speeds things up, partly making the effect of the metadata vdev invisible (see the commands below)
  2. perhaps even more important … Windows is very bad at processing small files.
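
Related to point one, a rough way to check how much the ARC is masking the metadata vdev is to look at the ARC statistics; as far as I know both utilities ship with OpenZFS on SCALE:

  sudo arc_summary     # ARC size and hit ratios, including metadata hits
  sudo arcstat 1       # live hit/miss counters, refreshed every second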

Related to point two, I copied my small-file test set to an SMB dataset created on a NAS NVMe SSD and copied it back to an NVMe SSD on Windows … terribly slow (~700 Mbit/s). Copying back a dataset with bigger files performed very well (~9 Gbit/s). This was with Microsoft Defender switched off.
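
One thing that might be worth trying for the small-file case: Windows Explorer copies files more or less one at a time, while robocopy can copy many files in parallel. A hedged example (paths are placeholders):

  robocopy C:\testset \\truenas\share\testset /E /MT:16 /R:1 /W:1

/E copies subfolders, /MT:16 uses 16 copy threads, and /R:1 /W:1 keeps retries short. If the bottleneck really is per-file overhead, the multithreaded copy should be noticeably faster.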

So the main problem seems to be the Windows file system. Maybe some impact from the pfSense firewall as well (its load during the small-file transfer was about 15%, so that seems reasonable).

You are trying to bite off a bit too much at once: properly and accurately testing your system’s performance would require ample time and space.

It’s not clear how your pools are structured; please provide the output of zpool status, and once we understand that we can see whether your testing can be considered valid.

A metadata vdev mainly improves read performance and, if properly configured, the write performance of small files. Do note that this performance boost applies only to the single pool containing the metadata vdev and does not cover the remaining pools.

Also, please provide your full system specs.

1 Like

Related to my system: it is a TrueNAS SCALE system, latest version. Below are some hardware specs.

Motherboard: ASUS TUF GAMING B550M-PLUS (WIFI)
CPU: AMD 5700G (disadvantage: PCIe 3 only; however the integrated GPU is very important)
2nd GPU: GeForce GT 1030
92 GB RAM
1x 16 TB HDD
1x 16 TB HDD + 1x 1 TB NVMe
1x 4 TB NVMe SSD
1x 500 GB SATA SSD
1x 500 GB SATA SSD as system disk / boot drive
1x Mellanox Technologies MT27500 Family (ConnectX-3), 2x 10G
2x Samsung FIT Plus 256 GB USB sticks, planned to replace the SATA boot drive

sudo zpool status only shows that everything is OK

My current feeling is that the measurement results are due to:

  • the ZFS cache (ARC), which has more or less the same effect as the metadata vdev
  • SMB overhead (at least on the Windows side)
  • slow Windows behaviour with small files
  • transfers being single-task / single-connection

That the result is … very slow, is clear …

zpool list -v then.

My guess about the benchmarking results is that:

  • With 92GB of RAM, you probably have 64GB+ of ARC.

  • With this amount of ARC, once the metadata has been read from disk, it is in ARC and probably stays there. So ARC probably contains all the metadata for all ZFS pools, and once it is in ARC it will not be read again (from HDD or NVMe), so you won’t see any impact from an NVMe metadata vDev compared with HDD. You will see a benefit from NVMe after boot when metadata needs to be read from disk, but after that you will only see a benefit from a metadata vDev when you have significantly more metadata than you can possibly store in ARC.

  • Any sequential read of large files is going to trigger pre-fetch, so by the time your client requests the next block it is probably already read into ARC.

So the benefits of a metadata vDev are at best limited.
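
This also suggests how to test the metadata vDev more fairly: the ARC has to be emptied first, so that metadata really has to come from disk. Rebooting between runs does this, or (crudely, and disruptive to anything using the pool; on TrueNAS the export is normally done via the GUI) the pool can be exported and re-imported. The pool name below is a placeholder:

  sudo zpool export poolname
  sudo zpool import poolname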

But there are some downsides…

  1. Once added, you can never remove a metadata vDev from a pool. So think carefully before you add one.

  2. The risk of losing a pool is the sum of the risks of losing an essential vDev (data, metadata, deduplication) i.e. each vDev you add increases the risk of loss. The way you counter this is to reduce the risk of losing each of the individual vDevs by employing redundancy.

  3. A single drive HDD data vDev with a single drive NVMe Metadata vDev has a higher risk of failing than the HDD drive alone. So you should only add a Metadata vDev if it is essential.
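
To put rough numbers on point 3 (the failure rates here are purely illustrative assumptions, not measurements): suppose each drive independently has a 3% chance of failing in a given year. Then:

  single HDD data vDev:                       3% chance of losing the pool
  HDD data vDev + single NVMe special vDev:   1 - (0.97 × 0.97) ≈ 5.9% (either failure loses the pool)
  2x HDD mirror vDev:                         0.03 × 0.03 ≈ 0.09% (both must fail to lose the pool)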

The real benefits of TrueNAS and ZFS are that they allow you to have single pools across multiple drives (which increases the risk of loss) and to have redundancy (which reduces the risk of loss back down again). You are not taking advantage of these benefits.

Some NVMe slots and boards allow you to divide a single NVMe into multiple devices - or you can manually partition them. So my advice would be:

  1. Rework the HDD pool with the NVMe metadata vDev to remove the metadata NVMe. Ideally if you can afford to lose 16TB of HDD space, turn the 2x 16TB HDD drives into a single mirrored i.e. redundant pool.

  2. Do NOT use a FIT Plus Flash drive as a boot drive - it does not have the TBW capacity to be a boot drive. USB connections are generally not considered reliable enough for any ZFS drive, and certainly not a boot ZFS drive.

  3. Swap out the 1TB NVMe for a 2nd 4TB NVMe, and make your VM pool 2x 4TB NVMe mirrored vDev. If you want to, then partition the NVMe drives (or preferably use hardware splitting if your MB and NVMe cards support it) to carve out a (say) 32GB partition for use as a mirrored boot drive, and perhaps another partition for use as a mirrored apps partition. This would then free up the 2x SATA ports for additional HDDs - in which case buy another 2x 16TB drives and make a 4x 16TB RAIDZ2 pool.
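
For completeness, a manual partitioning sketch for point 3 (entirely unsupported and purely illustrative: the device name and sizes are examples, this wipes the drive, and the TrueNAS installer normally handles boot partitioning itself):

  sudo sgdisk --zap-all /dev/nvme0n1      # start from a blank drive
  sudo sgdisk -n1:0:+32G /dev/nvme0n1     # boot partition
  sudo sgdisk -n2:0:+128G /dev/nvme0n1    # apps partition
  sudo sgdisk -n3:0:0 /dev/nvme0n1        # remainder for the VM/data pool

The matching partitions on the two NVMe drives would then be mirrored pairwise (boot with boot, apps with apps, data with data).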

All these changes would be easier before you start to put live data on this NAS - once you have data there (and maybe nowhere to put it temporarily to clear and reconfigure) it will be much harder to make changes.

1 Like

As requested. Actual situation.

  • Cheetah: pure NVMe, intended for VMs etc.
  • Poema: combination of HDD and metadata vdev to speed up the HDD
  • Olifant: HDD-only, intended as archive

NAME        SIZE   ALLOC   FREE   CKPOINT  EXPANDSZ  FRAG   CAP    DEDUP  HEALTH  ALTROOT
Cheetah     3.56T  832G    2.75T  -        -         16%    22%    1.00x  ONLINE  /mnt
            3.58T  832G    2.75T  -        -         16%    22.8%  -      ONLINE
Olifant     14.5T  9.86T   4.69T  -        -         1%     67%    1.00x  ONLINE  /mnt
            14.6T  9.86T   4.69T  -        -         1%     67.8%  -      ONLINE
Poema       15.5T  2.66T   12.8T  -        -         0%     17%    1.00x  ONLINE  /mnt
            14.6T  1.98T   12.6T  -        -         0%     13.6%  -      ONLINE
  special   -      -       -      -        -         -      -      -      -
            932G   698G    230G   -        -         0%     75.2%  -      ONLINE
Temp        400G   610M    399G   -        -         0%     0%     1.00x  ONLINE  /mnt
            402G   610M    399G   -        -         0%     0.14%  -      ONLINE
boot-pool   448G   53.8G   394G   -        -         5%     12%    1.00x  ONLINE  -
  sdc3      449G   53.8G   394G   -        -         5%     12.0%  -      ONLINE

Yep, I understand what you are saying. It is for sure that the metadata vdev does not have the effect I hoped for.

I am less concerned about reliability, given that most things on the archive disk have a more or less complete copy elsewhere. Apart from that, redundancy would not only double the cost; more importantly, this is a relatively cheap home-built machine with a very limited number of interfaces / PCIe lanes.

That is also the reason that I am very unhappy that the boot drive is consuming a SATA interface. It is NOT about the cost of an extra SATA drive!!!

I am pleasantly surprised to read something about splitting an (NVMe) SSD into partitions. I thought that option did not exist (here). But I would love to steal part of e.g. the metadata vdev SSD to use as a boot drive!!!

I had never heard of that option; I have to dig deeper into that!!!

I understand your advice for redundancy, but I will not do that: too expensive and no available interfaces. However I do have a second NAS (CORE at the moment), and I plan to synchronize part of this system towards that NAS. IMHO a second system is better than local redundancy.

It’s an unsupported, unrecommended configuration.

If they are on the same site, only marginally so… it is still local redundancy.

What is it exactly that you want to achieve?

1 Like

In this case, it could be removed, since the pool consists of single-drive vdevs (= “one-way” mirrors), and I would suggest removing it, since it fails to provide benefits in actual use.

The iGPU is useless, unless this consumer motherboard requires a GPU to boot.

Assuming that @louis is in the EU, he could replace the motherboard by a Gigabyte MC12-LE0. Server board: iGPU not needed. 6 SATA ports. Boot from a cheap M.2 drive in the onboard x1 slot. NIC in the PCIe x4 slot. If the GT 1030 is half-height, a x8x4x4 riser in the PCIe x16 slot would host 2 M.2 NVMe drives. Total cost for these upgrades: about 100 €.
Leaving the migration to ECC RAM, with an ECC-capable Ryzen, as a next step in the ZFS journey. :wink:

1 Like

It’s amazing how many people are completely unconcerned about reliability … until, that is, something fails and they lose their data and come here asking for help getting their pool back or spend weeks recreating the bits that can be recreated.

5 Likes

This raises the question why you’re considering TrueNAS and ZFS…

It is possible but unsupported, and thus largely advised against.
HERE BE DRAGONS, and all that sort of thing, especially if you’re going to mix the dispensable (boot drive) with the critical (special vdev).

To an extent, this is true, but it leaves a big gap: without local redundancy, read errors on data cannot be automatically corrected and require manually pulling back a good copy.
Some local redundancy plus a copy elsewhere is much better.

1 Like

There is a BIG difference between:

  • Redundancy - which is about keeping your data available and uncorrupted; and
  • Backup - which is about ways to recover data when it is lost or damaged

Whilst I fully understand the issues of the cost of redundancy and backup, there are ways to reduce the cost of redundancy (if you plan correctly from the start). For example, in this situation where we want 32TB of usable storage, we can achieve single-disk redundancy by e.g.:

  • 2x 32TB mirrored - redundant storage 32TB (assuming that 32TB drives are available)
  • 3x 16TB RAIDZ1 - redundant storage 16TB
  • 5x 8TB RAIDZ1 - redundant storage 8TB
  • 9x 4TB RAIDZ1 - redundant storage 4TB

Storage costs per TB are fairly constant across different drive sizes, so reduced redundancy generally means lower disk costs.

Electricity costs are per drive, so more drives equals higher running costs.

So if I were buying disks afresh, I would probably go for 5x 8TB configured as a single RAIDZ1 pool; at c. 25€/TB at present that is 40TB ≈ 1,000€ versus 48TB ≈ 1,200€ for 3x 16TB, i.e. c. 200€ cheaper.

With all of this in a single pool I would also save time on storage administration, e.g. moving data from the live pool to the archive.

Finally, partitioning a boot drive is not supported; however it works just fine for me and I have not had any issues with e.g. upgrading. If I ever have to reinstall TN from scratch, I might lose the non-boot partitions during the TN install process, so I keep a replicated backup on HDD for that purpose. However I would agree with @etorix that if you are going to use metadata vDevs then you should not mix them with anything else, because they are extremely critical.

(A little research suggests that the hardware splitting I was thinking of is called Zoned Namespaces, and I am not sure that this is the same as logically splitting the NVMe into separate raw devices.)
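
(For what it is worth: the NVMe feature for splitting one drive into several block devices is namespace management, which is distinct from Zoned Namespaces, and most consumer drives only support a single namespace. nvme-cli can check; the device name is a placeholder:

  sudo nvme id-ctrl /dev/nvme0 | grep -w nn    # maximum number of namespaces the controller supports
  sudo nvme list-ns /dev/nvme0                 # namespaces currently defined

If nn is greater than 1, the drive can be split with nvme create-ns / nvme attach-ns; if it is 1, software partitioning is the only option.)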

Useful for transcoding and accelerating GUI VMs.

Intel iGPUs may support vGPU or SR-IOV.

Not sure about AMD’s offerings.

Note that I built this new NAS next to my old NAS, as a NAS and as a server to host things like a few (low-end) private websites, test OS installations, a music server, an FTP server, etc. That kind of stuff.

And of course redundancy is an issue, but this is not a NAS intended as high-end enterprise storage. Of course I would not be glad to lose data, however:

  1. a single NAS drive is not less reliable than a PC drive
  2. and in fact I am quite safe with my data:
  • it is on my PC
  • it is on an offline backup
  • and on my new NAS, and probably on my old NAS as well
  • and on my new NAS there is probably a more or less complete copy on the archive disk as well.
    So a disk outage would be very uncomfortable, but it should not be exaggerated

I built the NAS based on an AM4 uATX B550 board with an AMD 5700G (built-in GPU), placed in a uATX tower case. Nice and small. Room for two 3.5-inch HDDs and many 2.5-inch drives. 4x SATA, 2x NVMe.

In the case there is a 10 Gbit NIC and an Nvidia GT 1030 GPU, with the idea of installing one or more OSes in VMs with a GUI. The problem with the GT 1030 is that you cannot share it across multiple VMs, and GPUs with that option cost a fortune.

So that was, and still is, the idea. What I did not expect is that the metadata vdev’s influence would be so limited.

In my current view, the second NVMe slot would be better used for a second NVMe pool, not for redundancy.
Extra NVMe capacity is more important to me than that redundancy. The chance that an NVMe drive fails is not very high, and I can copy the dataset contents to the archive disk, can’t I!?

Since I do not have enough (SATA) interfaces, I really hate the fact that I need to install a separate boot drive. The only workaround for that was a USB drive, so I bought two (not one) high-quality USB sticks to create a RAID1 boot drive as a workaround.

I was also considering an NVMe SSD connected via USB for extra storage, since I prefer to invest in NVMe SSDs over SATA SSDs (much slower at the same cost). I am now considering changing the role of the second NVMe drive.

So those are the reasons I built the system, perhaps a year ago, the way it is now.

I cannot change the disk config because:

  • I already have those 16TB drives (and they are in use)
  • The case only supports two 3.5-inch drives
  • There are only 4 SATA ports
    (two for the storage drives, one for the boot drive, one for an SSD for temp files, etc.)

Why you put the temp files (whatever that means; I am assuming small files) there instead of in the NVMe pool is beyond me.

To save space, so that I can use the NVMe space for things which need that speed more.

I like the Gigabyte MC12-LE0 you are referring to!!

That board does have IPMI, which I would love to have :slight_smile:!

  • it lacks, however, the second NVMe slot
  • it lacks TPM
    Furthermore, it is more or less the same platform, a B550 just like I have now

Yes - but most PC drives are a LOT smaller than 16TB - so whilst the drive is going to be reasonably reliable, if it does fail you will lose a lot more data.

You say this as if it is a good thing, but IMO as an IT manager with decades of experience I might argue that it isn’t. The more copies of data you have, and the more different they are, the more time you spend managing them. And of course you are paying for the spinning rust for each copy, both for the disks and the electricity to spin them.

I would suggest that you need only 2 copies - an active one on your NAS, and a near-line backup (in another room of your home, or off-site). So my recommendation would be that you simplify your data storage and consolidate disks and pools as much as you can, with as few as possible redundant pools on your NAS and a single copy held elsewhere.

In my own case, I hold:

  • master data for some shared files on a redundant pool on the NAS, with backup copies on a PC;
  • master data for personal files on my PC, with automated backups to the NAS;
  • single copy of media files on a redundant pool on the NAS.
  • NAS app data on a non-redundant SSD, with replication to the HDD pool as backup in case of drive failure.
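
For the last point, the replication is just a scheduled Replication Task in the TrueNAS GUI; under the hood it amounts to roughly the following (pool, dataset and snapshot names are placeholders, shown only to illustrate the idea):

  sudo zfs snapshot -r ssdpool/apps@manual-backup
  sudo zfs send -R ssdpool/apps@manual-backup | sudo zfs recv -F hddpool/apps-backup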

Had you asked here first, we could have advised that this would likely be the case. Obviously it is your NAS and your decision to make, but the advice you get here from experts (and I am not nearly as expert as others here) is at least worth serious consideration.

DO NOT USE FLASH DRIVES AS A BOOT DEVICE!!! They will get too many writes and fail quickly, and are genuinely unsuitable!!!

If you have to use a USB then use an SSK USB SSD (which looks like a flash drive but is a proper SSD). But IME (as someone who has experience of the pros and cons of using a single USB SSD as both boot drive and app pool), given a choice for a boot drive you would still be better off using a partitioned NVMe drive (unsupported) than a USB SSD drive (unsupported).

The lack of 3.5" slots is a significant constraint. A different case that supports 4x 3.5" slots would not be too expensive and might be a good investment.

As for reconsidering the disk layout, you have lots of other storage devices holding other redundant copies - move the data off, add two more 16TB 3.5" drives and reconfigure as a RAIDZ1. Then consolidate data onto this NAS from multiple other sources to reduce the number of copies and reduce your storage management overhead.

1 Like