Optimal drive layout for 14x NVMe, enterprise level

Hello TrueNAS community. I'm looking for some guidance from others who have experience with the level of enterprise equipment I will be deploying.

The server will be a Dell PowerEdge R7525.
2x AMD EPYC Milan 7443
512GB Micron ECC RAM (8x 64GB, 2666MHz)
14x Samsung PM1733 15.36TB U.2 NVMe Gen 4 (possibly 16x) (Link)

The hardware selection was chosen for a few reasons. The R7525 is the only 15th generation Dell that does not oversubscribe the PCIe lanes for the NVMe drives. (more information)

The EPYC 7443 was a tradeoff between cost and performance. Of the CPU options available from the refurb vendor I bought the server from, the 7443 had the 3rd best single-threaded performance (important for SMB workloads) while still giving satisfactory overall performance.

The workloads will be mostly large datasets in the 50-80+ TB range, for large uncompressed CAD modeling simulations and other engineering data. We expect 40+ engineers hitting this storage via SMB all day every day. Probably 70% read workload, with 30% writes. Expect somewhere around 15Gb/sec throughput to it pretty consistently.

Now on to the crux of my question. I am looking for some drive layout guidance for the 14x (possibly 16x) NVMe drives. Traditional wisdom for RAIDZ layouts really has not caught up to the kind of performance that can be extracted from NVMe drives.

Scrubs and resilver times are not really a concern. The whole argument of “what if a drive fails during rebuild” doesn’t really factor in either. These drives’ MTBF is measured in millions of hours, and their write endurance in dozens of petabytes.

My original thought was to simply do a 14-wide RAIDZ1, giving me an estimated ~184TB usable. However, I’ve been going back and forth on whether I should do a RAIDZ2, or perhaps 2x 7-wide RAIDZ1, etc. I might be over-thinking things, but I would love some guidance from those who have deployed servers with this level of horsepower, especially with this number of NVMe drives, and to hear what your experience has been.
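For reference, the layouts I’m weighing would look roughly like this at the command line (pool name and device paths are just placeholders, written with bash brace expansion for brevity; the actual pool would of course be built through the TrueNAS UI):

zpool create tank raidz1 /dev/nvme{0..13}n1                            # one 14-wide RAIDZ1
zpool create tank raidz2 /dev/nvme{0..13}n1                            # one 14-wide RAIDZ2
zpool create tank raidz1 /dev/nvme{0..6}n1 raidz1 /dev/nvme{7..13}n1   # 2x 7-wide RAIDZ1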


I have not deployed servers with this level of horsepower/NVMe, but I have heard that the ARC itself can slow down this kind of storage, and that it’s best to configure it for metadata only.
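If you do try that, it’s a per-dataset property; a minimal sketch, assuming a dataset named tank/cad (placeholder name):

zfs set primarycache=metadata tank/cad
zfs get primarycache tank/cad

Worth benchmarking with and without it, since metadata-only means data blocks are never cached in RAM.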

I would probably use 2 x raidz1.

The question then is: you’re only using about half the bays… did you want to reserve the others for expansion?

And would it make sense to ensure that you split the drives between CPUs? Or maybe it’s better to have all of the drives on one CPU…

You should thoroughly test before putting it into production, preferably with a real clone of the actual dataset…


Historically, a RAIDZ layout limited write performance significantly, since individual drives (especially HDDs, but to a certain extent also SSDs) are limited in their individual capabilities.
As such, having a single RAIDZ vdev meant limiting write performance to what a single drive could do.
The solution was to use more vdevs, multiplying the (aggregated) performance by the number of vdevs.

The same is still true for NVMe, but with the significantly increased performance of an individual NVMe drive it is less of a problem. It also helps that NVMe really shines under heavier load (i.e. multiple parallel processes accessing it; Optane-based drives are the exception and also work well at lower thread counts).

However, the basic problem is still valid, just on a higher performance level.

So the general rule of thumb is that per RAIDZ vdev you get the write performance of a single drive (it might be a bit less dire nowadays, but it’s a good starting point).

So if you think the 40 engineers’ workload will run fine on a single NVMe drive, then that’s fine. If you think (or test and find) it would be overloaded, then having more vdevs would be recommended.
Whether you run multiple RAIDZ vdevs to maximize space, or multiple mirrors to maximize performance (given that you have enough parallel threads), is up to your requirements.
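If you want to put a number on “overloaded”, a rough sketch with fio, assuming a test dataset mounted at /mnt/tank/test and a job count approximating your 40 engineers (paths, sizes and numjobs are placeholders to adjust):

fio --name=seqread --directory=/mnt/tank/test --rw=read --bs=1M --size=10G --numjobs=40 --ioengine=psync --group_reporting
fio --name=mixed7030 --directory=/mnt/tank/test --rw=rw --rwmixread=70 --bs=1M --size=10G --numjobs=40 --ioengine=psync --group_reporting

Run those against a single-vdev layout and a two-vdev layout and compare; destroying and recreating an empty NVMe pool between runs only takes minutes.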

Another aspect is expandability. At this point you cannot add drives to an existing RAIDZ vdev; you can only add new vdevs to the pool. Not sure if you can add differently sized vdevs to an existing pool nowadays, but it used to be identically sized ones only.
So if you have a 14-drive Z1 (which is totally not recommended, by the way; 11 drives was the recommended limit IIRC) you would need to add 14 more drives as a second vdev. If you use 2x 7-wide Z1 instead, you can add 7 more drives as a third vdev, and so on.

This is usually considered a benefit of mirrors: it’s very easy to expand with just two drives, and you quickly see a performance improvement.
The downside is still that you lose a lot of space (50% of the total), and of course if the wrong two drives fail your pool is toast. That’s usually not so much of an issue with NVMe, since at the first drive failure you can replace it and an NVMe mirror resilvers fairly quickly.
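For what it’s worth, expansion in either direction is just a zpool add; a sketch with placeholder device names:

zpool add tank raidz1 /dev/nvme{14..20}n1            # add a second 7-wide RAIDZ1 vdev
zpool add tank mirror /dev/nvme14n1 /dev/nvme15n1    # or grow a mirror pool two drives at a time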

It’s funny you mention CPU mapping for the NVMe drives. I asked the community about that exact thing over a month ago with no response.

I did, however, get some good guidance on the topic over at the OpenZFS GitHub.


That forum went read-only a few days after your post. Unfortunate timing.

Plus, perhaps people don’t know :wink:

That is a good thread :slight_smile:

Amotin (@mav) is an iX employee, FWIW.

So, basically, don’t worry about the NUMA stuff. Either disable half your lanes and CPUs :wink: or forget about it, since ZFS is not smart enough to take advantage of it anyway.

Use a metadata-only cache.

When you expand your raidz, then perhaps consider a rebalance script.

And I’d suggest verifying the sync write speed of the Samsungs, and if necessary think about a SLOG.
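A quick sanity check, as a sketch (paths and dataset names are placeholders): write with an fsync after every block, then compare to the same run with sync disabled on a scratch dataset.

fio --name=syncwrite --directory=/mnt/tank/test --rw=write --bs=128k --size=4G --ioengine=psync --fsync=1
zfs set sync=disabled tank/test    # for comparison only; don't leave this set on real data

If the gap is huge and your workload ends up issuing sync writes over SMB, that’s when a SLOG is worth a look; the PM1733s should have power-loss protection, so they may be fine on their own.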

So the general rule of thumb is that per raidz vdev you get the write performance of a single drive

Can you link to any documentation that explains this in more detail?

At this point you cannot add drives to a raidZ pool, you only can add new vdevs.

I feel that by the time we come up against the limits of the existing storage, the new code to expand existing vdevs will have made it into TrueNAS.

So if you have a 14 drive Z1 (which is totally not recommended by the way, 11 drives was the recommended limit iirc)

Says who? And why? Again, I really feel like this is old spinning-rust thinking. We’re in a new world now; NVMe changes the game.

That rule applies to IOPS, not streaming writes; my bad.

And yes, the 11-drive limit might well be a historical remnant that is invalid for NVMe. It probably has to do with resilvering speed. If you don’t like it, ignore it; ZFS will work either way.

That is kind of my point. That white paper is from 4 years ago, and NVMe has come a long way in that time. In addition, that whitepaper uses disks with a few thousand IOPS, read speeds of 1,200 MB/s, and write speeds of 100 MB/s.
We’re measuring NVMe in millions of IOPS, with reads and writes in GB/sec, now. I think NVMe requires us to completely rethink how we approach these traditional anecdotes.

You might be right, you might be wrong :)
The only people who might know are usually employed by companies that sell those systems for half a mil plus, and those companies usually don’t share the details.

That leaves the ‘free’ user with few options:

- trust potentially old/outdated info,
- find someone who has knowledge of hardware as close to yours as it can be and who is willing to tell you about it, or
- just run a comprehensive set of tests yourself (and potentially share the results).

In the end, almost no one will have your exact setup and requirements, so testing yourself is something you should always do anyway.

In your particular use case, the question seems to be more of a process issue:

  1. Can you afford (space-wise) to run two RAIDZ vdevs?
  2. Can you perform a real test, with the ability to go back if you have to?

If 1 = no, then there is no alternative (a single vdev).
If 1 = yes and 2 = no, then 2 vdevs it is.
If 1 = yes and 2 = yes, then just test 1 vdev and, if it’s not good enough, adjust. Rinse and repeat till you find the optimal space/performance config for your setup.


Yes, I am comfortable running 2 vdevs. But if I run 2, I’m probably going to buy 2 more drives and make it a clean 16 drives, using 2x 8-wide RAIDZ1, estimated to give me ~199TB usable.

That way, if we expand, I’ll add another 8 drives and that will complete the 24x backplane.
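In zpool terms, the plan is roughly this (placeholder device names; the real pool will be built through the TrueNAS UI):

zpool create tank raidz1 /dev/nvme{0..7}n1 raidz1 /dev/nvme{8..15}n1   # 2x 8-wide RAIDZ1 now
zpool add tank raidz1 /dev/nvme{16..23}n1                              # third 8-wide vdev later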


Also, I’m not sure how expensive the parity and compression calculations get relative to this much bandwidth.

Mirrors have no parity calculations.

The “IOPS of a vdev = IOPS of a single disk” rule still applies with NVMe; it’s just that the IOPS can be much, much higher.

4 years is not much, and “new” technology never works miracles.
SSDs having a much lower URE rate and faster resilvers may well “resurrect” RAID5/RAIDZ1; it does not make arbitrarily wide vdevs a good idea.

ZFS was designed when SSDs did not exist. Some of the concepts and code might scale to accommodate unexpected levels of performance; some other parts are known to NOT scale well (e.g. anything involving cache). And NVMe performance also brings unexpected bottlenecks. @NickF1227 has provided some good food for thought with data showing that adding a PCIe switch may actually increase the performance of NVMe pools by offloading bifurcation work from the CPU.

Capacity storage on NVMe is not much explored yet, so the questions you’re asking probably have no answer yet.

For the stated application in the GitHub thread, namely indefinite storage of large static media files, I think that HDDs are not dead yet. Or maybe you could consider tiered storage: an NVMe pool for files which still have to be accessed frequently, and an HDD pool for long-term storage.

I’d like to accentuate that creating a wider vdev does not automatically mean you get better space efficiency. That is true only if your blocks (and files) are big enough for each device in the vdev to get at least 8KB, or better 32KB, after compression. So in the case of an 8-wide RAIDZ1 you’d better have block sizes after compression of at least 256KB. Just last week I had a couple of cases where a user put a ZVOL with a 32KB volblocksize onto a 9-wide RAIDZ2, as a result getting the worst possible IOPS, the highest possible CPU overhead, and a space efficiency of only 66%. He could have gotten the same space efficiency at 3x the IOPS had he used 3x 3-wide RAIDZ1, or 80% efficiency at 2x the IOPS with 2x 5-wide RAIDZ1. While IOPS are cheaper on NVMe devices, the per-IOPS CPU overhead can just explode, so it had better be for a good reason.
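To spell out the arithmetic behind that 66% figure (assuming 4K sectors, i.e. ashift=12): a 32KB block is 8 data sectors. On a 9-wide RAIDZ2, each row holds up to 7 data sectors plus 2 parity sectors, so the block is written as 7 data + 2 parity in one row and 1 data + 2 parity in the next, i.e. 8 data + 4 parity = 12 sectors, and 8/12 ≈ 66%. With 1MB records on the same layout, the efficiency approaches the nominal 7/9 ≈ 78%.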


The workloads are 90% files that are 20-40GB each. Very few files will be anywhere near that small.

We bit the bullet and ordered 2 more drives this morning. So we’re now targeting 2x8-wide RAIDZ1 with an estimated usable space of ~200TB.

We plan to follow all of the tuning recommendations from the thread below.


Sounds like you need to ensure that you’re not using the default 128KiB block size on the dataset.

Correct, we will be setting the record size to 1M.

zfs set recordsize=1M pool/dataset