Why are NVMe Speeds Miserable on ZFS?

Hi,

I recently got a bunch of used Intel NVMe SSDs (DC P4510) to use with ZFS and TrueNAS. I had tried 980 Pros before and the results were very disappointing, and I assumed it had to do with old, second-hand, or non-enterprise drives. As a test, I tried 10x Intel in a stripe in TrueNAS with all default settings except for compression, which I turned off. The results on a Dell R740xd server with a 24-bay NVMe backplane were very disappointing, so I tried a newer Gigabyte server but got much the same results. I also tried Btrfs on Debian just to see if it was a ZFS issue, but got similar results.

In all cases and on all machines that I tested, direct speeds on a single NVMe drive without ZFS were much faster than a 10x ZFS or Btrfs stripe. This appears to be an NVMe-specific issue, as I got very respectable speeds (R: 209k / W: 89.7k IOPS) with a 10x stripe on a 12G SAS JBOD shelf using 5x PM1643a and 5x WD SC550 SAS SSDs and an HBA.

Results:

| System / Configuration | OS / Mode | Read IOPS | Read BW | Write IOPS | Write BW | Avg Latency |
|---|---|---|---|---|---|---|
| Dell R740xd (10x stripe) | TrueNAS | 12.5k | 97 MiB/s | 5.3k | 42 MiB/s | 28.6 ms |
| Dell R740xd (single disk) | Debian, raw | 227.0k | 1771 MiB/s | 97.2k | 759 MiB/s | 1.4 ms |
| G293-S42-AAP1 (2x stripe) | TrueNAS | 13.5k | 105 MiB/s | 5.7k | 45 MiB/s | 26.6 ms |
| G293-S42-AAP1 (single disk) | TrueNAS, raw | 199.0k | 1553 MiB/s | 85.2k | 666 MiB/s | 1.8 ms |

And the specs:

| Feature | Dell PowerEdge R740xd | Gigabyte G293-S42-AAP1 |
|---|---|---|
| CPU | Dual Intel Xeon Gold 6130 (Skylake) | Single Intel Xeon Gold 6538Y+ (Emerald Rapids) |
| Cores/Threads | 16 cores / 32 threads | 32 cores / 64 threads |
| Base clock | 2.10 GHz | 2.20 GHz |
| Memory capacity | 16 GB (2x 8 GB) | 32 GB (1x 32 GB) |
| Memory type | DDR4-2133 ECC | DDR5-4800 ECC |
| Storage tier | NVMe (10-disk stripe) | NVMe (2-disk stripe) |
| OS environment | TrueNAS / Debian | TrueNAS / Debian |

I tried all sorts of things, including different block sizes and sync off, but everything made only a negligible difference. The fio command I used for all results, including the SAS JBOD, was:

fio --name=db_oltp --filename=/dev/nvme1n1 --size=20G --rw=randrw --rwmixread=70 --bs=8k --direct=1 --ioengine=libaio --iodepth=64 --numjobs=8 --runtime=60 --time_based --group_reporting

Can anyone shed any light on this or point me in the right direction for getting the best speeds from these NVMe drives? I can’t find much online specifically about ZFS and NVMe.

What do you mean when you say TrueNAS vs. TrueNAS raw?

You make no mention of virtualisation, but specifically calling one result raw implies that the others aren’t. If that is an accurate understanding, please elaborate on how the virtualisation is configured, given that your single-disk TrueNAS raw result is within ~10% of your single-disk Debian raw result.

No virtualisation, just straight on the TrueNAS machine via SSH or the shell in the UI. I didn’t want virtualisation or network overheads to confuse the results. When I say raw I mean directly on /dev/nvme1n1 rather than on the pool, which was /mnt/nvme/testfile. Hope that makes sense.

In that case my guess would be that it’s related to some oddity in the configuration of the drives on the 24-slot NVMe backplane. For example, the system may expect specific slots to be populated in order to control which CPU(s) gets the load.

I had considered that, and that’s why I tried the 2x stripe as well. I also tried moving the drives so they were all on PCIe lanes from one CPU only. The Gigabyte server only has one CPU in it, so only 2 of the 4 NVMe bays work anyway.
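For anyone else checking this, the controller-to-NUMA-node mapping can be read straight from sysfs (a quick sketch, assuming a Linux host; a value of `-1` means the platform reports no locality info for that slot):

```shell
# List which NUMA node each NVMe controller's PCIe slot is attached to.
for d in /sys/class/nvme/nvme*; do
    [ -e "$d/device/numa_node" ] || continue   # skip if no NVMe controllers present
    printf '%s: numa_node=%s\n' "${d##*/}" "$(cat "$d/device/numa_node")"
done
```

Pinning the fio jobs to the same node (e.g. with `numactl --cpunodebind`) would then take cross-socket traffic out of the equation.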

When I tried the 980 Pros a while back, I was using virtualisation and tried different NUMA settings, but none of it made any difference. This was on an old R730xd, which still performed very well with SAS SSDs. Until I saw the performance of the SAS SSDs, I had assumed that ZFS was simply not well suited to SSDs.

What is your CPU utilisation during a test?


Also, I’ve run similar tests over my pm983/d5-p5530 mirror:

## encryption&compression ON
Run status group 0 (all jobs):
   READ: bw=589MiB/s (618MB/s), 589MiB/s-589MiB/s (618MB/s-618MB/s), io=34.5GiB (37.1GB), run=60001-60001msec
  WRITE: bw=253MiB/s (265MB/s), 253MiB/s-253MiB/s (265MB/s-265MB/s), io=14.8GiB (15.9GB), run=60001-60001msec

## encryption&compression OFF
Run status group 0 (all jobs):
   READ: bw=1565MiB/s (1641MB/s), 1565MiB/s-1565MiB/s (1641MB/s-1641MB/s), io=91.7GiB (98.5GB), run=60001-60001msec
  WRITE: bw=671MiB/s (704MB/s), 671MiB/s-671MiB/s (704MB/s-704MB/s), io=39.3GiB (42.2GB), run=60001-60001msec

It’s not something I really looked at, but Gemini sent me down a rabbit hole of high core count vs. faster single-core performance. So I did check a couple of times: CPU was 6% with Debian and I think 23% with TrueNAS on one of the tests I looked at. I only checked on the Dell R740xd. I don’t imagine NVMe storage is more CPU-intensive than SAS, though.

One of the things that sticks out is the latency difference, which seems quite high and, I imagine, bad for IOPS.
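In fact the latency and IOPS numbers are two views of the same thing: the fio command keeps `--iodepth=64` × `--numjobs=8` = 512 I/Os in flight, so by Little’s law the average latency is simply in-flight ÷ total IOPS. A quick sanity check against the table above (plain arithmetic, no assumptions beyond the reported numbers):

```python
IN_FLIGHT = 64 * 8  # --iodepth=64 * --numjobs=8 from the fio command

def expected_latency_ms(read_iops: float, write_iops: float) -> float:
    """Little's law: mean latency = outstanding I/Os / total throughput (IOPS)."""
    return IN_FLIGHT / (read_iops + write_iops) * 1000

# R740xd 10x stripe: 12.5k read + 5.3k write IOPS
print(round(expected_latency_ms(12_500, 5_300), 1))    # → 28.8 (28.6 ms measured)

# R740xd single disk: 227k read + 97.2k write IOPS
print(round(expected_latency_ms(227_000, 97_200), 1))  # → 1.6 (1.4 ms measured)
```

So the high latency isn’t a separate symptom; it’s the same low IOPS seen through a saturated queue.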

IMO, you should repeat the test with disabled compression anyway. Also, some sources state that database datasets should be tuned with logbias=throughput. OTOH, it probably would make latency even worse…
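For reference, the tuning I mean is something like the following (the dataset name `tank/nvme` is a placeholder; `logbias=throughput` trades latency for throughput on synchronous writes):

```shell
zfs set compression=off tank/nvme
zfs set logbias=throughput tank/nvme
zfs get compression,logbias,sync tank/nvme   # verify the properties took effect
```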

Compression was off for everything. Also tried sync off.
