NVMe Nightmare

I’ve recently built my second TrueNAS SCALE (Dragonfish) system, on a PowerEdge R730.
My plan was to use this as my main “multi-purpose NAS”, so in addition to my regular “big pool” (a 7-wide RAIDZ2), I decided to go for a “fast pool” as well, based on 3 x 1 TB NVMe gumsticks (WD SN750).

2 x E5-2660 v4 (28 cores / 56 threads total)
756 GB of DDR3 1833 MHz RAM
2 x WD SN750 1 TB (M.2-to-PCIe adapter)
1 x WD SN750 SE 1 TB (M.2-to-PCIe adapter)
4 x QLogic 10 GbE SFP+ (not relevant, as tests are done locally on the TrueNAS box)
HBA200 controller flashed to IT mode (not relevant for the NVMe drives)
LSI 9200-8e controller flashed to IT mode (not relevant for the NVMe drives)
2 x Intel Optane 32 GB (SLOG for the HDD pool) (M.2-to-PCIe adapter)
2 x WD SN730 (soon to be the metadata vdev for the HDD pool) (M.2-to-PCIe adapter)

Dell BIOS updated to the latest version. The machine has 80 PCIe 3.0 lanes, so lane count should not be a problem.

And these NVMe drives are driving me crazy! I cannot achieve even 10% of the performance that one of these drives is capable of! I do understand that benchmarking these drives under ZFS is a complicated process, and I realize I don’t fully understand the complexities of queue sizes, queue depths, ashift and sector sizes, but I have tried so many combinations of configuration and benchmarks that I am sure there is something really wrong here.

These are 4K-sector drives, so my research concluded that ashift should be 12, the default.
Although I’ve also tried 13, without any difference in performance.
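As a sanity check: ashift is just log2 of the sector size, so a 4096-byte sector maps to ashift=12 (and 8192 bytes would be 13). A minimal shell sketch of that mapping:

```shell
# ashift is log2 of the physical sector size:
# 512 B -> 9, 4096 B -> 12, 8192 B -> 13
sector=4096   # example value; substitute the sector size your drive reports
ashift=0
s=$sector
while [ "$s" -gt 1 ]; do
  s=$((s / 2))
  ashift=$((ashift + 1))
done
echo "sector=$sector ashift=$ashift"   # prints: sector=4096 ashift=12
```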

  • I’ve tried 4K, 16K, 32K and 128K (the default) record sizes on the datasets.
  • Compression, dedup and atime are disabled.
  • ARC is disabled on the test datasets (zfs set primarycache=none).
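For anyone repeating these tests, the dataset settings above correspond to commands along these lines (the dataset name fastpool/bench is just an illustrative placeholder, not the real one):

```shell
# Illustrative dataset tuning for the benchmark runs
# (fastpool/bench is a placeholder dataset name)
zfs set recordsize=128K fastpool/bench    # also tested with 4K, 16K and 32K
zfs set compression=off fastpool/bench
zfs set dedup=off fastpool/bench
zfs set atime=off fastpool/bench
zfs set primarycache=none fastpool/bench  # keep ARC out of the measurements
```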

None of these combinations does anything for performance; everything performs badly.
On a single drive (1x stripe vdev), the SN750 SE can’t get over 250 MB/s write.
On a mirror vdev (2 drives): the same bad performance.

I’ve tested all kinds of different settings with fio.
Here is an example where I should AT LEAST get 1 GB/s (or 3 GB/s, to be honest, as this is PCIe 3.0):

fio --bs=128k --direct=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=write --numjobs=8 --ramp_time=5 --runtime=30 --rw=write --size=10G --time_based

write: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=posixaio, iodepth=32
write: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=posixaio, iodepth=32
Starting 8 processes
Jobs: 8 (f=8): [W(8)][8.6%][w=60.6MiB/s][w=485 IOPS][eta 06m:25s]       
write: (groupid=0, jobs=8): err= 0: pid=1399108: Sat May 11 11:21:09 2024
  write: IOPS=1606, BW=202MiB/s (212MB/s)(6138MiB/30408msec); 0 zone resets
    slat (nsec): min=1926, max=537381, avg=9393.23, stdev=8239.39
    clat (msec): min=35, max=872, avg=158.19, stdev=143.53
     lat (msec): min=35, max=872, avg=158.20, stdev=143.53
    clat percentiles (msec):
     |  1.00th=[   37],  5.00th=[   38], 10.00th=[   39], 20.00th=[   42],
     | 30.00th=[   53], 40.00th=[   73], 50.00th=[  102], 60.00th=[  138],
     | 70.00th=[  184], 80.00th=[  257], 90.00th=[  409], 95.00th=[  498],
     | 99.00th=[  550], 99.50th=[  550], 99.90th=[  617], 99.95th=[  693],
     | 99.99th=[  776]
   bw (  KiB/s): min=26368, max=866773, per=100.00%, avg=208420.60, stdev=25823.34, samples=480
   iops        : min=  206, max= 6771, avg=1628.15, stdev=201.74, samples=480
  lat (msec)   : 50=28.28%, 100=21.31%, 250=30.08%, 500=15.85%, 750=4.74%
  lat (msec)   : 1000=0.01%
  cpu          : usr=0.29%, sys=0.05%, ctx=12354, majf=0, minf=5785
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=10.2%, 16=64.8%, 32=25.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.9%, 8=0.9%, 16=1.9%, 32=1.3%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,48852,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=202MiB/s (212MB/s), 202MiB/s-202MiB/s (212MB/s-212MB/s), io=6138MiB (6436MB), run=30408-30408msec

During the benchmark I can see the speed bump up to about 1 GB/s for a short period before it ramps down to 50 MB/s. I am fully aware that ZFS is not a performance-oriented filesystem, but this… I get better big-block sequential writes on my HDDs than on the NVMe drives.

A “dd if=/dev/zero of=nvmedrive bs=1M count=10240” confirms the shitty performance.

I don’t know where to look for bottlenecks here. No controller is involved, and the NVMe temperatures are fine (40 °C). Does anyone have an idea what might be going on? Any advice is really appreciated :slight_smile:

Here is another fio test, using a 4K block size.
The test starts OK, at 800 MB/s and over 100,000 IOPS. Then, after a few gigabytes of writes, it drops down to 50 MB/s and 1,000 IOPS.

Don’t mind the relatively high system load; there is a scrub running on the HDD pool at the moment. The results are the same under low system load.

fio --name=testing2 --bs=4k --direct=1 --size=5G --rw=write
testing2: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=1
Starting 1 process
testing2: Laying out IO file (1 file / 5120MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=233MiB/s][w=59.6k IOPS][eta 00m:00s]
testing2: (groupid=0, jobs=1): err= 0: pid=1594477: Sat May 11 11:33:17 2024
  write: IOPS=60.0k, BW=234MiB/s (246MB/s)(5120MiB/21834msec); 0 zone resets
    clat (usec): min=4, max=1378, avg=15.92, stdev=25.39
     lat (usec): min=4, max=1378, avg=16.02, stdev=25.42
    clat percentiles (usec):
     |  1.00th=[    5],  5.00th=[    6], 10.00th=[    6], 20.00th=[    7],
     | 30.00th=[    8], 40.00th=[    8], 50.00th=[    9], 60.00th=[   12],
     | 70.00th=[   14], 80.00th=[   16], 90.00th=[   22], 95.00th=[   34],
     | 99.00th=[  135], 99.50th=[  137], 99.90th=[  149], 99.95th=[  161],
     | 99.99th=[  221]
   bw (  KiB/s): min=50424, max=507592, per=99.92%, avg=239937.67, stdev=161784.43, samples=43
   iops        : min=12606, max=126898, avg=59984.42, stdev=40446.11, samples=43
  lat (usec)   : 10=53.89%, 20=34.60%, 50=7.14%, 100=0.12%, 250=4.25%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%
  cpu          : usr=10.38%, sys=63.52%, ctx=56567, majf=0, minf=25
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1310720,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=234MiB/s (246MB/s), 234MiB/s-234MiB/s (246MB/s-246MB/s), io=5120MiB (5369MB), run=21834-21834msec

So, these drives do NOT come in 4K format out of the box; they come formatted as 512. So ashift=12 is wrong.
I reformatted the NVMe drives to 4K by running:

nvme format /dev/nvme1n1 -l 1

Problem solved. Excellent performance!
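For anyone doing the same, the supported and active LBA formats can be checked before and after the reformat with nvme-cli or smartctl (device name taken from the command above; both need root and real hardware):

```shell
# Show supported LBA formats; the one marked "in use" is currently active
nvme id-ns -H /dev/nvme1n1 | grep "LBA Format"

# smartctl shows the same table under "Supported LBA Sizes"
smartctl -a /dev/nvme1n1 | grep -A 4 "Supported LBA Sizes"
```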

“Wrong” is a very strong statement. There’s no particular reason for ashift=12 to cause poor performance, outside of ridiculous edge cases involving mountains of 512-byte writes.

It’s far more likely that the SSDs were effectively reset by changing the block size. That’s good because it solves your immediate problem, but bad because it suggests that their long-term performance is going to suffer from use.


I do not think the ashift=12 value was incorrect. Have you checked it since your format command?

A little information as to why your speeds are now super fast: you erased all the data on the NVMe, thus removing a very time-consuming erase command from the write path. Once the drive needs to rewrite over used blocks of flash, it will slow down again, since it takes time to erase before writing. That is what I think is going on.

Your format command used -l 1 which means to format into 512 bytes. You should have used -l 2 to format to 4096 bytes.

I really am curious what your ashift is right now.

Reference for the nvme command:

“Wrong” is a very strong statement. There’s no particular reason for ashift=12 to cause poor performance, outside of ridiculous edge cases involving mountains of 512-byte writes.

Okay, maybe “wrong” was too strong a statement to use, but according to the OpenZFS documentation, NVMe drives work best in 4K format, and ashift should be set to match the disk format for best performance: Hardware — OpenZFS documentation

Your format command used -l 1 which means to format into 512 bytes. You should have used -l 2 to format to 4096 bytes.

No, not on my drives: here, 0 = 512 and 1 = 4096. This is the output from smartctl:

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 -     512       0         2
 1 +    4096       0         1

And here is the part of the smartctl output showing that I am now running 4K instead of 512 (as it showed before the reformat):

Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 8215
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size: 4096

I really am curious what your ashift is right now.

My ashift is still 12:

zpool get ashift fastpool
fastpool ashift 12 local

OK, so let’s say that was the case; could my performance issue then be resolved by running a “zpool trim” on the pool?

Is the behavior you describe expected on fast NVMe SSDs? What do you do to mitigate this issue and keep performance top notch?

If I now write 2 TB to the pool, will I run into the same performance issues again?

If TRIM is operating as it should, the empty blocks should be erased, and yes, that should mitigate the issue. Do you have TRIM enabled? I don’t know if it is enabled by default on TrueNAS, to be honest. I guess I should know that, and for which versions of TrueNAS as well.
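For reference, the TRIM-related knobs in OpenZFS look something like this (pool name taken from the thread; whether TrueNAS enables autotrim by default may vary by version):

```shell
# Check whether automatic TRIM is enabled on the pool
zpool get autotrim fastpool

# Enable continuous automatic TRIM
zpool set autotrim=on fastpool

# Or run a one-off manual TRIM and watch its progress
zpool trim fastpool
zpool status -t fastpool
```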

If you do write all that data to the NVMe, I do expect it to slow down. I’m not saying anyone should fill up solid-state storage, as that takes life out of the drive, but if you do, I’d be curious to know whether the speed is affected. Maybe you could set a reminder to run the tests again in a month or two, assuming you plan to put a lot of data on the system. This would be good for a lot of people to know; practical testing is better than theoretical.

I am a little surprised, but then again it is NVMe version 1.3, and of course manufacturers do what they will. It is good to see that your format is what you wanted. Did you look at this before you reformatted it, and did it say 512?

Now I need to check my own NVMe drives to see what they are formatted as. Not that it really matters in my case, since speed is not a factor for my build, but knowing the answer is actually important to me.