I’ve recently built my second TrueNAS SCALE (Dragonfish) system, on a Dell PowerEdge R730.
My plan was to use it as my main “multi-purpose NAS”, so in addition to my regular “big pool” (a 7-wide RAIDZ2 of HDDs), I decided to add a “fast pool” as well, built from three 1 TB WD SN750 NVMe gumsticks.
Specs:
2 x E5-2660 v4 (28 cores/56 threads)
756GB of DDR4 RAM at 1833 MHz
2 x WD SN750 1 TB (M.2 to PCIe adapter)
1 x WD SN750 SE 1 TB (M.2 to PCIe adapter)
4 x QLogic 10GbE SFP+ (not relevant, as all tests are done locally on the TrueNAS box)
HBA200 controller flashed to IT mode (not relevant for the NVMes)
LSI 9200-8e controller flashed to IT mode (not relevant for the NVMes)
2 x Intel Optane 32GB (SLOG for the HDD pool) (M.2 to PCIe adapter)
2 x WD SN730 (soon to be metadata vdev for the HDD pool) (M.2 to PCIe adapter)
The Dell BIOS is updated to the latest version. The machine has 80 PCIe 3.0 lanes, so lane count should not be the problem.
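Just to rule out a link issue, the negotiated link of each NVMe can be checked with lspci (the bus address below is a placeholder); for these drives I’d expect “Speed 8GT/s, Width x4”:

lspci | grep -i 'non-volatile'          # find the NVMe controllers and their bus addresses
lspci -vv -s 03:00.0 | grep -i lnksta   # 03:00.0 is a placeholder address taken from the line above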
And these NVMe drives are driving me crazy! I can’t achieve even 10% of the performance that a single one of these drives is capable of! I do understand that benchmarking these drives under ZFS is a complicated process, and I realize I don’t fully understand the interplay of queue sizes, queue depths, ashift and sector sizes, but I have tried so many combinations of configuration and benchmarks that I’m sure something is really wrong here.
These are 4K-sector drives, so my research concluded that ashift should be 12, the default.
I’ve also tried ashift=13, though, without any difference in performance.
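For what it’s worth, here is how the ashift in use can be double-checked (the pool name “fast” is just an example; a property value of 0 means auto-detected, and the per-vdev value shows up in the zdb output):

zpool get ashift fast
zdb -U /data/zfs/zpool.cache -C fast | grep ashift   # TrueNAS SCALE keeps its cachefile here, as far as I know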
- I’ve tried 4K, 16K, 32K and 128K (default) record sizes on the datasets.
- Compression, dedup and atime are disabled.
- ARC is disabled on the testing datasets (zfs set primarycache=none); see the property commands below.
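Spelled out as CLI, those dataset tweaks amount to roughly this (the dataset name “fast/test” is just an example):

zfs set recordsize=128k fast/test    # also tried 4k, 16k and 32k
zfs set compression=off fast/test
zfs set dedup=off fast/test
zfs set atime=off fast/test
zfs set primarycache=none fast/test  # keep ARC out of the benchmark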
None of these combinations makes any noticeable difference; everything performs badly.
On a single-drive pool (1x stripe vdev) on the SN750 SE I can’t get over 250 MB/s of writes.
On a mirror vdev (2 drives): same bad performance.
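For clarity, the CLI equivalent of the layouts I’ve tested would be roughly this (device paths are placeholders):

zpool create -o ashift=12 fast /dev/disk/by-id/nvme-<drive1>                                        # single-drive stripe
zpool create -o ashift=12 fast mirror /dev/disk/by-id/nvme-<drive1> /dev/disk/by-id/nvme-<drive2>   # 2-way mirror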
I’ve tested all kinds of different settings with fio.
Here is an example where I should AT LEAST get 1 GB/s (or 3 GB/s to be honest, as this is PCIe 3.0):
fio --bs=128k --direct=1 --ioengine=posixaio --iodepth=32 --group_reporting --name=write --numjobs=8 --ramp_time=5 --runtime=30 --rw=write --size=10G --time_based
write: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=posixaio, iodepth=32
...
fio-3.33
Starting 8 processes
Jobs: 8 (f=8): [W(8)][8.6%][w=60.6MiB/s][w=485 IOPS][eta 06m:25s]
write: (groupid=0, jobs=8): err= 0: pid=1399108: Sat May 11 11:21:09 2024
write: IOPS=1606, BW=202MiB/s (212MB/s)(6138MiB/30408msec); 0 zone resets
slat (nsec): min=1926, max=537381, avg=9393.23, stdev=8239.39
clat (msec): min=35, max=872, avg=158.19, stdev=143.53
lat (msec): min=35, max=872, avg=158.20, stdev=143.53
clat percentiles (msec):
| 1.00th=[ 37], 5.00th=[ 38], 10.00th=[ 39], 20.00th=[ 42],
| 30.00th=[ 53], 40.00th=[ 73], 50.00th=[ 102], 60.00th=[ 138],
| 70.00th=[ 184], 80.00th=[ 257], 90.00th=[ 409], 95.00th=[ 498],
| 99.00th=[ 550], 99.50th=[ 550], 99.90th=[ 617], 99.95th=[ 693],
| 99.99th=[ 776]
bw ( KiB/s): min=26368, max=866773, per=100.00%, avg=208420.60, stdev=25823.34, samples=480
iops : min= 206, max= 6771, avg=1628.15, stdev=201.74, samples=480
lat (msec) : 50=28.28%, 100=21.31%, 250=30.08%, 500=15.85%, 750=4.74%
lat (msec) : 1000=0.01%
cpu : usr=0.29%, sys=0.05%, ctx=12354, majf=0, minf=5785
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=10.2%, 16=64.8%, 32=25.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=95.9%, 8=0.9%, 16=1.9%, 32=1.3%, 64=0.0%, >=64=0.0%
issued rwts: total=0,48852,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=202MiB/s (212MB/s), 202MiB/s-202MiB/s (212MB/s-212MB/s), io=6138MiB (6436MB), run=30408-30408msec
During the benchmark I can see the write speed bump up to about 1 GB/s for a short period before it drops down to ~50 MB/s. I am fully aware that ZFS is not a performance-oriented filesystem, but this… I get better big-block sequential writes out of my HDDs than out of the NVMes.
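For reference, an easy way to watch that behaviour live is a per-vdev iostat loop on the pool (“fast” is again just the example pool name):

zpool iostat -v fast 1    # per-vdev throughput, refreshed every second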
A “dd if=/dev/zero of=nvmedrive bs=1M count=10240” (10 GB of zeroes; compression is off anyway) confirms the shitty performance.
I don’t know where to look for bottlenecks here. No controller is involved, and the NVMe temperatures are fine (~40°C). Does anyone have an idea what might be going on? Any advice is really appreciated.