Fangtooth 25.04.02 disk benchmarks -- Docker and LXC mounts as good as bare metal -- LXC RootFS much slower

@yorick, I finished up some hardware reconfigs (moving my NVMe mirror to an appropriate number of PCIe lanes) and thought I’d run some benchmarks.

I upgraded to 25.04.02 (from 24.10.2.2) and everything went fine. I made sure my classic VMs had a VNC password, and yep, the autostart one was up and running after the upgrade as if nothing happened.

I then ran some fio tests on my NVMe mirror (a couple of Micron_7450_MTFDKCC3T2TFS drives) and noticed that the LXC container got much worse performance, while the bare-OS and Docker tests were comparable. I did some unsupported CLI reconfiguration of the container (to run it in privileged mode), and it then tested in line with the bare-metal and Docker runs.
Then I recreated the container (unprivileged) with a passed-in mount to a dataset on my pool and tested disk access on that path; performance was again comparable to the bare-metal and Docker numbers.
So, essentially, it's the RootFS on the LXC that is much slower.
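
For reference, the unsupported reconfig went through the Incus CLI that TrueNAS drives under the hood. Roughly the following, using my container name and paths (the device name "data" is arbitrary); treat it as a sketch of an unsupported procedure, not something iX condones:

```sh
# Sketch (unsupported): flip an existing container to privileged mode and restart it
incus config set Debian-test security.privileged=true
incus restart Debian-test

# Sketch: pass a host dataset into a container as /data (the bind-mount case)
incus config device add Debian-test data disk source=/mnt/TANK_NVME/fio/data path=/data
```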

Is this expected, or am I doing something wrong in LXC? There didn't seem to be much to configure on the container. Is there a faster way to run the LXC RootFS, or do you just need to make sure any performance-dependent load sits on a passed-in mount rather than on the RootFS?
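
In case it helps with diagnosis, the same Incus CLI can at least show how the RootFS is backed. Again a sketch; pool and instance names will differ per system:

```sh
# List the storage pools backing instances (the driver column should report zfs here)
incus storage list

# Dump the container's effective config, including its root disk device
incus config show Debian-test --expanded
```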

Summary of Results:

| Metric | Bare Metal (Host) | Docker Container | Privileged LXC | Unprivileged LXC (Bind Mount) | Unprivileged LXC (Container RootFS) |
|---|---|---|---|---|---|
| Read BW | 1610 MiB/s | 1597 MiB/s | 1578 MiB/s | 1621 MiB/s | 184 MiB/s |
| Write BW | 1611 MiB/s | 1598 MiB/s | 1579 MiB/s | 1621 MiB/s | 183 MiB/s |
| Read IOPS | 12,900 | 12,800 | 12,600 | 13,000 | 1,468 |
| Write IOPS | 12,900 | 12,800 | 12,600 | 13,000 | 1,467 |
| Avg. Read Latency | 0.41 ms | 0.24 ms | 0.66 ms | 0.64 ms | 6.04 ms |
| Avg. Write Latency | 0.05 ms | 0.23 ms | 0.29 ms | 0.28 ms | 2.12 ms |
| 99th Percentile Latency (Read) | 7.33 μs | 1.27 ms | 1.30 ms | 1.27 ms | 8.85 ms |
| 99th Percentile Latency (Write) | 2.83 μs | 1.27 ms | 1.32 ms | 1.29 ms | 8.98 ms |
| CPU Usage (Sys) | 17.42% | 17.56% | 18.27% | 18.64% | 3.52% |
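
(The numbers above were read off fio's human-readable output. If you'd rather script the comparison, fio can emit JSON; this is a sketch based on my 128K job, with the jq field names taken from fio's JSON schema:)

```sh
# Sketch: have fio emit machine-readable results instead of the text report
fio --name=test_random-rw --filename=./testfile --size=131072MiB --bs=128K \
    --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 \
    --ioengine=libaio --time_based=1 --group_reporting \
    --output-format=json --output=result.json

# Pull the headline numbers: bandwidth is reported in KiB/s, latency in ns
jq '.jobs[0] | {read_bw_kib: .read.bw, read_iops: .read.iops,
                write_bw_kib: .write.bw, write_iops: .write.iops,
                read_clat_mean_ns: .read.clat_ns.mean}' result.json
```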

Raw fio results

Bare Metal

root@TrueNAS02[/mnt/TANK_NVME/fio]# fio --name=test_random-rw --filename=./testfile --size=131072MiB --bs=128K --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting

test_random-rw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=2
...
fio-3.33
Starting 6 processes
Jobs: 6 (f=6): [m(6)][100.0%][r=1575MiB/s,w=1591MiB/s][r=12.6k,w=12.7k IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=13916: Mon Aug  4 15:27:49 2025
  read: IOPS=12.9k, BW=1610MiB/s (1688MB/s)(236GiB/150002msec)
    slat (usec): min=12, max=101781, avg=405.82, stdev=449.22
    clat (usec): min=4, max=101793, avg=238.10, stdev=376.15
     lat (usec): min=39, max=102803, avg=643.92, stdev=615.68
    clat percentiles (usec):
     |  1.00th=[   30],  5.00th=[   35], 10.00th=[   38], 20.00th=[   43],
     | 30.00th=[   48], 40.00th=[   53], 50.00th=[   59], 60.00th=[   71],
     | 70.00th=[  347], 80.00th=[  441], 90.00th=[  668], 95.00th=[  889],
     | 99.00th=[ 1336], 99.50th=[ 1582], 99.90th=[ 2606], 99.95th=[ 3621],
     | 99.99th=[ 8356]
   bw (  MiB/s): min= 1333, max= 1844, per=100.00%, avg=1611.48, stdev=12.68, samples=1794
   iops        : min=10666, max=14751, avg=12891.08, stdev=101.46, samples=1794
  write: IOPS=12.9k, BW=1610MiB/s (1689MB/s)(236GiB/150002msec); 0 zone resets
    slat (usec): min=17, max=30484, avg=49.91, stdev=98.81
    clat (usec): min=15, max=70020, avg=233.38, stdev=375.24
     lat (usec): min=39, max=70060, avg=283.29, stdev=395.40
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   29], 10.00th=[   31], 20.00th=[   37],
     | 30.00th=[   41], 40.00th=[   45], 50.00th=[   51], 60.00th=[   62],
     | 70.00th=[  347], 80.00th=[  441], 90.00th=[  668], 95.00th=[  898],
     | 99.00th=[ 1352], 99.50th=[ 1582], 99.90th=[ 2540], 99.95th=[ 3490],
     | 99.99th=[ 7504]
   bw (  MiB/s): min= 1326, max= 1906, per=100.00%, avg=1612.02, stdev=15.64, samples=1794
   iops        : min=10607, max=15253, avg=12895.43, stdev=125.15, samples=1794
  lat (usec)   : 10=0.01%, 20=0.01%, 50=42.09%, 100=22.23%, 250=2.39%
  lat (usec)   : 500=18.20%, 750=7.14%, 1000=4.52%
  lat (msec)   : 2=3.22%, 4=0.17%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=2.11%, sys=18.26%, ctx=1617600, majf=1, minf=89
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1931846,1932603,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=1610MiB/s (1688MB/s), 1610MiB/s-1610MiB/s (1688MB/s-1688MB/s), io=236GiB (253GB), run=150002-150002msec
  WRITE: bw=1610MiB/s (1689MB/s), 1610MiB/s-1610MiB/s (1689MB/s-1689MB/s), io=236GiB (253GB), run=150002-150002msec

Docker

root@TrueNAS02[/mnt/TANK_NVME/fio]# cat job.fio 
[test_random-rw]
filename=testfile
size=128G
#blocksize=1M
blocksize=128K
iodepth=2
rw=randrw
directory=/data
ioengine=libaio
numjobs=6
direct=1
group_reporting
time_based=1
runtime=150

root@TrueNAS02[/mnt/TANK_NVME/fio]# docker run --rm -v /mnt/TANK_NVME/fio/data:/data -v /mnt/TANK_NVME/fio/job.fio:/jobs/job.fio xridge/fio /jobs/job.fio
test_random-rw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=2
...
fio-3.13
Starting 6 processes

test_random-rw: (groupid=0, jobs=6): err= 0: pid=20: Mon Aug  4 19:15:15 2025
  read: IOPS=12.3k, BW=1540MiB/s (1614MB/s)(226GiB/150001msec)
    slat (usec): min=13, max=28363, avg=424.41, stdev=371.31
    clat (usec): min=2, max=27690, avg=248.80, stdev=333.85
     lat (usec): min=37, max=28930, avg=673.95, stdev=527.35
    clat percentiles (usec):
     |  1.00th=[   30],  5.00th=[   36], 10.00th=[   39], 20.00th=[   44],
     | 30.00th=[   49], 40.00th=[   54], 50.00th=[   61], 60.00th=[   88],
     | 70.00th=[  371], 80.00th=[  469], 90.00th=[  685], 95.00th=[  857],
     | 99.00th=[ 1237], 99.50th=[ 1450], 99.90th=[ 2474], 99.95th=[ 3425],
     | 99.99th=[ 6456]
   bw (  MiB/s): min= 1219, max= 1706, per=99.93%, avg=1538.69, stdev=13.61, samples=1796
   iops        : min= 9754, max=13651, avg=12308.71, stdev=108.86, samples=1796
  write: IOPS=12.3k, BW=1540MiB/s (1615MB/s)(226GiB/150001msec); 0 zone resets
    slat (usec): min=18, max=18444, avg=52.16, stdev=99.91
    clat (usec): min=2, max=28370, avg=243.06, stdev=333.42
     lat (usec): min=40, max=28415, avg=295.62, stdev=357.33
    clat percentiles (usec):
     |  1.00th=[   28],  5.00th=[   29], 10.00th=[   31], 20.00th=[   38],
     | 30.00th=[   42], 40.00th=[   46], 50.00th=[   54], 60.00th=[   75],
     | 70.00th=[  371], 80.00th=[  465], 90.00th=[  676], 95.00th=[  857],
     | 99.00th=[ 1237], 99.50th=[ 1450], 99.90th=[ 2376], 99.95th=[ 3261],
     | 99.99th=[ 5997]
   bw (  MiB/s): min= 1165, max= 1802, per=99.94%, avg=1539.36, stdev=16.71, samples=1796
   iops        : min= 9324, max=14420, avg=12314.24, stdev=133.65, samples=1796
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=39.19%, 100=22.06%
  lat (usec)   : 250=3.05%, 500=18.07%, 750=9.84%, 1000=5.09%
  lat (msec)   : 2=2.54%, 4=0.13%, 10=0.03%, 20=0.01%, 50=0.01%
  cpu          : usr=1.99%, sys=17.68%, ctx=1812259, majf=8, minf=100
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1847656,1848375,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=1540MiB/s (1614MB/s), 1540MiB/s-1540MiB/s (1614MB/s-1614MB/s), io=226GiB (242GB), run=150001-150001msec
  WRITE: bw=1540MiB/s (1615MB/s), 1540MiB/s-1540MiB/s (1615MB/s-1615MB/s), io=226GiB (242GB), run=150001-150001msec

LXC Privileged

root@Debian-test:~# fio --name=test_random-rw --filename=./testfile --size=131072MiB --bs=128K --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting
test_random-rw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=2
...
fio-3.33
Starting 6 processes
test_random-rw: Laying out IO file (1 file / 125000MiB)
Jobs: 6 (f=6): [m(6)][100.0%][r=1542MiB/s,w=1527MiB/s][r=12.3k,w=12.2k IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=3261: Mon Aug  4 19:50:20 2025
  read: IOPS=12.6k, BW=1578MiB/s (1655MB/s)(231GiB/150001msec)
    slat (usec): min=14, max=64096, avg=414.66, stdev=557.87
    clat (usec): min=4, max=63721, avg=242.67, stdev=427.77
     lat (usec): min=39, max=65166, avg=657.33, stdev=736.40
    clat percentiles (usec):
     |  1.00th=[   31],  5.00th=[   36], 10.00th=[   39], 20.00th=[   44],
     | 30.00th=[   49], 40.00th=[   55], 50.00th=[   62], 60.00th=[   77],
     | 70.00th=[  359], 80.00th=[  441], 90.00th=[  644], 95.00th=[  873],
     | 99.00th=[ 1303], 99.50th=[ 1516], 99.90th=[ 2573], 99.95th=[ 4113],
     | 99.99th=[15533]
   bw (  MiB/s): min= 1335, max= 1785, per=100.00%, avg=1580.00, stdev=13.44, samples=1794
   iops        : min=10684, max=14286, avg=12639.25, stdev=107.56, samples=1794
  write: IOPS=12.6k, BW=1579MiB/s (1655MB/s)(231GiB/150001msec); 0 zone resets
    slat (usec): min=18, max=48405, avg=50.45, stdev=126.15
    clat (usec): min=4, max=64106, avg=238.23, stdev=463.79
     lat (usec): min=42, max=64160, avg=288.68, stdev=487.34
    clat percentiles (usec):
     |  1.00th=[   28],  5.00th=[   30], 10.00th=[   31], 20.00th=[   37],
     | 30.00th=[   42], 40.00th=[   46], 50.00th=[   53], 60.00th=[   67],
     | 70.00th=[  359], 80.00th=[  441], 90.00th=[  652], 95.00th=[  873],
     | 99.00th=[ 1319], 99.50th=[ 1549], 99.90th=[ 2606], 99.95th=[ 4015],
     | 99.99th=[19006]
   bw (  MiB/s): min= 1292, max= 1866, per=100.00%, avg=1580.50, stdev=16.69, samples=1794
   iops        : min=10342, max=14926, avg=12643.38, stdev=133.57, samples=1794
  lat (usec)   : 10=0.01%, 20=0.01%, 50=39.71%, 100=23.11%, 250=2.37%
  lat (usec)   : 500=20.13%, 750=7.19%, 1000=4.36%
  lat (msec)   : 2=2.94%, 4=0.14%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.10%, sys=18.27%, ctx=1620915, majf=4, minf=169
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1893981,1894540,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=1578MiB/s (1655MB/s), 1578MiB/s-1578MiB/s (1655MB/s-1655MB/s), io=231GiB (248GB), run=150001-150001msec
  WRITE: bw=1579MiB/s (1655MB/s), 1579MiB/s-1579MiB/s (1655MB/s-1655MB/s), io=231GiB (248GB), run=150001-150001msec

Unprivileged LXC (Bind Mount)

root@Debian-test:/data# fio --name=test_random-rw --filename=/data/testfile --size=131072MiB --bs=128K --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting
test_random-rw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=2
...
fio-3.33
Starting 6 processes
test_random-rw: Laying out IO file (1 file / 125000MiB)
Jobs: 6 (f=6): [m(6)][100.0%][r=1592MiB/s,w=1584MiB/s][r=12.7k,w=12.7k IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=3345: Mon Aug  4 20:36:26 2025
  read: IOPS=13.0k, BW=1621MiB/s (1699MB/s)(237GiB/150001msec)
    slat (usec): min=14, max=71657, avg=404.52, stdev=586.79
    clat (usec): min=4, max=59979, avg=235.74, stdev=442.21
     lat (usec): min=38, max=71722, avg=640.26, stdev=763.75
    clat percentiles (usec):
     |  1.00th=[   30],  5.00th=[   35], 10.00th=[   39], 20.00th=[   43],
     | 30.00th=[   49], 40.00th=[   54], 50.00th=[   60], 60.00th=[   71],
     | 70.00th=[  347], 80.00th=[  441], 90.00th=[  644], 95.00th=[  857],
     | 99.00th=[ 1270], 99.50th=[ 1483], 99.90th=[ 2507], 99.95th=[ 4178],
     | 99.99th=[17171]
   bw (  MiB/s): min= 1394, max= 1876, per=100.00%, avg=1622.42, stdev=13.22, samples=1794
   iops        : min=11156, max=15010, avg=12978.68, stdev=105.84, samples=1794
  write: IOPS=13.0k, BW=1621MiB/s (1700MB/s)(237GiB/150001msec); 0 zone resets
    slat (usec): min=17, max=42662, avg=48.53, stdev=130.77
    clat (usec): min=18, max=71666, avg=232.59, stdev=484.24
     lat (usec): min=42, max=71710, avg=281.12, stdev=506.53
    clat percentiles (usec):
     |  1.00th=[   28],  5.00th=[   30], 10.00th=[   32], 20.00th=[   38],
     | 30.00th=[   42], 40.00th=[   46], 50.00th=[   52], 60.00th=[   63],
     | 70.00th=[  347], 80.00th=[  441], 90.00th=[  652], 95.00th=[  865],
     | 99.00th=[ 1287], 99.50th=[ 1516], 99.90th=[ 2573], 99.95th=[ 4228],
     | 99.99th=[19530]
   bw (  MiB/s): min= 1364, max= 1894, per=100.00%, avg=1622.95, stdev=15.91, samples=1794
   iops        : min=10912, max=15154, avg=12982.98, stdev=127.33, samples=1794
  lat (usec)   : 10=0.01%, 20=0.01%, 50=40.61%, 100=24.16%, 250=2.10%
  lat (usec)   : 500=18.11%, 750=7.71%, 1000=4.44%
  lat (msec)   : 2=2.69%, 4=0.12%, 10=0.03%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=2.23%, sys=18.64%, ctx=1550350, majf=1, minf=85
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1944921,1945461,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=1621MiB/s (1699MB/s), 1621MiB/s-1621MiB/s (1699MB/s-1699MB/s), io=237GiB (255GB), run=150001-150001msec
  WRITE: bw=1621MiB/s (1700MB/s), 1621MiB/s-1621MiB/s (1700MB/s-1700MB/s), io=237GiB (255GB), run=150001-150001msec

Unprivileged LXC (Container RootFS)

root@ElastiFlowVA:~# fio --name=test_random-rw --filename=./testfile --size=131072MiB --bs=128K --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting
test_random-rw: (g=0): rw=randrw, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=2
...
fio-3.36
Starting 6 processes
test_random-rw: Laying out IO file (1 file / 125000MiB)
Jobs: 6 (f=6): [m(6)][100.0%][r=186MiB/s,w=160MiB/s][r=1486,w=1279 IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=949: Mon Aug  4 19:05:56 2025
  read: IOPS=1468, BW=184MiB/s (192MB/s)(26.9GiB/150004msec)
    slat (usec): min=25, max=107596, avg=4004.25, stdev=2532.33
    clat (usec): min=5, max=107608, avg=2039.40, stdev=2637.31
     lat (usec): min=66, max=111596, avg=6043.65, stdev=3698.33
    clat percentiles (usec):
     |  1.00th=[   49],  5.00th=[   60], 10.00th=[   67], 20.00th=[   77],
     | 30.00th=[   88], 40.00th=[   97], 50.00th=[  117], 60.00th=[ 2343],
     | 70.00th=[ 3556], 80.00th=[ 4490], 90.00th=[ 5669], 95.00th=[ 6587],
     | 99.00th=[ 8848], 99.50th=[ 9765], 99.90th=[12649], 99.95th=[26084],
     | 99.99th=[42730]
   bw (  KiB/s): min=126208, max=271616, per=100.00%, avg=188050.30, stdev=4032.07, samples=1794
   iops        : min=  986, max= 2122, avg=1469.12, stdev=31.50, samples=1794
  write: IOPS=1467, BW=183MiB/s (192MB/s)(26.9GiB/150004msec); 0 zone resets
    slat (usec): min=28, max=30134, avg=65.97, stdev=186.15
    clat (usec): min=32, max=83930, avg=2057.78, stdev=2699.25
     lat (usec): min=70, max=83983, avg=2123.75, stdev=2705.81
    clat percentiles (usec):
     |  1.00th=[   44],  5.00th=[   46], 10.00th=[   48], 20.00th=[   52],
     | 30.00th=[   61], 40.00th=[   70], 50.00th=[   86], 60.00th=[ 2409],
     | 70.00th=[ 3621], 80.00th=[ 4555], 90.00th=[ 5735], 95.00th=[ 6652],
     | 99.00th=[ 8979], 99.50th=[ 9896], 99.90th=[15270], 99.95th=[32375],
     | 99.99th=[41681]
   bw (  KiB/s): min=99072, max=300800, per=100.00%, avg=188025.06, stdev=5664.41, samples=1794
   iops        : min=  774, max= 2350, avg=1468.92, stdev=44.25, samples=1794
  lat (usec)   : 10=0.01%, 20=0.01%, 50=8.96%, 100=38.89%, 250=7.89%
  lat (usec)   : 500=0.55%, 750=0.27%, 1000=0.18%
  lat (msec)   : 2=1.80%, 4=15.68%, 10=25.35%, 20=0.35%, 50=0.07%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=0.51%, sys=3.52%, ctx=198858, majf=0, minf=82
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=220208,220115,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=184MiB/s (192MB/s), 184MiB/s-184MiB/s (192MB/s-192MB/s), io=26.9GiB (28.9GB), run=150004-150004msec
  WRITE: bw=183MiB/s (192MB/s), 183MiB/s-183MiB/s (192MB/s-192MB/s), io=26.9GiB (28.8GB), run=150004-150004msec

That’s the configuration I had tested. I don’t know about the RootFS; I never tried that.

And your result there, of not quite 13k r/w IOPS even when running directly on ZFS, is what I mean when I say “terrible”. PCIe 3 consumer drives can get 95k/30k or even 115k/40k in an fio test on ext4 or xfs, one that stresses the drive with a 150 GB test size and a 75% read mix using 4k blocks. IOPS is just an approximation of what my DB actually cares about, which is latency.

And I don’t have a consumer drive; I have a data center drive in here. It should do great if it weren’t for ZFS.

I didn’t really expect ZFS to have gotten respectable with regard to latency on NVMe; that’s not what it does. Maybe “yet”; I’m curious to try again when DirectIO is available.

Commands I use for testing:

```sh
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=150G --readwrite=randrw --rwmixread=75; rm test
```

and, while the device is under stress:

```sh
sudo ioping -D -c 30 /dev/<ssd-device>
```

I haven’t recorded results of those on ZFS; I just tried the actual app.

I should also say: this is a test out of curiosity. I don’t mean to actually run the app in production like that. If I did, I could always create a VM with a passed-through physical drive and format the drive with xfs. It’s just to see how ZFS is coming along on NVMe.

To be fair, this is running on a lab box made from 10+ year old desktop parts:

  • Intel Core i7-5820K
  • DDR4-2400
  • NVMe drives behind a PLX 8747 switch-based SFF-8643 adapter (to compensate for no bifurcation on the motherboard)

Also, I was using the default 128K recordsize, and tuning the iodepth for best latency, not IOPS.

A couple of tweaks:

```sh
zfs set recordsize=4k "$TEST_DATASET"
zfs set compression=off "$TEST_DATASET"
```
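
(To double-check the dataset actually picked those up, something like this works, assuming $TEST_DATASET holds the dataset name:)

```sh
zfs get recordsize,compression "$TEST_DATASET"
```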

plus bumping up concurrency, at the price of latency, and it’ll do 46k r/w IOPS:

| Metric | iodepth=2 | iodepth=64 |
|---|---|---|
| BW (Read/Write) | 125 / 125 MiB/s | 182 / 182 MiB/s |
| IOPS (Read/Write) | 32k / 32k | 46.6k / 46.6k |
| Avg Latency (R/W) | 244 µs / 127 µs | 4138 µs / 4102 µs |
| Std Dev Latency (R/W) | 748 µs / 469 µs | 7278 µs / 7197 µs |
| 99% Latency (R/W) | 570 µs / 570 µs | 36963 µs / 36963 µs |
| CPU Usage (usr/sys) | 3.68% / 26.11% | 4.88% / 30.80% |
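
If you want the whole trade-off curve rather than just the two endpoints, a queue-depth sweep is easy to script. A sketch reusing my job parameters, with a shorter runtime per point:

```sh
# Sketch: sweep iodepth to map IOPS against latency on the same file
for qd in 1 2 4 8 16 32 64; do
  fio --name=qd_sweep --filename=./testfile --size=128G --bs=4K \
      --iodepth=$qd --rw=randrw --direct=1 --numjobs=6 --runtime=60 \
      --ioengine=libaio --time_based=1 --group_reporting \
      --output=qd_${qd}.log
done
```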

Raw fio results

IO depth = 2

root@TrueNAS02[/mnt/TANK_NVME/fio]# fio --name=test_random-rw --filename=./testfile --size=128G --bs=4K --iodepth=2 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting
test_random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=2
...
fio-3.33
Starting 6 processes
test_random-rw: Laying out IO file (1 file / 131072MiB)
Jobs: 6 (f=6): [m(6)][100.0%][r=136MiB/s,w=134MiB/s][r=34.8k,w=34.4k IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=144831: Tue Aug  5 13:10:42 2025
  read: IOPS=32.0k, BW=125MiB/s (131MB/s)(18.3GiB/150013msec)
    slat (usec): min=2, max=63718, avg=144.82, stdev=533.22
    clat (usec): min=5, max=66835, avg=99.40, stdev=439.35
     lat (usec): min=9, max=82544, avg=244.22, stdev=748.23
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   16], 10.00th=[   19], 20.00th=[   23],
     | 30.00th=[   25], 40.00th=[   29], 50.00th=[   33], 60.00th=[   63],
     | 70.00th=[  110], 80.00th=[  121], 90.00th=[  180], 95.00th=[  260],
     | 99.00th=[  570], 99.50th=[ 1037], 99.90th=[ 5604], 99.95th=[ 8586],
     | 99.99th=[19006]
   bw (  KiB/s): min=35320, max=184152, per=100.00%, avg=127865.13, stdev=5071.75, samples=1794
   iops        : min= 8829, max=46036, avg=31965.75, stdev=1267.90, samples=1794
  write: IOPS=32.0k, BW=125MiB/s (131MB/s)(18.3GiB/150013msec); 0 zone resets
    slat (usec): min=5, max=55324, avg=34.03, stdev=201.43
    clat (usec): min=5, max=61157, avg=93.31, stdev=407.77
     lat (usec): min=14, max=61320, avg=127.34, stdev=469.03
    clat percentiles (usec):
     |  1.00th=[   12],  5.00th=[   13], 10.00th=[   14], 20.00th=[   17],
     | 30.00th=[   20], 40.00th=[   23], 50.00th=[   27], 60.00th=[   50],
     | 70.00th=[  108], 80.00th=[  119], 90.00th=[  178], 95.00th=[  260],
     | 99.00th=[  570], 99.50th=[  938], 99.90th=[ 4883], 99.95th=[ 7767],
     | 99.99th=[17695]
   bw (  KiB/s): min=34872, max=187505, per=100.00%, avg=127888.75, stdev=5114.16, samples=1794
   iops        : min= 8716, max=46876, avg=31971.74, stdev=1278.51, samples=1794
  lat (usec)   : 10=0.25%, 20=21.02%, 50=37.78%, 100=4.38%, 250=31.24%
  lat (usec)   : 500=4.08%, 750=0.59%, 1000=0.18%
  lat (msec)   : 2=0.22%, 4=0.13%, 10=0.10%, 20=0.03%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=3.68%, sys=26.11%, ctx=4298190, majf=0, minf=81
  IO depths    : 1=0.1%, 2=100.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=4793517,4794559,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=2

Run status group 0 (all jobs):
   READ: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=18.3GiB (19.6GB), run=150013-150013msec
  WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=18.3GiB (19.6GB), run=150013-150013msec

IO depth = 64

root@TrueNAS02[/mnt/TANK_NVME/fio]# fio --name=test_random-rw --filename=./testfile --size=128G --bs=4K --iodepth=64 --rw=randrw --direct=1 --numjobs=6 --runtime=150 --ioengine=libaio --time_based=1 --group_reporting
test_random-rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.33
Starting 6 processes
Jobs: 6 (f=6): [m(6)][100.0%][r=112MiB/s,w=112MiB/s][r=28.6k,w=28.7k IOPS][eta 00m:00s]
test_random-rw: (groupid=0, jobs=6): err= 0: pid=145343: Tue Aug  5 13:15:54 2025
  read: IOPS=46.6k, BW=182MiB/s (191MB/s)(26.7GiB/150001msec)
    slat (usec): min=3, max=80735, avg=77.82, stdev=504.92
    clat (usec): min=2, max=288819, avg=4060.32, stdev=7155.52
     lat (usec): min=14, max=289236, avg=4138.14, stdev=7278.49
    clat percentiles (usec):
     |  1.00th=[   775],  5.00th=[   930], 10.00th=[  1123], 20.00th=[  1221],
     | 30.00th=[  1270], 40.00th=[  1352], 50.00th=[  1762], 60.00th=[  3195],
     | 70.00th=[  3720], 80.00th=[  5014], 90.00th=[  7242], 95.00th=[ 12125],
     | 99.00th=[ 36963], 99.50th=[ 51643], 99.90th=[ 85459], 99.95th=[101188],
     | 99.99th=[133694]
   bw (  KiB/s): min=21755, max=577220, per=100.00%, avg=186382.48, stdev=21207.80, samples=1788
   iops        : min= 5437, max=144304, avg=46594.71, stdev=5301.94, samples=1788
  write: IOPS=46.6k, BW=182MiB/s (191MB/s)(26.7GiB/150001msec); 0 zone resets
    slat (usec): min=5, max=114100, avg=41.26, stdev=428.89
    clat (usec): min=2, max=288952, avg=4060.56, stdev=7156.50
     lat (usec): min=12, max=289491, avg=4101.82, stdev=7196.63
    clat percentiles (usec):
     |  1.00th=[   775],  5.00th=[   930], 10.00th=[  1123], 20.00th=[  1221],
     | 30.00th=[  1270], 40.00th=[  1352], 50.00th=[  1762], 60.00th=[  3195],
     | 70.00th=[  3720], 80.00th=[  5014], 90.00th=[  7242], 95.00th=[ 12125],
     | 99.00th=[ 36963], 99.50th=[ 51643], 99.90th=[ 85459], 99.95th=[100140],
     | 99.99th=[129500]
   bw (  KiB/s): min=22032, max=575556, per=100.00%, avg=186375.50, stdev=21198.92, samples=1788
   iops        : min= 5506, max=143888, avg=46593.00, stdev=5299.70, samples=1788
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 250=0.01%
  lat (usec)   : 500=0.01%, 750=0.26%, 1000=5.74%
  lat (msec)   : 2=45.57%, 4=22.42%, 10=19.79%, 20=3.70%, 50=1.97%
  lat (msec)   : 100=0.49%, 250=0.05%, 500=0.01%
  cpu          : usr=4.88%, sys=30.80%, ctx=2705323, majf=0, minf=97
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=6986612,6986331,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=182MiB/s (191MB/s), 182MiB/s-182MiB/s (191MB/s-191MB/s), io=26.7GiB (28.6GB), run=150001-150001msec
  WRITE: bw=182MiB/s (191MB/s), 182MiB/s-182MiB/s (191MB/s-191MB/s), io=26.7GiB (28.6GB), run=150001-150001msec