Poor write performance with Intel DCPMM Memory for SLOG

Hi all. I’m building a new server to replace my trusty Dell T620.

The new server runs ElectricEel-24.10-RC.1 on the following hardware:

  • Supermicro X11SPW-TF LGA 3647 Socket P Motherboard
  • Intel Xeon Gold 6230 2.1GHz 27.5MB 20-Core 125W
  • 4 x 64GB 4DRX4 PC4-2666V LRDIMM Samsung
  • 2 x Intel Optane 128GB DDR4 PC4-2666 288p DCPMM Persistent Memory NMA1XXD128GPS
  • LSI SAS9300-8i HBA
  • LSI SAS9300-8e HBA

Am I wrong in thinking that Intel DCPMM should give a SLOG near-RAM write speeds?

I have a test pool made up of 7 x 2-way SAS SSD mirrors:

root@truenas[/mnt/flash]# zpool status flash
  pool: flash
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        flash                                     ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            dfb03ef0-460e-487c-947e-b98912d2dfab  ONLINE       0     0     0
            fdd03cb4-1337-4c7c-8aff-5069c8a556a0  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            565b77a8-7225-4213-9591-07cd7e5ca576  ONLINE       0     0     0
            8342d232-3534-4118-9b6c-6040ab3f979b  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            fd50dd19-1339-43b9-bc58-e6cbc66fcdac  ONLINE       0     0     0
            6860f72f-5257-4b06-bdc3-b6455f1d5366  ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            99cfae17-c447-4f08-832d-687e508ad740  ONLINE       0     0     0
            5a08d706-36cc-402f-9ea2-c65c14ee5139  ONLINE       0     0     0
          mirror-4                                ONLINE       0     0     0
            34aaef9a-f31d-446a-ae67-a55e1beabef9  ONLINE       0     0     0
            b12d1d67-da9b-4258-a293-ef6104940927  ONLINE       0     0     0
          mirror-5                                ONLINE       0     0     0
            b7cd53dc-4453-4dfe-83ea-0e0249734e1c  ONLINE       0     0     0
            7536085a-7243-4f43-902a-58406403d872  ONLINE       0     0     0
          mirror-6                                ONLINE       0     0     0
            abf484d8-d5d2-4ba6-97d7-26aa99589e3d  ONLINE       0     0     0
            f999e01f-1a6f-4b3e-b49d-8af9cfe142a1  ONLINE       0     0     0
        logs
          mirror-10                               ONLINE       0     0     0
            88103c20-9249-446e-8d64-2857d761bf3c  ONLINE       0     0     0
            6c532345-1d3f-4926-8ab2-8d4de414a52b  ONLINE       0     0     0
        spares
          b5b1ab22-b4c0-4b84-ba43-d41aed0c35e3    AVAIL

Let’s get a baseline of the pool with sync off:
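(For reference, “sync off” and “sync on” below mean the sync property on the dataset holding the test files, toggled roughly like this; flash/test stands in for the actual dataset path.)

zfs set sync=disabled flash/test   # baseline: sync requests bypass the ZIL
zfs set sync=always flash/test     # every write is forced through the ZIL/SLOG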

root@truenas[/mnt/flash/test]# fio --name=write --rw=write -direct=1 --ioengine=libaio --bs=4k --numjobs=16 --size=32G --runtime=600 --group_reporting
write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 16 processes
Jobs: 5 (f=2): [_(6),W(1),f(1),_(3),f(1),W(1),_(1),f(1),_(1)][100.0%][w=3844MiB/s][w=984k IOPS][eta 00m:00s]
write: (groupid=0, jobs=16): err= 0: pid=13929: Fri Oct  4 05:42:19 2024
  write: IOPS=1088k, BW=4250MiB/s (4456MB/s)(512GiB/123369msec); 0 zone resets
    slat (usec): min=2, max=29640, avg=13.42, stdev=43.49
    clat (nsec): min=330, max=21491k, avg=738.77, stdev=6810.97
     lat (usec): min=2, max=29645, avg=14.16, stdev=44.14
    clat percentiles (nsec):
     |  1.00th=[  394],  5.00th=[  426], 10.00th=[  454], 20.00th=[  470],
     | 30.00th=[  490], 40.00th=[  532], 50.00th=[  708], 60.00th=[  836],
     | 70.00th=[  908], 80.00th=[  980], 90.00th=[ 1080], 95.00th=[ 1144],
     | 99.00th=[ 1304], 99.50th=[ 1384], 99.90th=[ 1624], 99.95th=[ 3952],
     | 99.99th=[13632]
   bw (  MiB/s): min= 3473, max= 5333, per=100.00%, avg=4262.05, stdev=19.91, samples=3912
   iops        : min=889155, max=1365259, avg=1091084.89, stdev=5096.65, samples=3912
  lat (nsec)   : 500=34.49%, 750=18.05%, 1000=30.32%
  lat (usec)   : 2=17.07%, 4=0.02%, 10=0.03%, 20=0.02%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=7.89%, sys=89.95%, ctx=93267, majf=0, minf=179
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,134217728,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=4250MiB/s (4456MB/s), 4250MiB/s-4250MiB/s (4456MB/s-4456MB/s), io=512GiB (550GB), run=123369-123369msec

Now with sync on:

root@truenas[/mnt/flash/test]# fio --name=write --rw=write -direct=1 --ioengine=libaio --bs=4k --numjobs=16 --size=32G --runtime=600 --group_reporting
write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 16 processes
Jobs: 16 (f=16): [W(16)][100.0%][w=200MiB/s][w=51.2k IOPS][eta 00m:00s]
write: (groupid=0, jobs=16): err= 0: pid=15747: Fri Oct  4 05:55:35 2024
  write: IOPS=51.9k, BW=203MiB/s (213MB/s)(119GiB/600002msec); 0 zone resets
    slat (usec): min=80, max=96997, avg=301.53, stdev=608.30
    clat (nsec): min=776, max=40122k, avg=3883.38, stdev=8117.72
     lat (usec): min=81, max=97007, avg=305.41, stdev=608.44
    clat percentiles (nsec):
     |  1.00th=[ 1880],  5.00th=[ 2992], 10.00th=[ 3056], 20.00th=[ 3184],
     | 30.00th=[ 3312], 40.00th=[ 3536], 50.00th=[ 3792], 60.00th=[ 4016],
     | 70.00th=[ 4256], 80.00th=[ 4512], 90.00th=[ 4768], 95.00th=[ 4960],
     | 99.00th=[ 5856], 99.50th=[ 7200], 99.90th=[15936], 99.95th=[19840],
     | 99.99th=[29568]
   bw (  KiB/s): min=164940, max=303768, per=100.00%, avg=207883.21, stdev=781.43, samples=19184
   iops        : min=41235, max=75942, avg=51969.94, stdev=195.34, samples=19184
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=1.10%, 4=58.00%, 10=40.63%, 20=0.22%, 50=0.05%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 50=0.01%
  cpu          : usr=2.29%, sys=29.88%, ctx=59251534, majf=0, minf=164
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,31167037,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=119GiB (128GB), run=600002-600002msec

Sync off: 1088k IOPS vs. sync on: 51.9k IOPS, roughly a 21x drop.

Can anyone help me understand why there is such a performance drop when using the PMEM as a SLOG? It has to be a config issue.
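In case it helps with diagnosis, the log mirror can be watched live during the sync run to confirm the writes really are landing on the PMEM log vdev:

zpool iostat -v flash 1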

Next, the PMEM modules on their own, first set up as a single 2-way mirror with sync off.
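For completeness, a rough sketch of how such a pool would be created from the shell (the /dev/pmem0 and /dev/pmem1 device names are my assumption for app-direct namespaces in fsdax mode):

zpool create pmem mirror /dev/pmem0 /dev/pmem1
# later destroyed and recreated as a plain 2-wide stripe for the second test:
# zpool create pmem /dev/pmem0 /dev/pmem1

The fio results for the mirror: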

root@truenas[/mnt/pmem/test]# fio --name=write --rw=write -direct=1 --ioengine=libaio --bs=4k --numjobs=16 --size=32G --runtime=600 --group_reporting
write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 16 processes
Jobs: 5 (f=4): [_(2),W(2),_(3),f(1),W(1),_(1),W(1),_(5)][98.5%][w=2435MiB/s][w=623k IOPS][eta 00m:03s]
write: (groupid=0, jobs=16): err= 0: pid=19381: Fri Oct  4 06:04:15 2024
  write: IOPS=687k, BW=2684MiB/s (2814MB/s)(512GiB/195331msec); 0 zone resets
    slat (usec): min=2, max=75317, avg=21.84, stdev=114.52
    clat (nsec): min=344, max=35528k, avg=765.58, stdev=9771.59
     lat (usec): min=2, max=75361, avg=22.60, stdev=115.47
    clat percentiles (nsec):
     |  1.00th=[  406],  5.00th=[  438], 10.00th=[  466], 20.00th=[  486],
     | 30.00th=[  524], 40.00th=[  548], 50.00th=[  588], 60.00th=[  732],
     | 70.00th=[  836], 80.00th=[  908], 90.00th=[ 1004], 95.00th=[ 1096],
     | 99.00th=[ 2736], 99.50th=[ 4256], 99.90th=[10432], 99.95th=[14784],
     | 99.99th=[47360]
   bw (  MiB/s): min= 1959, max= 5115, per=100.00%, avg=2701.48, stdev=23.47, samples=6173
   iops        : min=501516, max=1309535, avg=691576.89, stdev=6009.59, samples=6173
  lat (nsec)   : 500=24.44%, 750=36.69%, 1000=28.48%
  lat (usec)   : 2=8.88%, 4=0.95%, 10=0.44%, 20=0.08%, 50=0.02%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=5.06%, sys=89.37%, ctx=125523, majf=0, minf=181
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,134217728,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2684MiB/s (2814MB/s), 2684MiB/s-2684MiB/s (2814MB/s-2814MB/s), io=512GiB (550GB), run=195331-195331msec

And the same two PMEM devices set up as a 2-wide stripe, sync off:

root@truenas[/mnt/pmem/test]# fio --name=write --rw=write -direct=1 --ioengine=libaio --bs=4k --numjobs=16 --size=32G --runtime=600 --group_reporting
write: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
...
fio-3.33
Starting 16 processes
Jobs: 6 (f=4): [f(1),_(2),W(1),_(2),f(1),W(2),_(5),W(1),_(1)][99.4%][w=3198MiB/s][w=819k IOPS][eta 00m:01s]
write: (groupid=0, jobs=16): err= 0: pid=21751: Fri Oct  4 06:10:32 2024
  write: IOPS=861k, BW=3362MiB/s (3526MB/s)(512GiB/155931msec); 0 zone resets
    slat (usec): min=2, max=68208, avg=17.25, stdev=139.49
    clat (nsec): min=352, max=33229k, avg=742.20, stdev=9781.68
     lat (usec): min=3, max=68219, avg=17.99, stdev=140.40
    clat percentiles (nsec):
     |  1.00th=[  442],  5.00th=[  478], 10.00th=[  498], 20.00th=[  524],
     | 30.00th=[  540], 40.00th=[  556], 50.00th=[  596], 60.00th=[  716],
     | 70.00th=[  828], 80.00th=[  900], 90.00th=[  988], 95.00th=[ 1064],
     | 99.00th=[ 1400], 99.50th=[ 2352], 99.90th=[ 7840], 99.95th=[12480],
     | 99.99th=[36608]
   bw (  MiB/s): min= 2702, max= 4681, per=100.00%, avg=3377.29, stdev=17.98, samples=4940
   iops        : min=691777, max=1198349, avg=864584.21, stdev=4603.01, samples=4940
  lat (nsec)   : 500=10.05%, 750=52.11%, 1000=28.93%
  lat (usec)   : 2=8.29%, 4=0.38%, 10=0.17%, 20=0.05%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=6.31%, sys=84.74%, ctx=125002, majf=0, minf=176
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,134217728,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3362MiB/s (3526MB/s), 3362MiB/s-3362MiB/s (3526MB/s-3526MB/s), io=512GiB (550GB), run=155931-155931msec

It’s clear that the Intel PMEM modules can perform well, but as a SLOG they only perform at a fraction of their capability, which makes me think the default tuning parameters are too conservative.
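For context, the current values of the parameters I suspect (just my guess as to which ones are relevant) can be dumped with:

grep -H . /sys/module/zfs/parameters/zfs_immediate_write_sz \
          /sys/module/zfs/parameters/zil_slog_bulk \
          /sys/module/zfs/parameters/zfs_dirty_data_max_percent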

If I try to change any of the ZFS tunables I get a permission-denied error:

root@truenas[~]# echo 25 >> /sys/module/zfs/parameters/zfs_dirty_data_max_percent
zsh: permission denied: /sys/module/zfs/parameters/zfs_dirty_data_max_percent
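One thing I should probably rule out (my assumption, not verified): some ZFS module parameters are exposed read-only at runtime, in which case the write would be refused no matter what. The file mode should show whether this particular one is writable at all:

ls -l /sys/module/zfs/parameters/zfs_dirty_data_max_percent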

Any help would be appreciated. How can I change these tunables in TrueNAS SCALE?

Thanks,
Simon