Performance baselining & comparisons

I have recently built a server and installed TrueNAS SCALE Dragonfish-24.04.2 with the aim of using it as a NAS. The performance I am seeing is significantly lower than expected, but the biggest question I have is whether this is simply ZFS performing at that level because of its inherent overheads, and therefore expected and normal, or whether there is some underlying issue that I could solve.

I have run several fio tests, directly on the box as well as on mounted filesystems over SMB and NFS to test real-life sync/async performance over the network…

In every thread I have found with some kind of performance-related advice, the suggested fio commands all use different options (in some cases wildly so), which makes comparisons pointless.

Is there some kind of standardised testing that can be used to baseline and compare results for an installation that I have been missing?

I have set up a pool of 4x 2-wide mirrored vdevs using SATA SSDs (Samsung PM883) and a 2-wide mirror vdev of Optane P4800X for SLOG. This array is meant for storing VM images, so sync performance is a consideration. For the fio examples I have been running, the performance I am seeing is quite poor (best case just over 1 GB/sec at higher queue depths, as low as 120-ish MB/sec at low queue depths). These figures are measured directly on the box, not over the network.

If this aligns with other people’s experiences, that’s great and I can manage expectations, but I am having trouble finding a frame of reference and am therefore struggling to even confirm whether I have a problem in the first place.

I appreciate everyone’s setup and use cases are different, but I am looking for a baseline as a starting point. Thoughts welcome.

What exactly is the fio command you are using to benchmark? Please list your hardware and provide more information about your system: the output of zpool status, zpool list and zfs list would help.

Hi, thanks for your response. My question was primarily focused on identifying baselines to establish expected performance profiles but happy to provide the information. Please see below:

HW Specs:
MB: SuperMicro H12SSL-i
CPU: Epyc 7282 16C/32T
RAM: 256GB RDIMM 2666MHz ECC
PSU: Seasonic Vertex PX-1000
Boot: Mirror of 2x Micron 7450 Pro NVME 960GB
No VMs/Apps
SSD Pool: 8x Samsung PM883 3.8TB SATA (connected to motherboard onboard controller via slimsas cable)
SLOG for SSD Pool: 2x Intel Optane p4800x PCIE
NIC: 1x Intel x550-T2
HBA: Broadcom 7600-24i connected to backplane - no HDDs currently

I have run a number of instances of the following, each with --rw set to read, write, randread and randwrite, and with block sizes from 8k to 1M:

for i in 1 8 16 32; do fio --name=fiotest --directory=/mnt/ssd_pool/xxxx/ --ioengine=libaio --direct=1 --numjobs=2 --nrfiles=4 --runtime=30 --group_reporting --time_based --stonewall --size=4G --ramp_time=20 --bs=1M --rw=randwrite --iodepth=$i --fallocate=none --output=/home/admin/$(uname -n)-randwrite-$i; done

zpool status output:

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Fri Sep  6 03:45:06 2024
config:

        NAME           STATE     READ WRITE CKSUM
        boot-pool      ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme3n1p3  ONLINE       0     0     0
            nvme2n1p3  ONLINE       0     0     0

errors: No known data errors

  pool: ssd_pool
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        ssd_pool                                  ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            8b8d0458-6fa5-43ee-bc99-47f11b0f55ec  ONLINE       0     0     0
            391abe27-bd74-4d76-b3a1-6b1542d4008e  ONLINE       0     0     0
          mirror-1                                ONLINE       0     0     0
            1bf12c44-7571-4055-be8e-9274ab882f0f  ONLINE       0     0     0
            9c01776c-52b4-4753-89e7-ca59f78974a8  ONLINE       0     0     0
          mirror-2                                ONLINE       0     0     0
            32c20de8-d1af-40a6-a4ea-b5a10b849b6c  ONLINE       0     0     0
            57ce8664-c692-4845-97a0-43f74032b3d4  ONLINE       0     0     0
          mirror-3                                ONLINE       0     0     0
            642e86b6-d77b-4cf3-a2bb-4cb828a1e224  ONLINE       0     0     0
            a5b93e4c-c73d-49cf-a804-cc95fbfee47a  ONLINE       0     0     0
        logs
          mirror-4                                ONLINE       0     0     0
            53a1cbad-bb04-4939-8b89-b1126546b3a0  ONLINE       0     0     0
            b8be8f0b-9725-4362-9a75-5dcd9f850f08  ONLINE       0     0     0

errors: No known data errors

zpool list output:

NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool   872G  3.18G   869G        -         -     0%     0%  1.00x    ONLINE  -
ssd_pool   13.9T  7.78G  13.9T        -         -     0%     0%  1.00x    ONLINE  /mnt

zfs list output:

NAME                                                         USED  AVAIL  REFER  MOUNTPOINT
boot-pool                                                   3.18G   842G    96K  none
boot-pool/.system                                            560M   842G   112K  legacy
boot-pool/.system/configs-ae32c386e13840b2bf9c0083275e7941  1.34M   842G  1.34M  legacy
boot-pool/.system/cores                                       96K  1024M    96K  legacy
boot-pool/.system/netdata-ae32c386e13840b2bf9c0083275e7941   558M   842G   558M  legacy
boot-pool/.system/samba4                                     320K   842G   320K  legacy
boot-pool/ROOT                                              2.61G   842G    96K  none
boot-pool/ROOT/24.04.2                                      2.61G   842G   164M  legacy
boot-pool/ROOT/24.04.2/audit                                 328K   842G   328K  /audit
boot-pool/ROOT/24.04.2/conf                                  140K   842G   140K  /conf
boot-pool/ROOT/24.04.2/data                                  292K   842G   292K  /data
boot-pool/ROOT/24.04.2/etc                                  6.68M   842G  5.64M  /etc
boot-pool/ROOT/24.04.2/home                                 63.3M   842G  63.3M  /home
boot-pool/ROOT/24.04.2/mnt                                    96K   842G    96K  /mnt
boot-pool/ROOT/24.04.2/opt                                  74.1M   842G  74.1M  /opt
boot-pool/ROOT/24.04.2/root                                  188K   842G   188K  /root
boot-pool/ROOT/24.04.2/usr                                  2.12G   842G  2.12G  /usr
boot-pool/ROOT/24.04.2/var                                  49.7M   842G  32.4M  /var
boot-pool/ROOT/24.04.2/var/ca-certificates                    96K   842G    96K  /var/local/ca-certificates
boot-pool/ROOT/24.04.2/var/log                              16.3M   842G  16.3M  /var/log
boot-pool/grub                                              8.23M   842G  8.23M  legacy
ssd_pool                                                    7.78G  13.8T   240K  /mnt/ssd_pool
ssd_pool/xxxxxxx                                             192K  13.8T   192K  /mnt/ssd_pool/xxxxxxx
ssd_pool/vm_ssd                                             7.77G  13.8T  7.77G  /mnt/ssd_pool/vm_ssd

I assume ssd_pool/xxxxxxx has sync=always set in the dataset’s properties; can you confirm?
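
For reference, you can check and change this per dataset from the shell (using the dataset name from your zfs list output):

zfs get sync ssd_pool/xxxxxxx
zfs set sync=always ssd_pool/xxxxxxx

The same setting should also be available under the dataset’s advanced options in the web UI.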

If you run fio --name=TEST --directory=/mnt/ssd_pool/xxxx/ --eta-newline=5s --ioengine=posixaio --rw=write --size=300g --io_size=650g --blocksize=128k --iodepth=16 --numjobs=8 --runtime=120 --group_reporting, what result do you get?

EDIT: Added results with the Optane SLOG and without

Results for both:

In async this performs much better, as expected, with 8 jobs running, 128k block size and iodepth 16. It is hitting 2 GB/sec, which is what I would expect for this layout:

TEST: (groupid=0, jobs=8): err= 0: pid=785181: Sat Sep  7 22:08:48 2024
  write: IOPS=16.0k, BW=2002MiB/s (2100MB/s)(235GiB/120008msec); 0 zone resets
    slat (nsec): min=1000, max=15458k, avg=4962.07, stdev=15026.34
    clat (usec): min=109, max=39507, avg=7971.02, stdev=1088.27
     lat (usec): min=117, max=39510, avg=7975.99, stdev=1088.08
    clat percentiles (usec):
     |  1.00th=[ 2089],  5.00th=[ 7767], 10.00th=[ 7832], 20.00th=[ 7898],
     | 30.00th=[ 7898], 40.00th=[ 7963], 50.00th=[ 7963], 60.00th=[ 7963],
     | 70.00th=[ 7963], 80.00th=[ 8029], 90.00th=[ 8291], 95.00th=[ 8586],
     | 99.00th=[ 9634], 99.50th=[10683], 99.90th=[22152], 99.95th=[25560],
     | 99.99th=[28967]
   bw (  MiB/s): min= 1806, max= 7010, per=100.00%, avg=2003.37, stdev=41.00, samples=1912
   iops        : min=14448, max=56082, avg=16026.95, stdev=328.00, samples=1912
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.10%, 1000=0.40%
  lat (msec)   : 2=0.48%, 4=0.24%, 10=98.12%, 20=0.53%, 50=0.13%
  cpu          : usr=1.70%, sys=0.47%, ctx=969751, majf=17, minf=182
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.1%, 16=49.8%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.7%, 8=1.9%, 16=2.4%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1922487,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=2002MiB/s (2100MB/s), 2002MiB/s-2002MiB/s (2100MB/s-2100MB/s), io=235GiB (252GB), run=120008-120008msec

With sync set to always it’s getting to 700-ish MB/sec with the Optane SLOG:

TEST: (groupid=0, jobs=8): err= 0: pid=787926: Sat Sep  7 22:18:58 2024
  write: IOPS=5682, BW=710MiB/s (745MB/s)(83.2GiB/120016msec); 0 zone resets
    slat (nsec): min=1641, max=1679.0k, avg=5267.34, stdev=5489.67
    clat (usec): min=4472, max=78257, avg=22499.75, stdev=5464.15
     lat (usec): min=4479, max=78262, avg=22505.02, stdev=5464.25
    clat percentiles (usec):
     |  1.00th=[12518],  5.00th=[14091], 10.00th=[16909], 20.00th=[20841],
     | 30.00th=[22152], 40.00th=[22676], 50.00th=[22938], 60.00th=[23200],
     | 70.00th=[23462], 80.00th=[23987], 90.00th=[24249], 95.00th=[24773],
     | 99.00th=[49546], 99.50th=[62653], 99.90th=[72877], 99.95th=[73925],
     | 99.99th=[74974]
   bw (  KiB/s): min=622592, max=906496, per=100.00%, avg=727323.94, stdev=6124.81, samples=1912
   iops        : min= 4864, max= 7082, avg=5682.17, stdev=47.85, samples=1912
  lat (msec)   : 10=0.01%, 20=17.28%, 50=81.78%, 100=0.93%
  cpu          : usr=0.71%, sys=0.24%, ctx=343281, majf=0, minf=182
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.0%, 16=50.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.8%, 8=1.7%, 16=2.5%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,681974,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=710MiB/s (745MB/s), 710MiB/s-710MiB/s (745MB/s-745MB/s), io=83.2GiB (89.4GB), run=120016-120016msec

With sync set to always it’s getting to 300-ish MB/sec without an SLOG:

TEST: (groupid=0, jobs=8): err= 0: pid=797163: Sat Sep  7 22:50:22 2024
  write: IOPS=2333, BW=292MiB/s (306MB/s)(34.2GiB/120038msec); 0 zone resets
    slat (nsec): min=1400, max=151983, avg=7100.93, stdev=3644.86
    clat (usec): min=32149, max=75803, avg=54819.66, stdev=1893.37
     lat (usec): min=32151, max=75807, avg=54826.76, stdev=1893.41
    clat percentiles (usec):
     |  1.00th=[45876],  5.00th=[51643], 10.00th=[52167], 20.00th=[54789],
     | 30.00th=[54789], 40.00th=[55313], 50.00th=[55313], 60.00th=[55313],
     | 70.00th=[55313], 80.00th=[55837], 90.00th=[55837], 95.00th=[55837],
     | 99.00th=[58459], 99.50th=[58983], 99.90th=[60556], 99.95th=[62653],
     | 99.99th=[65274]
   bw (  KiB/s): min=276480, max=360448, per=100.00%, avg=298751.27, stdev=1195.38, samples=1912
   iops        : min= 2160, max= 2816, avg=2333.99, stdev= 9.34, samples=1912
  lat (msec)   : 50=2.59%, 100=97.41%
  cpu          : usr=0.39%, sys=0.10%, ctx=140067, majf=0, minf=174
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=50.0%, 16=50.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.8%, 8=0.1%, 16=4.2%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,280064,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=292MiB/s (306MB/s), 292MiB/s-292MiB/s (306MB/s-306MB/s), io=34.2GiB (36.7GB), run=120038-120038msec

Those seem reasonable results to me. You may want to take a look at SLOG benchmarking and finding the best SLOG | TrueNAS Community (both resource and discussion thread) if you have not already.

Thank you. Really appreciate you taking the time to answer my questions.

I am running TN SCALE. The closest thing in Linux land that I am aware of that talks NVMe is nvme-cli, but to the best of my knowledge it has no equivalent of the diskinfo -wS benchmarking feature.
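
The nearest substitute I can think of is pointing fio at the raw device with synchronous 4k writes, roughly mimicking what diskinfo -wS measures. A sketch (the device path is just an example, and this writes directly to the disk, so only run it against a drive with no data on it):

fio --name=slog-test --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based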

Benchmarking under fio, using exactly the same parameters as above for consistency, I am getting over 2.2 GB/sec from the Optane as a standalone vdev, which is in line with expectations.

Coming into this new, some of the performance figures (for sync IO in particular) are quite a bit lower than I was expecting, or rather hoping. Having said that, if they are consistent with what others are getting from roughly similar setups, I guess it is what it is. At least I know it’s not a hardware issue or misconfiguration holding me back.

I still think some kind of table with a few example layouts and “typical performance” figures for given fio parameters would help people align and manage expectations and hopefully even reduce/stop silly threads such as this :slight_smile:

You could easily spin up a CORE install for benchmarking purposes.

This thread is not silly, and you have given us a really good idea… it’s something we as a community could work to put together.

Which were your expectations?

This is a bare-metal installation, unfortunately, so I would have to reinstall to do that.

I have a basic understanding of how ZFS works and I definitely expected a penalty, especially for synchronous IO. What there doesn’t seem to be much literature about is just how large a penalty to expect in objective terms. There is a lot of “severe impact, reduction, penalty” wording out there, but that means different things to different people.

For a pool comprising four 2-wide mirror vdevs of SSDs I expected to see perhaps a 50-60% penalty. I invested in enterprise-grade hardware for this reason, to mitigate some of the impact and get more consistency.

If you take the Optane SLOG out, sync performance is a bit below 15% of the async figure, so roughly an 85% penalty. That’s beyond what I expected for an SSD-only pool. This is on a completely empty pool as well, with TRIMed SSDs, so ideal conditions as far as ZFS goes. Even with the SLOG, I’m still only seeing about 35% of the async performance.
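
For clarity, working from the fio runs above: 292 MiB/s / 2002 MiB/s ≈ 14.6% (roughly an 85% penalty) without the SLOG, and 710 MiB/s / 2002 MiB/s ≈ 35.5% with it.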

I appreciate what we get in return but IMHO it is good to know what you can expect in advance, which can help you establish if the trade-off is acceptable to you and your use case, hence the suggestion. Standardised testing against set layouts should help set at least rough expectations.

I get that this won’t stop some people from throwing four laptop HDDs from eBay in USB2 caddies and an “80TB” consumer SSD they got for $40 on Temu onto a Raspberry Pi 4 and expecting to saturate 100Gbit links, but it should help the people willing to read up.

For what it’s worth please let me know if this gets any traction. More than happy to help if I can.

Actually, it depends on what you mean by a performance hit.

In throughput terms, i.e. the maximum sustained write speed (for massive amounts of data far larger than ARC, where writes are limited by getting the data out to disk), you would expect a hit of about 50% without an SLOG: sync data is first written to the ZIL and then written again to the data pool proper, so it takes two writes to disk vs. a single write to disk for an asynchronous write.

But in response time terms, i.e. the time taken to write a single I/O or a relatively small amount of data that can be held in ARC, the difference is between writing to disk and writing to RAM, so it is not surprising that the response time hit is so large. In essence, synchronous writes are limited by the sum of network and disk latency, while asynchronous writes are limited only by network latency.
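
As a rough cross-check against the fio outputs above (a back-of-the-envelope approximation, not an exact model): with 8 jobs at iodepth 16 and 128 KiB blocks there is about 16 MiB in flight, so throughput is roughly 16 MiB divided by the average completion latency.

16 MiB / 8.0 ms  ≈ 2000 MiB/s  (async)
16 MiB / 22.5 ms ≈  710 MiB/s  (sync, Optane SLOG)
16 MiB / 54.8 ms ≈  292 MiB/s  (sync, no SLOG)

That these line up with the measured bandwidths almost exactly suggests the sync penalty here is a latency penalty rather than the disks running out of throughput.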

This is why you need to understand the technical differences between synchronous and asynchronous writes, so that if you are thinking about forcing asynchronous writes you understand the potential impact: loss of data if you have an O/S crash or power outage.

  • Moving files from Windows using SMB is always asynchronous, but moving files from Linux over NFS is synchronous by default. But if the Windows async file move is considered OK (despite potential loss of data i.e. if you move a file and it is deleted on Windows before it is written on TrueNAS), then why shouldn’t forcing async for files moved from Linux over NFS be OK?
  • If the data is a zvol used for a VM, what is the impact? Is it any different from data that hasn’t been synced when the VM O/S is running natively?
  • BUT, if the data is e.g. database transactions, then perhaps sync is essential.

My advice (in general - but every situation is unique) will be:

  • If write performance isn’t critical, leave everything at defaults.
  • If write performance is critical, but data loss on rare O/S crash is OK, then force async.
  • If data loss is completely unacceptable, then stick with sync and deal with the performance issues in other ways (the corresponding dataset settings are sketched below).
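
As a sketch of how those options map onto ZFS dataset settings (the dataset name here is hypothetical; the same property is exposed in the web UI):

zfs set sync=standard tank/mydata    # default: honour whatever the client or application requests
zfs set sync=disabled tank/mydata    # force async: fastest, but the last few seconds of writes can be lost on a crash or power cut
zfs set sync=always tank/mydata      # force sync: every write is committed to the ZIL/SLOG before being acknowledged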
Thank you for confirming. This is what I also had in mind RE: throughput.

Appreciate the point around the response time as well. To mitigate latency for sync writes I went with the Optane option for the SLOG, as an alternative to dealing with something like the Radian offerings. I also added as much RAM as I could manage to further mitigate this, so I have certainly considered these factors. It’s just difficult to appreciate just how much of a difference this presents in practice.

My use case calls for both sync and async workloads. I have some SMB shares dealing with large files - these are async, as you say, and I can confirm I am getting the expected performance there, no problems. Some of these will involve very large sequential writes / archival-type scenarios and will be moved to an HDD-based pool in the future.

I also have datasets where data loss is not acceptable and require sync IO as a result. The defaults work just fine for me which is where I have left them. I never had any intention of forcing async there.

Unless I have misunderstood what you are saying, my point is that, as per your example, in throughput terms you mention a 50% penalty as the worst case (data > ARC) without an SLOG due to the write amplification, but the empirical testing is showing an 85% penalty, and that under ideal conditions, which is why I was saying the performance was less than I anticipated.

Hope this explains.

Nay, from my understanding he’s saying at least 50%, due to doubling the write operations: how this impacts performance is not linear and, as you have experienced, not well documented.

Depending on (lots of) things, I would expect sync writes to impact performance by 60-90%.
The challenge with sync writes is that the hardware needs to perform well under mixed operations (simultaneous reads and writes): few SSDs are good at this, and most of those that are, are Optanes.

Gotcha! Thanks for clarifying.

Why reads?

Because as data is being piled up, it’s also read.

SLOGs are never read from (except in an unexpected reboot situation)

Whoops, did the late hour get me? :scream: I was convinced mixed operations were a factor somehow.

I think I got it from Optanes being great at mixed operations, and got the concept backwards! Thanks @Stux for pointing this out.

I’ve seen this mentioned a lot on these forums, but it’s not actually true. In my experience most Linux clients, Ubuntu for example, will be async by default unless you tell the client otherwise.

Just don’t want people getting confused.
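
For what it’s worth, if you do want synchronous behaviour from the Linux client side, you have to request it explicitly at mount time, for example (server name and paths here are placeholders):

mount -t nfs -o sync truenas.local:/mnt/ssd_pool/share /mnt/nfs_test

Without the sync mount option the client buffers writes and commits them on close or fsync, and individual applications can still force sync per file by opening with O_SYNC or calling fsync().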

Quoting for emphasis: This is absolutely expected performance. Sync writes are an order of magnitude slower than async; SLOG makes sync writes suck less but still nowhere near async.

Understood - the order of magnitude bit was what I was looking for from the beginning. I had no frame of reference.

Great to have this resource available to ask these questions, or I would be perpetually sitting there wondering whether this is normal or whether there is a misconfiguration somewhere. Thank you.