Unsure of performance of SSD setup with external disk enclosure

Hello everyone,

I’m looking into improving the performance of my TrueNAS SCALE system and would appreciate any insights from the community.

Hardware Setup

  • TrueNAS Host: HPE ProLiant DL380 Gen9 V4 w/ 2x Xeon E5-2699v4 22-Core 2.20 GHz, 128 GB ECC RAM, running TrueNAS 25.04.1.
  • HBA: Broadcom/LSI SAS 9300-8e HBA in IT mode.
  • External Enclosure: HPE D3710, connected via a single SFF-8644 SAS-3 cable (since TrueNAS doesn’t support multipath).
  • Disks: 14x Intel D3-S4520, 11x Samsung PM863; both models SATA SSDs (6G only, rated for roughly 450 MB/s sequential read/write).
  • ZFS Pool: 11 mirror vdevs (2 disks per vdev), combined into one pool, 2 spares. LZ4 compression, no dedup. No special vdevs for now.

Locally, I ran fio --name=write-test --directory=/mnt/Pool1/Dataset1 --rw=write --bs=1M --size=20G --numjobs=1 --iodepth=64 --direct=1 --ioengine=libaio --group_reporting and got WRITE: bw=857MiB/s (899MB/s). This was fairly consistent across multiple runs. I played around with bs, numjobs and the ioengine, but the throughput stayed pretty much the same (give or take a few MB/s).

Is that a reasonable speed for this setup? With 11 vdevs, I personally would have expected/hoped for something closer to 1-2 GB/s.

How would I go about debugging this (if it is not reasonable) or improving it (if it is)?
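
For context, something like this should show how the writes are spread across the vdevs and whether any single disk is pegged while the test runs (just a sketch, using the pool name above; iostat comes from the sysstat package):

# Per-vdev / per-disk throughput, refreshed every second, while fio is running
zpool iostat -v Pool1 1

# Per-device utilisation and latency from the OS side
iostat -x 1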

Thanks
xrm

Sounds normal. See my post in another thread. You would have to sort out where you are bottlenecked.

You need multiple threads writing - try --numjobs=10.

Well, I did try changing to more jobs, but that didn’t change the throughput, sadly:

root@storage01[~]# fio --name=write-test --directory=/mnt/Pool1/Dataset1 --rw=write --bs=1M --size=20G --numjobs=10 --iodepth=64 --direct=1 --ioengine=libaio --group_reporting
write-test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
...
fio-3.33
Starting 10 processes
write-test: Laying out IO file (1 file / 20480MiB)
write-test: Laying out IO file (1 file / 20480MiB)
write-test: Laying out IO file (1 file / 20480MiB)
write-test: Laying out IO file (1 file / 20480MiB)
write-test: Laying out IO file (1 file / 20480MiB)
write-test: Laying out IO file (1 file / 20480MiB)
Jobs: 10 (f=10): [W(10)][99.6%][w=830MiB/s][w=830 IOPS][eta 00m:01s]
write-test: (groupid=0, jobs=10): err= 0: pid=58976: Sun Jun 29 03:51:56 2025
  write: IOPS=838, BW=839MiB/s (879MB/s)(200GiB/244172msec); 0 zone resets
    slat (usec): min=428, max=61031, avg=11904.08, stdev=2744.52
    clat (usec): min=13, max=997987, avg=750696.62, stdev=98354.82
     lat (msec): min=2, max=1010, avg=762.60, stdev=99.75
    clat percentiles (msec):
     |  1.00th=[   86],  5.00th=[  701], 10.00th=[  718], 20.00th=[  735],
     | 30.00th=[  743], 40.00th=[  751], 50.00th=[  760], 60.00th=[  768],
     | 70.00th=[  776], 80.00th=[  785], 90.00th=[  810], 95.00th=[  827],
     | 99.00th=[  919], 99.50th=[  944], 99.90th=[  969], 99.95th=[  986],
     | 99.99th=[  995]
   bw (  KiB/s): min=573440, max=5916905, per=99.82%, avg=857308.86, stdev=23800.06, samples=4870
   iops        : min=  560, max= 5778, avg=837.04, stdev=23.24, samples=4870
  lat (usec)   : 20=0.01%, 50=0.01%
  lat (msec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.04%, 100=1.06%
  lat (msec)   : 250=0.56%, 500=0.37%, 750=38.49%, 1000=59.45%
  cpu          : usr=0.87%, sys=6.87%, ctx=207319, majf=0, minf=24171
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,204800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=839MiB/s (879MB/s), 839MiB/s-839MiB/s (879MB/s-879MB/s), io=200GiB (215GB), run=244172-244172msec

@SmallBarky: Hm, interesting. But that’s about what I’d expect:

 1x 4TB, single drive,          3.7 TB,  w=108MB/s , rw=50MB/s  , r=204MB/s 
24x 4TB, 12 striped mirrors,   45.2 TB,  w=696MB/s , rw=144MB/s , r=898MB/s

Not quite 12x, but still about 6x. I, on the other hand, seem to get a mere 1.5x speedup. Their backplane is connected with 8 lanes, admittedly only at 6G, so I would expect the scaling of my setup (4 lanes at 12G) to be roughly on par with theirs?

Any idea where to look for the bottleneck? I’m somewhat stumped, having verified that the PCIe link checks out and that all the components at least seem to be fully functional.

Maybe look at the Solid State (Pure SSD) RAID section? The 24x data was closest to what you were expecting:

1x 256GB  a single drive  232 gigabytes ( w= 441MB/s , rw=224MB/s , r= 506MB/s )
2x 256GB  raid0 striped   464 gigabytes ( w= 933MB/s , rw=457MB/s , r=1020MB/s )
2x 256GB  raid1 mirror    232 gigabytes ( w= 430MB/s , rw=300MB/s , r= 990MB/s )
24x 256GB raid0 striped   5.5 terabytes ( w=1620MB/s , rw=796MB/s , r=2043MB/s )

If I understand the “multipath” limitation correctly, it is about multiple CONTROLLERS, not additional SAS lanes.

So I would immediately wire up the second x4 SAS port to the external enclosure. From what I can see with a quick Internet search, there are 2 connectors per I/O module. I would guess that each I/O module uses a different SAS path, so I would wire both ports of your SAS 9300-8e to the same I/O module.

If everything seems stable, then re-run your performance tests.

One issue I can see is that each 9300 SAS lane is only capable of 12Gbps, assuming the HPE D3710 enclosure uses a 12Gbps SAS expander. But you have 25 SSDs, each capable of about 5Gbps, so a total of 125Gbps. However, 4 lanes at 12Gbps is only 48Gbps. With 8 lanes, you get up to 96Gbps, which is much closer to what the SSDs are capable of.

It is worse if the HPE D3710 uses a 6Gbps SAS expander. However, the proposed solution still doubles the potential throughput.
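
Spelling the arithmetic out (a rough sketch, ignoring protocol and encoding overhead; the 12Gbps expander is an assumption):

# Aggregate SSD bandwidth: 25 SSDs at roughly 5Gbps each
echo $(( 25 * 5 ))    # ~125 Gbps
# One x4 cable at 12Gbps per lane vs. two x4 cables
echo $(( 4 * 12 ))    # 48 Gbps
echo $(( 8 * 12 ))    # 96 Gbps
# Worst case, if the expander only negotiates 6Gbps per lane
echo $(( 4 * 6 ))     # 24 Gbps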

To be clear, SAS does not properly load-share across the lanes wired to the same SAS expander. Thus, there is no perfect translation of lane speed times number of lanes into maximum throughput.

Now, I don’t know that this is your issue. But it should be pretty straightforward to test.

Clearly I was wrong about needing parallel jobs, but I think that @arwen has hit the target here.

@SmallBarky’s link to ZFS Raidz Performance, Capacity and Integrity Comparison @ Calomel.org is interesting; however, there are a couple of caveats I have spotted which mean it may not be useful for setting expectations:

  1. The benchmarks do not scale even close to linearly, i.e. a stripe of 24 disks gets only c. 6x the throughput of a single disk, which suggests another bottleneck elsewhere that prevents near-linear scaling. Perhaps that benchmark was also constrained by PCIe lane capacity?

  2. Unfortunately the benchmarks for SSDs didn’t include as many configurations as the HDD benchmarks, but the scaling constraints seem to be very similar.

So, whilst @arwen’s diagnosis sounds likely, I am not at all sure what throughput should be expected.

@SmallBarky & @Protopia Whoops, my bad. So at least 4x, or roughly 1.5 GB/s, should be possible-ish. That’s still about 2x away from what I measured, I guess? Assuming the individual disks’ performance isn’t worse for whatever reason (note to self: test this just to be sure). The HBA is also connected via 8 PCIe 3.0 lanes, so that should be … about 8 GB/s (in theory)?
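
For the note to self, something along these lines should do as a quick, non-destructive sanity check (a sketch; /dev/sdX and the PCIe address are placeholders for one of the pool members and the HBA):

# Read-only sequential test against a single member disk (no writes issued)
fio --name=disk-read --filename=/dev/sdX --readonly --rw=read --bs=1M \
    --direct=1 --ioengine=libaio --iodepth=32 --runtime=30 --time_based

# Confirm the HBA actually negotiated PCIe 3.0 x8
lspci -vv -s <hba-address> | grep -E 'LnkCap|LnkSta'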

@Arwen Oh, good point. I remember having connected the two different I/O modules to the HBA, and since the disks then appeared twice, I completely dropped the idea of connecting both ports. I’ll try connecting them to the same I/O module ASAP, thanks! The D3710 should have a 12G expander from what I found. I am still searching for a way to confirm that the connection is indeed a 4x 12G link and not somehow downgraded by the cable or something.
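
If anyone else needs it, the negotiated per-phy rates look like they should be visible through sysfs on SCALE; something like this is what I intend to check (a sketch, paths not yet verified on 25.04):

# Negotiated link rate of every SAS phy the kernel sees (HBA and expander side)
grep . /sys/class/sas_phy/phy-*/negotiated_linkrate

# Number of phys per port (a healthy x4 wide port should report 4)
grep . /sys/class/sas_port/port-*/num_phys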

Thank you all for your input - I’ll test next week and will report back.

The second SAS cable made quite the difference:

write-test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.33
Starting 1 process
Jobs: 1 (f=0): [f(1)][100.0%][w=1295MiB/s][w=1295 IOPS][eta 00m:00s]
write-test: (groupid=0, jobs=1): err= 0: pid=2501882: Mon Jul  7 19:03:26 2025
 write: IOPS=1394, BW=1395MiB/s (1462MB/s)(20.0GiB/14685msec); 0 zone resets
   slat (usec): min=399, max=15402, avg=711.17, stdev=314.10
   clat (usec): min=8, max=101429, avg=45067.93, stdev=8544.98
    lat (usec): min=696, max=102209, avg=45779.09, stdev=8660.35
   clat percentiles (msec):
    |  1.00th=[   30],  5.00th=[   34], 10.00th=[   35], 20.00th=[   37],
    | 30.00th=[   40], 40.00th=[   45], 50.00th=[   48], 60.00th=[   49],
    | 70.00th=[   50], 80.00th=[   51], 90.00th=[   53], 95.00th=[   55],
    | 99.00th=[   74], 99.50th=[   88], 99.90th=[  102], 99.95th=[  102],
    | 99.99th=[  102]
  bw (  MiB/s): min= 1190, max= 1886, per=99.84%, avg=1392.34, stdev=210.41, samples=29
  iops        : min= 1190, max= 1886, avg=1392.34, stdev=210.41, samples=29
 lat (usec)   : 10=0.01%, 750=0.01%
 lat (msec)   : 2=0.01%, 4=0.01%, 10=0.03%, 20=0.06%, 50=77.09%
 lat (msec)   : 100=22.57%, 250=0.21%
 cpu          : usr=11.62%, sys=79.75%, ctx=10691, majf=0, minf=15919
 IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.7%
    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
    issued rwts: total=0,20480,0,0 short=0,0,0,0 dropped=0,0,0,0
    latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
 WRITE: bw=1395MiB/s (1462MB/s), 1395MiB/s-1395MiB/s (1462MB/s-1462MB/s), io=20.0GiB (21.5GB), run=14685-14685msec

That’s a lot closer to what I’d expect, thanks! I’m still surprised that SAS cannot load-share the lanes better, but at least it works. :slight_smile: Thank you all, especially @Arwen!

You’re welcome.

Yes, me too.

About 10 years ago, I remember reading that of the 3 multi-pathing software options available for Fibre Channel, EMC had the best. Veritas Volume Manager worked reasonably well, and so did the one built into Solaris 10/11. But with EMC’s PowerPath (I think that is the name), not only did it work better, it was / is available on multiple OSes: Linux, AIX and Solaris.

However, with the migration to SAS, it is now up to the SAS controller’s firmware to decide pathing. Far fewer eyes are looking at it (if it is even visible outside the SAS controller’s firmware development group). Plus, the SAS standard requires responses to come back over the same path (if I remember correctly).