Need help: Mach.2 dual-actuator drive pool poor performance

Hello,

I just set up my first NAS using TrueNAS, and the performance is below my expectations. I’d like to ask about potential problems, as online searches didn’t give me a clear answer.

Environment

The following is my environment:

  • CPU: E5-2686 v4
  • Memory: 64GB ECC DDR4
  • TrueNAS version: 25.10.1
  • Network: 2.5Gb Ethernet

For storage, I have five 14TB Seagate Mach.2 dual-actuator drives. Because the interface is SATA, I can’t create the pool in the GUI if I want to utilize the full performance (the first actuator serves the first half of the LBA range and the second actuator serves the second half, so I need to use partitions to build the vdevs). I used the script mentioned here to partition each drive into two equal parts and created the pool on the command line.

My pool structure looks like this:

zpool create zp0 \
        raidz2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QA73-part1 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QVNV-part1 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QZXM-part1 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2R3HC-part1 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2RJ1V-part1 \
        raidz2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QA73-part2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QVNV-part2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2QZXM-part2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2R3HC-part2 \
                /dev/disk/by-id/ata-ST14000NM0121_ZKL2RJ1V-part2 \
        cache \
                /dev/disk/by-id/nvme-HUSMR7638BDP3Y1_SDM000079040-part1

Basically, the first partition of each drive goes into the first raidz2 vdev, and the second partition of each drive goes into the second raidz2 vdev. There’s also an SSD L2ARC (256GB).

The idea is that all the partitions in vdev 1 are served by the first actuator and all the partitions in vdev 2 by the second actuator, so the two vdevs should behave like independent drives and reach higher performance.

Theoretical performance

According to this article, the theoretical performance of a single raidz vdev can be calculated as follows:

N-wide RAIDZ, parity level p:

  • Read IOPS: Read IOPS of single drive
  • Write IOPS: Write IOPS of single drive
  • Streaming read speed: (N – p) * Streaming read speed of single drive
  • Streaming write speed: (N – p) * Streaming write speed of single drive
  • Storage space efficiency: (N – p)/N
  • Fault tolerance: p disks per vdev (1 for Z1, 2 for Z2, 3 for Z3)

Adding vdevs increases both IOPS and streaming read/write speed. Since my workload will mostly depend on streaming read/write speed, I’ll focus only on that part.

I have tested with fio that each independent partition of a Mach.2 drive can reach a streaming read/write speed of 200–250MB/s. I’ll use the lower end of 200MB/s to keep the calculation simple.

For each raidz2 vdev, the theoretical streaming speed should be (5 − 2) × 200 = 600MB/s, and since there are two vdevs, the overall streaming speed should be 600 × 2 = 1200MB/s.
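The arithmetic can be written out as a quick sanity check (a sketch; the 200 MB/s figure is the lower bound I measured per actuator with fio):

```shell
# Theoretical streaming throughput for N-wide RAIDZ with parity p:
#   per-vdev = (N - p) * per-disk streaming speed
N=5            # partitions (one actuator each) per vdev
P=2            # raidz2 parity
DISK_MBPS=200  # measured streaming speed of a single actuator (lower bound)
VDEVS=2        # one vdev per actuator half

VDEV_MBPS=$(( (N - P) * DISK_MBPS ))
POOL_MBPS=$(( VDEV_MBPS * VDEVS ))
echo "per-vdev: ${VDEV_MBPS} MB/s"   # 600 MB/s
echo "pool:     ${POOL_MBPS} MB/s"   # 1200 MB/s
```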

P.S. The fio parameters I used for testing:

[global]
direct=1
time_based=1
runtime=30
ramp_time=3
thread=1
group_reporting=0

ioengine=io_uring

[seqwrite_act0]
rw=write
bs=4m
iodepth=2
size=50%

[seqwrite_act1]
rw=write
bs=4m
iodepth=2
offset=50%
size=50%

[seqread_act0]
stonewall=1
rw=read
bs=4m
iodepth=2
size=50%

[seqread_act1]
rw=read
bs=4m
iodepth=2
offset=50%
size=50%

Testing

SMB

I set up SMB with a dataset using recordsize=4M; everything else is default.

Sending a single large file reaches ~285MB/s, which is pretty much the upper limit of a 2.5Gb network (to keep the source HDD from being a bottleneck, the files were sent from an SSD).

However, sending a folder of RAW (~20MB) and JPG (~5–10MB) files only reaches around 220–230MB/s.

This speed is much lower than the theoretical prediction, so I started to investigate further.

TN-Bench

I read this post and tried out the benchmarking tool (it’s really well designed). Here are my results:

############################################################
#                    Testing Pool: zp0                     #
############################################################

* Creating test dataset for pool: zp0
✓ Dataset zp0/tn-bench created successfully.

============================================================
 Space Verification
============================================================

* Available space: 38435.71 GiB
* Space required:  720.00 GiB (20 GiB/thread × 36 threads)
✓ Sufficient space available - proceeding with benchmarks

============================================================
 Testing Pool: zp0 - Threads: 1
============================================================

* Running DD write benchmark with 1 threads...
* Run 1 write speed: 288.37 MB/s
✓ Average write speed: 288.37 MB/s
* Running DD read benchmark with 1 threads...
* Run 1 read speed: 6916.52 MB/s
✓ Average read speed: 6916.52 MB/s

============================================================
 Testing Pool: zp0 - Threads: 9
============================================================

* Running DD write benchmark with 9 threads...
* Run 1 write speed: 651.72 MB/s
✓ Average write speed: 651.72 MB/s
* Running DD read benchmark with 9 threads...
* Run 1 read speed: 2094.20 MB/s
✓ Average read speed: 2094.20 MB/s

============================================================
 Testing Pool: zp0 - Threads: 18
============================================================

* Running DD write benchmark with 18 threads...
* Run 1 write speed: 618.07 MB/s
✓ Average write speed: 618.07 MB/s
* Running DD read benchmark with 18 threads...
* Run 1 read speed: 2142.09 MB/s
✓ Average read speed: 2142.09 MB/s

Note that, since I didn’t set zfs_arc_max to 1 to bypass the ARC as TN-Bench instructs, the read speeds are inflated by caching and not meaningful.

As you can see, the maximum write speed occurs at threads=9; it is still only about 650MB/s, however.

Also, in the single-thread scenario, the write speed dropped to ~290MB/s, and I don’t know why.

fio

The configuration:

[global]
directory=/mnt/zp0/media
filename=fio_testfile
size=50g

time_based=1
runtime=30
ramp_time=3

ioengine=io_uring
direct=1
thread=1
numjobs=1


stonewall=1

[seqwrite_4m_q8]
rw=write
bs=4m
iodepth=8

[seqread_4m_q8]
rw=read
bs=4m
iodepth=8

[randwrite_4k_q16]
rw=randwrite
bs=4k
iodepth=16

[randread_4k_q16]
rw=randread
bs=4k
iodepth=16

The result:

seqwrite_4m_q8: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=8
seqread_4m_q8: (g=1): rw=read, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=io_uring, iodepth=8
randwrite_4k_q16: (g=2): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=16
randread_4k_q16: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=16
fio-3.33
Starting 4 threads
Jobs: 1 (f=1): [_(3),r(1)][59.6%][r=2932KiB/s][r=733 IOPS][eta 01m:30s]
seqwrite_4m_q8: (groupid=0, jobs=1): err= 0: pid=268715: Sun Jan 25 20:43:00 2026
  write: IOPS=55, BW=224MiB/s (235MB/s)(6756MiB/30132msec); 0 zone resets
    slat (usec): min=36, max=192, avg=103.81, stdev=23.43
    clat (msec): min=132, max=291, avg=142.82, stdev=21.28
     lat (msec): min=133, max=291, avg=142.92, stdev=21.28
    clat percentiles (msec):
     |  1.00th=[  134],  5.00th=[  134], 10.00th=[  136], 20.00th=[  136],
     | 30.00th=[  136], 40.00th=[  136], 50.00th=[  136], 60.00th=[  138],
     | 70.00th=[  138], 80.00th=[  142], 90.00th=[  150], 95.00th=[  190],
     | 99.00th=[  253], 99.50th=[  257], 99.90th=[  275], 99.95th=[  292],
     | 99.99th=[  292]
   bw (  KiB/s): min=147456, max=246252, per=99.99%, avg=229572.70, stdev=21295.49, samples=60
   iops        : min=   36, max=   60, avg=56.03, stdev= 5.20, samples=60
  lat (msec)   : 250=99.35%, 500=1.07%
  cpu          : usr=0.62%, sys=0.02%, ctx=1689, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1682,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
seqread_4m_q8: (groupid=1, jobs=1): err= 0: pid=268772: Sun Jan 25 20:43:00 2026
  read: IOPS=703, BW=2814MiB/s (2951MB/s)(82.5GiB/30015msec)
    slat (nsec): min=1025, max=50005, avg=2990.44, stdev=3541.67
    clat (usec): min=248, max=423912, avg=11371.81, stdev=19383.19
     lat (usec): min=250, max=423933, avg=11374.80, stdev=19384.41
    clat percentiles (usec):
     |  1.00th=[   255],  5.00th=[   273], 10.00th=[   281], 20.00th=[   285],
     | 30.00th=[   293], 40.00th=[   310], 50.00th=[   482], 60.00th=[  2409],
     | 70.00th=[ 14353], 80.00th=[ 22938], 90.00th=[ 35390], 95.00th=[ 42730],
     | 99.00th=[ 81265], 99.50th=[109577], 99.90th=[173016], 99.95th=[206570],
     | 99.99th=[299893]
   bw (  MiB/s): min=  737, max=12776, per=100.00%, avg=2815.51, stdev=3316.43, samples=60
   iops        : min=  184, max= 3194, avg=703.80, stdev=829.09, samples=60
  lat (usec)   : 250=0.13%, 500=50.04%, 750=3.17%, 1000=0.34%
  lat (msec)   : 2=5.02%, 4=2.16%, 10=3.69%, 20=10.68%, 50=21.35%
  lat (msec)   : 100=2.83%, 250=0.58%, 500=0.04%
  cpu          : usr=0.32%, sys=0.40%, ctx=20902, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=21112,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
randwrite_4k_q16: (groupid=2, jobs=1): err= 0: pid=268858: Sun Jan 25 20:43:00 2026
  write: IOPS=97, BW=391KiB/s (400kB/s)(11.5MiB/30129msec); 0 zone resets
    slat (nsec): min=1234, max=41558, avg=4846.25, stdev=3295.09
    clat (msec): min=40, max=821, avg=164.03, stdev=87.42
     lat (msec): min=40, max=821, avg=164.04, stdev=87.42
    clat percentiles (msec):
     |  1.00th=[   72],  5.00th=[   88], 10.00th=[   99], 20.00th=[  110],
     | 30.00th=[  121], 40.00th=[  129], 50.00th=[  138], 60.00th=[  148],
     | 70.00th=[  163], 80.00th=[  192], 90.00th=[  284], 95.00th=[  342],
     | 99.00th=[  535], 99.50th=[  676], 99.90th=[  785], 99.95th=[  785],
     | 99.99th=[  818]
   bw (  KiB/s): min=  104, max=  584, per=99.75%, avg=390.60, stdev=132.44, samples=60
   iops        : min=   26, max=  146, avg=97.63, stdev=33.09, samples=60
  lat (msec)   : 50=0.07%, 100=11.67%, 250=75.39%, 500=12.32%, 750=0.96%
  lat (msec)   : 1000=0.10%
  cpu          : usr=0.08%, sys=0.14%, ctx=4133, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2930,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16
randread_4k_q16: (groupid=3, jobs=1): err= 0: pid=269031: Sun Jan 25 20:43:00 2026
  read: IOPS=707, BW=2831KiB/s (2899kB/s)(83.1MiB/30074msec)
    slat (nsec): min=1081, max=22709, avg=2397.51, stdev=1599.04
    clat (usec): min=6, max=501254, avg=22636.54, stdev=37739.07
     lat (usec): min=7, max=501256, avg=22638.93, stdev=37739.08
    clat percentiles (usec):
     |  1.00th=[    12],  5.00th=[    13], 10.00th=[    14], 20.00th=[    15],
     | 30.00th=[    18], 40.00th=[    22], 50.00th=[    24], 60.00th=[ 15795],
     | 70.00th=[ 28443], 80.00th=[ 42730], 90.00th=[ 67634], 95.00th=[ 94897],
     | 99.00th=[168821], 99.50th=[204473], 99.90th=[295699], 99.95th=[350225],
     | 99.99th=[413139]
   bw (  KiB/s): min= 1288, max= 3424, per=100.00%, avg=2836.90, stdev=326.69, samples=60
   iops        : min=  322, max=  856, avg=709.17, stdev=81.64, samples=60
  lat (usec)   : 10=0.10%, 20=34.92%, 50=20.70%, 100=0.21%, 250=0.04%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.02%
  lat (msec)   : 2=0.12%, 4=0.18%, 10=1.05%, 20=5.55%, 50=21.00%
  lat (msec)   : 100=11.73%, 250=4.18%, 500=0.24%, 750=0.01%
  cpu          : usr=0.28%, sys=0.36%, ctx=21123, majf=0, minf=0
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=21271,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: bw=224MiB/s (235MB/s), 224MiB/s-224MiB/s (235MB/s-235MB/s), io=6756MiB (7084MB), run=30132-30132msec

Run status group 1 (all jobs):
   READ: bw=2814MiB/s (2951MB/s), 2814MiB/s-2814MiB/s (2951MB/s-2951MB/s), io=82.5GiB (88.6GB), run=30015-30015msec

Run status group 2 (all jobs):
  WRITE: bw=391KiB/s (400kB/s), 391KiB/s-391KiB/s (400kB/s-400kB/s), io=11.5MiB (12.1MB), run=30129-30129msec

Run status group 3 (all jobs):
   READ: bw=2831KiB/s (2899kB/s), 2831KiB/s-2831KiB/s (2899kB/s-2899kB/s), io=83.1MiB (87.2MB), run=30074-30074msec

The result is only 235MB/s for streaming write. My understanding was that ioengine=io_uring with a larger iodepth like 8 would provide more concurrency, yet the result is even worse than the single-thread performance measured by TN-Bench.

My Questions

  • Why is the streaming write speed significantly lower than the theoretical value? Did I set up the pool the wrong way? Or is there another bottleneck in my system, like poor CPU single-core performance?
  • In TN-Bench, why does running dd in multiple threads increase performance? I thought that, for writing a single large file, this shouldn’t matter much.
  • If concurrency is what’s needed to reach a better streaming write speed, why doesn’t a larger iodepth help in fio? (I tested iodepth=1 and got 233MB/s.)

Thank you very much for reading this! Any suggestions or recommendations of resources/articles are helpful.

Just tried setting direct=0 in fio, and that brings the streaming write speed of the pool up to 619MB/s. I guess that answers my third question.

Perhaps the original low speed is limited by the performance of a single HDD.
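For reference, the only change from my fio config above is the direct flag — buffered writes land in the ARC and get flushed as aggregated transaction groups, which seems to keep the vdevs busier:

```ini
[global]
directory=/mnt/zp0/media
filename=fio_testfile
size=50g
time_based=1
runtime=30
ramp_time=3
ioengine=io_uring
direct=0        ; was direct=1; buffered I/O goes through the ARC
thread=1
numjobs=1

[seqwrite_4m_q8]
rw=write
bs=4m
iodepth=8
```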

Hey there, I’ll try to do your questions a bit of justice, so bear with me here. So, yeah, I’ve seen this kind of “math says 1200 MB/s but reality says nope” setup before, and there are a few gotchas hiding in plain sight here.

Your SMB numbers are basically… network limited

On 2.5GbE, the real-world ceiling is roughly 280–300 MB/s once you include overhead. So your ~285 MB/s on one big file is actually “as good as it gets” over that link.

And the 220–230 MB/s on folders of 5–20MB files is normal-ish because SMB is doing way more work: metadata ops, open/close chatter, directory updates, small I/O patterns, etc. Small files always “feel slower” than one giant stream.
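Quick back-of-the-napkin on that ceiling (the ~6% protocol overhead is an assumption; the real figure depends on MTU, SMB signing, and so on):

```shell
# 2.5 Gbit/s line rate converted to MB/s, then minus framing overhead
RAW_MBPS=$(( 2500000000 / 8 / 1000000 ))   # 312 MB/s on the wire
echo "raw line rate: ${RAW_MBPS} MB/s"
# Subtract an assumed ~6% for Ethernet/IP/TCP/SMB framing
awk 'BEGIN { printf "usable (~6%% overhead): ~%d MB/s\n", 312.5 * 0.94 }'
```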

The bigger issue is the pool layout. Both RAIDZ2 vdevs are built from partitions on the same physical disks, so they are not independent.


Just FYI from another post:


If done well, the two partitions are served by different actuators, so they actually ARE mechanically independent.

But network indeed appears to be the limiting factor.


Hi,

Thank you for the reply.

I understand that SMB is network-limited. However, when I run benchmarks locally, the speed doesn’t come close to the theoretical value, so I’m wondering if I did something wrong in the setup.

The bigger issue is the pool layout. Both RAIDZ2 vdevs are built from partitions on the same physical disks, so they are not independent.

For Seagate’s Mach.2 drives, there are two physical actuators. On the SATA version, they are actually independent if I partition the disk into two equal halves. If the interface were SAS, this would have been easier, because the drive would show up as two separate devices (two LUNs on the same drive).

When setting up the partitions, I referred to these two articles — mostly the second one, as I used the script provided there.

I have tested with fio that, when running streaming read/write on the two partitions simultaneously, each partition can reach 200–250MB/s, so I can confirm that the drives themselves are fine.


Thanks for letting me know, just upvoted the feature request.

Perhaps something in the revised script from the second link doesn’t set the partitions up properly? I don’t have time to look through it, though, so take that for what it’s worth.

truenas_admin@truenas ~ [1]> sudo parted /dev/sdf
GNU Parted 3.5
Using /dev/sdf
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ATA ST14000NM0121 (scsi)
Disk /dev/sdf: 14.0TB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name              Flags
 1      2097kB  7000GB  7000GB  zfs          MACH2_A_ZKL2QA73
 2      7000GB  14.0TB  7000GB  zfs          MACH2_B_ZKL2QA73

Hi,

Thank you for the reply. I used parted to check the partitioning of one of the Mach.2 drives, and this is what it looks like. Seems fine to me, as there are two equal partitions. Please let me know if there are any subtle problems.

The E5-2686 v4 is only clocked at 2.3GHz, so single-threaded performance (numjobs=1) may be a bottleneck for your streaming writes, but I’ll have to compare against some local tests.

In general, though, more jobs/parallelism is going to be needed to keep all of the drives active, since you’ve set a dataset with recordsize=4M on a 2×5-wide Z2. Try re-running with numjobs=4 or higher, since you’re not short on cores :slight_smile:

Can you show us the raw LBA numbers with sudo sfdisk -d /dev/sdf? The rounding here to “7000GB” might mean that your partitions aren’t exactly at the spindle split.

The earlier feature request/statement about dual-actuator support still stands, though. It might be worth re-investigating given the current state of the market :thinking:


Thanks for the reply. Here’s the result for sudo sfdisk -d /dev/sdf:

truenas_admin@truenas ~> sudo sfdisk -d /dev/sdf
label: gpt
label-id: 3B7E0E2E-13C9-4721-8D43-E1E02B6C9E11
device: /dev/sdf
unit: sectors
first-lba: 34
last-lba: 27344764894
sector-size: 512

/dev/sdf1 : start=        4096, size= 13672374272, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=C26F9CBC-8FA2-4F40-B16D-636AC5B11A65, name="MACH2_A_ZKL2QA73"
/dev/sdf2 : start= 13672386560, size= 13672374272, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4, uuid=BCBB676B-FA8A-4385-BE7A-422C9484E51D, name="MACH2_B_ZKL2QA73"

Thanks - based on last-lba, the split point should be 13672382447, and your two partitions sit neatly on either side of it, so that shouldn’t be contributing to the issue.

Did you fire a fio with numjobs=4 by any chance to see if that lifted things up?

Great to hear that the partitioning is correct. I haven’t run fio with numjobs=4 yet; I’ll get back to you once I have access to the machine. I wonder if running TN-Bench (which uses dd) with multiple threads gives a similar result to fio with numjobs=4. For TN-Bench, the maximum write speed is about 650MB/s (still around half of the theoretical value), reached with 9 threads.

Maybe take a look at how ZFS is using the drives while testing.

Out of curiosity, is it as simple as last/2, or should it be (last − first)/2, assuming an equal number of reserved/unused sectors at each end? Anyway, that would give a mid-point at 13672382430 and the same conclusion.
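Both conventions can be checked against the sfdisk numbers (a quick sketch; values copied from the output earlier in the thread):

```shell
FIRST_LBA=34
LAST_LBA=27344764894
P1_END=$(( 4096 + 13672374272 - 1 ))    # last sector of /dev/sdf1
P2_START=13672386560                    # first sector of /dev/sdf2

MID_A=$(( LAST_LBA / 2 ))               # last/2            -> 13672382447
MID_B=$(( (LAST_LBA - FIRST_LBA) / 2 )) # (last - first)/2  -> 13672382430
echo "mid-point candidates: $MID_A / $MID_B"

# Same conclusion either way: part1 ends below both candidate
# mid-points, and part2 starts above both.
[ "$P1_END" -lt "$MID_B" ] && [ "$P2_START" -gt "$MID_A" ] \
  && echo "partitions straddle the split"
```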

Actually, Seagate might have done us a solid here.

Seagate® has worked with the T13 ATA committee to propose and implement a new
log page for SATA—the Concurrent Positioning Ranges log page 47h identifies the number
of LBA ranges (in this case, actuators) within a device. For each LBA range the log page
specifies the lowest LBA and the number of LBAs. As a reminder, since LBA numbering
starts at zero, the last LBA of either range will be the lowest LBA + the number of LBAs – 1.

In Linux kernel 5.19, the independent ranges can be found in /sys/block/<device>/queue/independent_access_ranges. There is one sub-directory per actuator, starting with the primary at “0.” The “nr_sectors” field reports how many sectors are managed by this actuator, and “sector” is the offset of the first sector. Sectors then run contiguously to the start of the next actuator’s range.

@ttzytt can you check to see if these directories are populated on your machine correctly?


Hi @HoneyBadger,

These directories indeed exist on my machine. Here’s the content of one of the drives:

truenas_admin@truenas /s/b/s/q/i/0> cat sector
0
truenas_admin@truenas /s/b/s/q/i/0> cat nr_sectors
13672382464
truenas_admin@truenas /s/b/s/q/i/0> cd ../1
truenas_admin@truenas /s/b/s/q/i/1> cat sector
13672382464
truenas_admin@truenas /s/b/s/q/i/1> cat nr_sectors
13672382464
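Cross-checking those kernel-reported ranges against my sfdisk partition table (numbers copied from the outputs earlier in the thread), both partitions land entirely within their intended actuator’s range:

```shell
# Actuator ranges from /sys/block/<dev>/queue/independent_access_ranges
ACT_SECTORS=13672382464   # each actuator manages this many sectors
ACT1_START=13672382464    # actuator 1 begins where actuator 0 ends

# Partition table from sfdisk
P1_START=4096;        P1_SIZE=13672374272
P2_START=13672386560; P2_SIZE=13672374272

# part1 must lie entirely within actuator 0, part2 within actuator 1
[ $(( P1_START + P1_SIZE )) -le "$ACT_SECTORS" ] \
  && echo "part1 fully on actuator 0"
[ "$P2_START" -ge "$ACT1_START" ] \
  && [ $(( P2_START + P2_SIZE )) -le $(( ACT1_START + ACT_SECTORS )) ] \
  && echo "part2 fully on actuator 1"
```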

Did you fire a fio with numjobs=4 by any chance to see if that lifted things up?

Just tested it and got 640MB/s — so concurrency helps, but something else still seems to cap the pool at roughly half the theoretical value.