TN-Bench -- An Open-Source Community TrueNAS Benchmarking and Testing Utility

Reran the tests last night and this time it exited nicely, complete with a functioning output file. Do you want me to send it to you, Nick?

One of the things to consider for future editions is whether UUIDs, system names, etc. should be omitted, especially if this is to become an effort where more people participate. The current output file is very verbose; you may consider adding a second output-file option that skips the benefit/complexity of JSON and lists only the essentials.
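In the meantime, anyone sharing results could scrub the sensitive bits with a post-processing pass. A minimal sketch with jq (the field names here are my guesses, not TN-Bench’s actual schema):

jq 'del(.system_info.hostname, .system_info.uuid, .disks[].serial)' \
  tn-bench-results.json > tn-bench-results.scrubbed.json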

I think I answer your question here:

Basically, ARC exists. Why should we run from it? I merely put breadcrumbs for those who want to pretend ARC does not exist.

According to the docs, it can’t be below 64MB (because 64MB ought to be enough for anybody). Moreover, they state:

It cannot be set back to 0 while running and reducing it below the current ARC size will not cause the ARC to shrink without memory pressure to induce shrinking.

The above is why my doc indicates you need to restart after setting it back to 0.
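For reference, on TrueNAS SCALE (Linux OpenZFS) the knob lives in sysfs; a quick sketch of checking and capping it (values in bytes):

# Current cap (0 means auto) and live ARC size:
cat /sys/module/zfs/parameters/zfs_arc_max
awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats

# Cap the ARC at 8 GiB for a benchmark run:
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max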

Up to you, is there a question you have about the result?

Yes, and this is an interesting topic. I have thought about this a lot, and I have not yet decided on the exact direction.

For now:

  1. The standard output in the terminal window is designed to be human-readable and easily parsed as the script runs. The output is intentionally very verbose, because things tend to break more often on a system with hardware problems when under load :wink:
    Generating a results file mid-stream may not be possible, since I am doing math on the backend. I would prefer not to generate incomplete output files, but I can consider it for debugging purposes.

  2. I’d imagine folks like @mattheja will continue using this script for burn-in testing, and I intend to add the ability to loop forever (see the sketch after this list). :slight_smile:

  3. The JSON output is designed to be very verbose, and it’s meant for machine readability. I have considered what you are saying re “PII” like UUIDs or hostnames.
    IMO, ZFS UUIDs are globally unique and there’s no real security concern with sharing them. Hostnames/serial numbers may be a different story, and I have considered omitting them.

  4. However, I’d like to eventually build an “Open-Source Leaderboard Database Website” and I might want to use these values as “keys”, even though I don’t intend to present them.
    I think a “tuple” of a few keys could serve as a “TN-Bench GUID”. This way you can look through the benchmark history for your own system on the website.
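A minimal sketch of the burn-in loop from item 2, until it’s built in (the script name here is an assumption):

while true; do
    ./tn-bench.py 2>&1 | tee -a tn-bench-burnin.log
done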

I’m not sure yet; I just want to have the tools on the script side that I’d need to build a compare-systems page like 3DMark or PassMark. The JSON generated is owned by the user, and it’s their right to share it or not.
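To sketch the composite-key idea from item 4 (which exact fields would go into the tuple is undecided; these are placeholders): hash the identifiers together, so the website can key on the digest without ever presenting the raw values.

# Hypothetical "TN-Bench GUID": a digest over a tuple of identifiers.
printf '%s|%s|%s' "$hostname" "$system_serial" "$pool_guid" | sha256sum | cut -d' ' -f1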

It’s still running for me, but something that’s striking is the abysmal write performance of my NVMe SSD pool. It’s a mirrored pair of Samsung SSD 990 EVO 2TB drives, and here’s what I’m seeing:

============================================================
 Space Verification
============================================================

* Available space: 1464.88 GiB
* Space required:  1120.00 GiB (20 GiB/thread × 56 threads)
✓ Sufficient space available - proceeding with benchmarks

============================================================
 Testing Pool: software - Threads: 1
============================================================

* Running DD write benchmark with 1 threads...
* Run 1 write speed: 313.32 MB/s
* Run 2 write speed: 310.43 MB/s
✓ Average write speed: 311.87 MB/s
* Running DD read benchmark with 1 threads...
* Run 1 read speed: 6146.37 MB/s
* Run 2 read speed: 5572.58 MB/s
✓ Average read speed: 5859.48 MB/s

============================================================
 Testing Pool: software - Threads: 14
============================================================

* Running DD write benchmark with 14 threads...
* Run 1 write speed: 237.03 MB/s
* Run 2 write speed: 277.38 MB/s
✓ Average write speed: 257.20 MB/s
* Running DD read benchmark with 14 threads...
* Run 1 read speed: 7019.43 MB/s
* Run 2 read speed: 8422.22 MB/s
✓ Average read speed: 7720.82 MB/s

These are in the two NVMe slots on my motherboard. I don’t have much experience with NVMe devices, but this seems off to me.

System configuration is in the .sig, but it’s 2x Xeon Gold 6132s, 128 GB RAM, in a SuperMicro X11DPH-T.

First thing to double-check; I assume this is not the issue, but I just wanted to make sure.

Check the bus width and generation for the current state of your NVMe drives. Here’s a nifty AI-generated one-liner.

root@prod:~# for d in /sys/block/nvme*; do dev=$(basename "$d"); pci=$(basename "$(readlink -f "$d/device/device")"); lanes=$(cat /sys/bus/pci/devices/$pci/current_link_width); rawspeed=$(cat /sys/bus/pci/devices/$pci/current_link_speed); speed=${rawspeed%% *}; case "$speed" in 2.5) gen=Gen1 ;; 5.0) gen=Gen2 ;; 8.0) gen=Gen3 ;; 16.0) gen=Gen4 ;; 32.0) gen=Gen5 ;; 64.0) gen=Gen6 ;; *) gen=Unknown ;; esac; printf "%-12s %-15s %-10s %-15s %-8s\n" "/dev/$dev" "$pci" "$lanes" "$rawspeed" "($gen)"; done
/dev/nvme0n1 0000:83:00.0    4          8.0 GT/s PCIe   (Gen3)
/dev/nvme1n1 0000:84:00.0    4          8.0 GT/s PCIe   (Gen3)
/dev/nvme2n1 0000:85:00.0    4          8.0 GT/s PCIe   (Gen3)
/dev/nvme3n1 0000:86:00.0    4          8.0 GT/s PCIe   (Gen3)
/dev/nvme4n1 0000:01:00.0    4          16.0 GT/s PCIe  (Gen4)
/dev/nvme5n1 0000:02:00.0    4          8.0 GT/s PCIe   (Gen3)
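Same logic, unrolled for readability:

for d in /sys/block/nvme*; do
    dev=$(basename "$d")
    # /sys/block/nvmeXnY/device is the NVMe controller; its device link is the PCI function
    pci=$(basename "$(readlink -f "$d/device/device")")
    lanes=$(cat "/sys/bus/pci/devices/$pci/current_link_width")
    rawspeed=$(cat "/sys/bus/pci/devices/$pci/current_link_speed")
    speed=${rawspeed%% *}
    case "$speed" in
        2.5)  gen=Gen1 ;;
        5.0)  gen=Gen2 ;;
        8.0)  gen=Gen3 ;;
        16.0) gen=Gen4 ;;
        32.0) gen=Gen5 ;;
        64.0) gen=Gen6 ;;
        *)    gen=Unknown ;;
    esac
    printf "%-12s %-15s %-10s %-15s %-8s\n" "/dev/$dev" "$pci" "$lanes" "$rawspeed" "($gen)"
done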

Thanks.

root@truenas[~]# for d in /sys/block/nvme*; do dev=$(basename "$d"); pci=$(basename "$(readlink -f "$d/device/device")"); lanes=$(cat /sys/bus/pci/devices/$pci/current_link_width); rawspeed=$(cat /sys/bus/pci/devices/$pci/current_link_speed); speed=${rawspeed%% *}; case "$speed" in 2.5) gen=Gen1 ;; 5.0) gen=Gen2 ;; 8.0) gen=Gen3 ;; 16.0) gen=Gen4 ;; 32.0) gen=Gen5 ;; 64.0) gen=Gen6 ;; *) gen=Unknown ;; esac; printf "%-12s %-15s %-10s %-15s %-8s\n" "/dev/$dev" "$pci" "$lanes" "$rawspeed" "($gen)"; done

/dev/nvme0n1 0000:3b:00.0    4          8.0 GT/s PCIe   (Gen3)
/dev/nvme1n1 0000:3c:00.0    4          8.0 GT/s PCIe   (Gen3)

I don’t have access to one of these drives, but I do have some older Samsung and other M.2s I can test. I can see where they land for the lulz.

I suspect we’ll find that consumer-y M.2 drives, and EVO (TLC) and QLC drives in particular, will probably show pretty poor results in this test. You’re writing 1120.00 GiB to disk all at once, or roughly half a “drive write” in your case.

This should more than saturate any caches on the drive itself, which was an intentional design decision for the benchmark: worst-case scenario, very busy system.

The drive firmware/controller will buffer some of the writes it is issued:

  1. first in “L1” DRAM (and there are plenty of M.2s without ANY DRAM),
  2. then in an “L2” pseudo-“SLC” cache,

and then the rest is flushed off to whatever MLC/TLC/QLC NAND is on the stick.
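If you want to watch this happen on your own pool, a rough sketch (the dataset name is just an example, and compression must be off or /dev/zero will compress away to nothing): write a few successive chunks and watch the MB/s figure dd reports fall off once the pSLC cache fills.

zfs create -o compression=off software/cachetest
for i in 1 2 3 4; do
    dd if=/dev/zero of=/mnt/software/cachetest/chunk$i bs=1M count=16384 conv=fdatasync
done
zfs destroy software/cachetest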

These caches tend not to be very big. I’ve not used this particular software, but this graphic here is perfect, so credit where credit is due:
https://tools4free.github.io/ssd-slow-mark/

I’ll try to get some more data for you, @dan, but this may be expected behavior.

You are also “downclocking” PCIe 5 drives to a PCIe 3 bus, so that’s an interesting variable to test as well. I can probably spin up TrueNAS on my gaming desktop some day, but I don’t have any PCIe 5 drives right now.
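You can check what a slot negotiated versus what the drive is capable of with lspci (substitute a PCI address from the loop above); LnkCap is the device’s maximum, LnkSta is the live link:

lspci -vv -s 0000:01:00.0 | grep -E 'LnkCap:|LnkSta:'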

I’m more than a little surprised that write performance is worse than a pair of mirrored spinners. Writes being slower than reads doesn’t surprise me much, but these are a lot slower.

According to the product page, they’re Gen5x2, or Gen4x4. I’m not sure how that plays out here.

The short answer is, I think we’re writing more to the drive than its cache can handle. And when you only have a single TLC NAND package on an M.2 stick (can you confirm whether yours has 1 or 2?)…I can absolutely see how it can perform worse than a hard drive. I’ve had some dog-slow writing SSDs over the years.

A lot of my understanding of how SSDs work is credited to the work Allyn Malventano did over at PCPer in the latter half of the past decade.
First, some definitions for the thread:

NAND Type | Bits per Cell | Voltage States per Cell
----------+---------------+------------------------
SLC       |             1 |                       2
MLC       |             2 |                       4
TLC       |             3 |                       8
QLC       |             4 |                      16

This is a perfect example of what I am describing. Although it’s a little old, its points still apply to more modern SSDs.


Maybe someday I’ll get this level of granularity with TN-Bench.

EDIT: Also @dan, interestingly, the CrystalDiskMark RND4KQ1T1 benchmark result is right around where TN-Bench is. It may just be a coincidence, or it may actually be interesting.

Since ZFS is actually issuing 4K writes to these drives (ashift=12), this may actually be a cheeky way, when shopping for drives, to estimate what ZFS write performance as measured by TN-Bench’s method might look like. Whatever Crystal is doing must be a similar enough method to produce a roughly analogous number. And everyone seems to test with CrystalDiskMark anyway.
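If you want to double-check that assumption on your own system (pool and device names are examples):

# ashift=12 means 4 KiB allocations:
zpool get ashift software
# And what the drive itself reports for sector sizes:
lsblk -o NAME,PHY-SEC,LOG-SEC /dev/nvme0n1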

It’s not a perfect analogy, though.
Because we’re doing the dd 1M at a time here, we’ve basically got the accelerator floored, and I’m not sure what Crystal is doing.

I think the reason your result in TN-Bench is worse than that result in Crystal is a combination of running in Gen3 mode and caching behaviors; in other words, something weird at the hardware level.


That’s only when using all 56 threads, right? Using a single thread, it’s only 20 GiB, and even there it only averages 312 MB/sec. Yeah, it’s worse with 14 threads, but it doesn’t have to get to the large writes to have a problem, and it sounds like it’s a caching issue. Interesting.

Yah, you’re right.
Hmm. Let me ponder and test some more.

These are also “downclocked” drives, but it’s a different situation: these are x4 drives running at x2, but Gen3 in both cases.

root@notprod[~]# for d in /sys/block/nvme*; do dev=$(basename "$d"); pci=$(basename "$(readlink -f "$d/device/device")"); lanes=$(cat /sys/bus/pci/devices/$pci/current_link_width); rawspeed=$(cat /sys/bus/pci/devices/$pci/current_link_speed); speed=${rawspeed%% *}; case "$speed" in 2.5) gen=Gen1 ;; 5.0) gen=Gen2 ;; 8.0) gen=Gen3 ;; 16.0) gen=Gen4 ;; 32.0) gen=Gen5 ;; 64.0) gen=Gen6 ;; *) gen=Unknown ;; esac; printf "%-12s %-15s %-10s %-15s %-8s\n" "/dev/$dev" "$pci" "$lanes" "$rawspeed" "($gen)"; done
/dev/nvme0n1 0000:60:00.0    2          8.0 GT/s PCIe   (Gen3)
/dev/nvme1n1 0000:61:00.0    2          8.0 GT/s PCIe   (Gen3)
/dev/nvme2n1 0000:62:00.0    2          8.0 GT/s PCIe   (Gen3)
/dev/nvme3n1 0000:63:00.0    2          8.0 GT/s PCIe   (Gen3)

I have data on this system so I can’t break up this pool to test a mirror right now.

root@notprod[~]# zpool status fire
  pool: fire
 state: ONLINE
  scan: scrub repaired 0B in 00:21:55 with 0 errors on Wed Aug  6 00:21:59 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	fire                                      ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    3388ac6e-cab2-44db-9263-fa2ba74533d6  ONLINE       0     0     0
	    d202b781-89b4-4a71-ae7a-c60f2ea42107  ONLINE       0     0     0
	    de403cf3-6f47-4fc4-8eb4-3c0b91aa76e7  ONLINE       0     0     0
	    1e922cd8-ba0e-4c36-b317-a8dd6ecb52eb  ONLINE       0     0     0

TN-Bench on a dual-socket Xeon Silver 4114 system

-----------+---------------------------
Name       | nvme0n1
Model      | INTEL SSDPE2KE016T8
Serial     | PHLN013100MD1P6AGN
ZFS GUID   | 17475493647287877073
Pool       | fire
Size (GiB) | 1400.00
-----------+---------------------------
Name       | nvme1n1
Model      | INTEL SSDPE2KE016T8
Serial     | PHLN931600FE1P6AGN
ZFS GUID   | 11275382002255862348
Pool       | fire
Size (GiB) | 1400.00
-----------+---------------------------
Name       | nvme2n1
Model      | SAMSUNG MZWLL1T6HEHP-00003
Serial     | S3HDNX0KB01220
ZFS GUID   | 4368323531340162613
Pool       | fire
Size (GiB) | 1399.22
-----------+---------------------------
Name       | nvme3n1
Model      | SAMSUNG MZWLL1T6HEHP-00003
Serial     | S3HDNX0KB01248
ZFS GUID   | 3818548647571812337
Pool       | fire
Size (GiB) | 1399.22
-----------+---------------------------
============================================================
 Testing Pool: ice - Threads: 1
============================================================

* Running DD write benchmark with 1 threads...
* Run 1 write speed: 233.13 MB/s
* Run 2 write speed: 232.29 MB/s
✓ Average write speed: 232.71 MB/s
* Running DD read benchmark with 1 threads...
* Run 1 read speed: 5878.96 MB/s
* Run 2 read speed: 5701.97 MB/s
✓ Average read speed: 5790.47 MB/s

============================================================
 Testing Pool: ice - Threads: 10
============================================================

* Running DD write benchmark with 10 threads...
* Run 1 write speed: 1824.02 MB/s
* Run 2 write speed: 1909.31 MB/s
✓ Average write speed: 1866.66 MB/s
* Running DD read benchmark with 10 threads...
* Run 1 read speed: 15102.95 MB/s
* Run 2 read speed: 29395.15 MB/s
✓ Average read speed: 22249.05 MB/s

So your result still looks consistent with what I am seeing.

And probably also consistent with what I am seeing here

None of this is apples-to-apples, but it’s the data we have.

But it does prove that single-threaded performance of NVMe is a complicated issue. :grimacing:

From a TN-Bench result perspective, 200-400 MB/s seems to be about right for small NVMe pools on DDR4/PCIe Gen3 platforms. I can most likely do some Gen3/Gen4 testing in my EPYC system this weekend.


I’ve always wondered whether your approach of testing with dd and pseudo-devices is better than using fio. I observed “unrealistic” drops in performance and significant discrepancies between read and write speeds that could not be attributed to poor pool design, PCI lane starvation, CPU load, or ARC configuration. (All of these are possible, but not probable, at least not to the extent seen.)


People have been using dd and devices like /dev/urandom and /dev/null to test relative performance since long before fio existed. It is the original, stone-age performance-testing tool. The philosophy for TN-Bench is: keep it simple, stupid, and leverage mature technology.
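The stone-age pattern, for anyone unfamiliar (paths are illustrative; note the read-back will be ARC-assisted unless you cap the ARC as discussed earlier in the thread):

dd if=/dev/urandom of=/mnt/tank/ddtest.dat bs=1M count=20480
dd if=/mnt/tank/ddtest.dat of=/dev/null bs=1M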

fio can make things “more complicated” by running mixed I/O depths, and it can run mixed reads/writes in the same “instance” or job, where dd can’t. dd is full throttle all the time, which makes measuring its performance simpler. fio can be run in ways similar to what I am doing with dd, but I don’t see a compelling reason to switch.
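For comparison’s sake, a fio job that roughly mimics what dd does here: a single job issuing 1 MiB sequential writes at queue depth 1 through plain write() calls. This is a sketch, not TN-Bench’s method:

fio --name=seqwrite --directory=/mnt/tank --rw=write --bs=1M --size=20g \
    --numjobs=1 --ioengine=psync --iodepth=1 --end_fsync=1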

It’s a little unfair for you to criticize the software and make that sort of claim without sharing your results and explaining why you think the “drops” are “unrealistic”.

Also, the closest relative to TN-Bench is more like solnet-array-test, not fio. It also used dd in a similar way to do its individual-disk read tests. Many of us on the forums used it as the de facto performance test back in the day, and it was published over 10 years ago. The author claimed it had been in use since 1998, far predating TrueNAS.

Code:
sol.net disk array test v3

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: 2

Enter grep match pattern (e.g. ST150176): ST4

Selected disks: da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 44 lun 0 (da3,pass5)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 45 lun 0 (da4,pass6)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 46 lun 0 (da5,pass7)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 47 lun 0 (da6,pass8)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 48 lun 0 (da7,pass9)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 49 lun 0 (da8,pass10)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 50 lun 0 (da9,pass11)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 51 lun 0 (da10,pass12)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 52 lun 0 (da11,pass13)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 53 lun 0 (da12,pass14)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 54 lun 0 (da13,pass15)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 55 lun 0 (da14,pass16)
Is this correct? (y/N): y
Performing initial serial array read (baseline speeds)
Tue Oct 21 08:21:23 CDT 2014
Tue Oct 21 08:26:47 CDT 2014
Completed: initial serial array read (baseline speeds)

Array's average speed is 97.6883 MB/sec per disk

Disk  Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
da3  3815447MB  98  100
da4  3815447MB  90  92
da5  3815447MB  98  100
da6  3815447MB  97  99
da7  3815447MB  95  97
da8  3815447MB  82  84 --SLOW--
da9  3815447MB  87  89 --SLOW--
da10  3815447MB  84  86 --SLOW--
da11  3815447MB  97  99
da12  3815447MB  92  94
da13  3815447MB  102  104
da14  3815447MB  151  155 ++FAST++

Performing initial parallel array read
Tue Oct 21 08:26:47 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 74 MB/sec
This suggests that this pass may take around 860 minutes

  Serial Parall % of
Disk  Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3  3815447MB  98  86  88 --SLOW--
da4  3815447MB  90  74  82 --SLOW--
da5  3815447MB  98  82  84 --SLOW--
da6  3815447MB  97  91  95
da7  3815447MB  95  72  76 --SLOW--
da8  3815447MB  82  80  97
da9  3815447MB  87  84  96
da10  3815447MB  84  111  133 ++FAST++
da11  3815447MB  97  120  124 ++FAST++
da12  3815447MB  92  116  126 ++FAST++
da13  3815447MB  102  123  121 ++FAST++
da14  3815447MB  151  144  95

Awaiting completion: initial parallel array read
Tue Oct 21 08:39:32 CDT 2014
Completed: initial parallel array read

Disk's average time is 741 seconds per disk

Disk  Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da3  104857600000  743  100
da4  104857600000  764  103
da5  104857600000  752  101
da6  104857600000  737  99
da7  104857600000  748  101
da8  104857600000  754  102
da9  104857600000  738  100
da10  104857600000  762  103
da11  104857600000  748  101
da12  104857600000  756  102
da13  104857600000  740  100
da14  104857600000  653  88 ++FAST++

Performing initial parallel seek-stress array read
Tue Oct 21 08:39:32 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 58 MB/sec
This suggests that this pass may take around 1093 minutes

  Serial Parall % of
Disk  Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3  3815447MB  98  52  53
da4  3815447MB  90  48  53
da5  3815447MB  98  50  51
da6  3815447MB  97  50  52
da7  3815447MB  95  48  50
da8  3815447MB  82  48  59
da9  3815447MB  87  54  62
da10  3815447MB  84  47  56
da11  3815447MB  97  49  50
da12  3815447MB  92  50  55
da13  3815447MB  102  49  48
da14  3815447MB  151  52  34

Awaiting completion: initial parallel seek-stress array read

I am sorry. I admire every project, and I didn’t mean to hurt anyone. I was just trying to understand what you are benchmarking and why. Concerning history: the BogoMIPS in the Linux kernel from the ’90s and the UNIX “byte” benchmarks of the ’70s are still there as well. Nevertheless, it appears that no one currently uses them for comparison purposes.

To give you an idea of my results: the system in my signature did 850-900 MB/s sequential read and 250-320 MB/s sequential write on a pool configured with three vdevs (each a two-way mirror), as measured by TN-Bench. I highly doubt those figures, though; fio tests showed significantly higher numbers, and even a client VM writing to that pool with mixed loads over 10GbE NFS got higher throughput.

The culprits are those pseudo-devices and the syscalls dd makes for single-threaded copying (hence the name: data duplicator). The low I/O queue depth adds to the issue. In my opinion(!), you can’t measure full bandwidth with dd; it was never intended for that.
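What I mean, concretely (the path is illustrative): the same 1 GiB written with 4 KiB blocks versus 1 MiB blocks. At queue depth 1, the small-block run spends most of its time in write() syscall overhead rather than moving data:

dd if=/dev/zero of=/tmp/qd1test bs=4k count=262144
dd if=/dev/zero of=/tmp/qd1test bs=1M count=1024
rm /tmp/qd1test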

But then again … please, it’s just my opinion.
