Slower-than-expected performance on a single-vdev mirrored pool

I’m new to TrueNAS and have been reading about and experimenting with different pool configurations. I plan to have two pools - one for fast IOPS and one for media storage. The fast IOPS pool will store apps, VMs, databases, and SMB shares. It will start as a single vdev of two mirrored drives. I know this should initially give (close to) single-drive performance until I add more mirror vdevs later, but as far as I can tell I’m not seeing performance anywhere near a single drive.

CPU: Xeon E-2334
OS: NVMe
RAM: 64GB
Pool: (2) 6TB Exos 7E10 drives (ST6000NM019B) in mirror vdev

The rated max sustained transfer rate for these drives is 250MB/s, but the write speed reported by fio (shown below) is only 66.1MB/s.

The hard drives are new and empty while I work on performance tweaks. Ashift was automatically set to 12. The sector size is 512 bytes logical but 4096 bytes physical, so I’m not sure whether ashift should be 9 or 12 for these drives?

zpool get ashift TestFastIO
NAME PROPERTY VALUE SOURCE
TestFastIO ashift 12 local

fdisk -l /dev/sdg
Sector size (logical/physical): 512 bytes / 4096 bytes
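As a cross-check, lsblk can also show the logical/physical sector sizes (same device as above):

lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdg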

Dataset settings:
Sync: Standard
Compression: LZ4 (performance is worse with no compression)
ATime: Off
RecordSize: 128KB (performance is worse with 1M recordsize?)
Dedupe: Off
Checksum: On
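(For reference, all of these can be read back from the shell in one go - I’m assuming the settings live on the pool-root dataset here:)

zfs get sync,compression,atime,recordsize,dedup,checksum TestFastIO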

I used the following fio options, run over SSH on the TrueNAS box:
fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

Run status group 0 (all jobs):
  WRITE: bw=63.1MiB/s (66.1MB/s), 63.1MiB/s-63.1MiB/s (66.1MB/s-66.1MB/s), io=3965MiB (4158MB), run=62860-62860msec

The fio options are admittedly a “torture test” configuration, as described on Ars Technica. But this performance still seems slow, unless I’m misinterpreting the results? (How fast are your disks? Find out the open source way, with fio | Ars Technica)

I’ve tried changing the recordsize to 1M, and (in a separate test) disabling compression. Both tests resulted in worse numbers, which was a bit of a surprise.

Am I misinterpreting the results?
Or maybe not running the best fio test for tweaking IOPS?
Is it ok to run fio from the TrueNAS console?
Otherwise is there anything I should change to improve performance?

Thanks!

Hey @Dean_Wilson - welcome to the forums. :slight_smile:

You changed your recordsize, but did you change the parameters of the fio test?

Change your block size to bs=128K or even bs=1M, and increase your queue depth and job count. At present you’re asking an HDD to do a single stream of unqueued 4K I/O - even with ZFS in the path, the laws of physics are likely your bottleneck here, because spinning rust and random I/O don’t mix well. :wink:

Since your Xeon is a quad-core, try:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=128k --size=4g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

Small 2-drive mirror system:

Run status group 0 (all jobs):
  WRITE: bw=125MiB/s (131MB/s), 31.0MiB/s-31.4MiB/s (32.5MB/s-32.9MB/s), io=8149MiB (8544MB), run=65344-65346msec

8-wide Z2:

Run status group 0 (all jobs):
  WRITE: bw=498MiB/s (522MB/s), 123MiB/s-126MiB/s (129MB/s-132MB/s), io=35.8GiB (38.4GB), run=73542-73553msec

4-wide Z1 SSD:

Run status group 0 (all jobs):
  WRITE: bw=809MiB/s (848MB/s), 193MiB/s-216MiB/s (202MB/s-226MB/s), io=51.5GiB (55.3GB), run=65206-65206msec

“Max sustained” typically means the optimal, perfect-case sequential rate, not random I/O. You will not get max, or anywhere close, with random I/O. As for an I/O queue depth of 1, I’m not sure why that would be desirable - it will slow down your test even more. A larger recordsize typically means better performance (most of the time), but not for all use cases.
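If you want a number to compare against that 250MB/s rating, a sequential test is much closer to what the spec sheet measures - roughly something like this (still synthetic, so treat it as a ballpark, not gospel):

fio --name=seq-write --ioengine=posixaio --rw=write --bs=1M --size=4g --numjobs=1 --iodepth=8 --runtime=60 --time_based --end_fsync=1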

Non-rust (flash) is better for VMs.

Thank you both for the replies! I may substitute SSDs for my rust drives eventually, but until then I’ll have to get the most I can out of the rust by optimizing the pool to be as efficient as possible.

You both mentioned that the parameters in the test should be changed. At first glance it seems strange to change the test to make the pool look better. But I guess the point isn’t to find either the most grueling test, or the most favorable conditions, but instead to use fio for a reasonable estimate of how the file system will be used.

Unfortunately, that is where I’m a little lost. The article I’d read used random I/O with a queue depth of 1 in an admittedly tough test. But maybe it isn’t the best real-world example. Would HoneyBadger’s suggestion be more realistic for mixed SMB, VM, and app use, then?

I know that’s vague, but the pool is somewhat multipurpose due to budget constraints, and as such the use case is also wide-ranging. With a quad-core CPU, the majority of use being SMB, followed by apps/VMs and then light database use, would the ideal test use bs=1M, numjobs=4, iodepth=8?

Thanks again for all your help!

For SMB, there are a few scenarios…

  1. copying lots of large files… to and from the server
  2. copying lots of little files… to and from the server
  3. modifying bits of large files… on the server.

Etc.

And then there is also the VM/Apps case… which stresses block-level access… possibly random I/O.

It’s better to use a flash pool for your VMs… and then the rust pool for the large file shares…

A rust pool can have its small files and metadata offloaded to a separate metadata special vdev (there are caveats), which can get you the best of both worlds in a way.

In fact… it may make sense to combine the rust pool with the flash pool as above… and then your VM data would be stored on the metadata vdevs.
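Roughly, at the CLI it would look something like this - pool, dataset, and device names are just placeholders, and in TrueNAS you’d normally do this through the UI when building the pool:

zpool add tank special mirror sda sdb sdc      # 3-way mirrored special (metadata) vdev
zfs set special_small_blocks=512K tank/shares  # blocks <= 512K (plus all metadata) land on the special vdev
# keep special_small_blocks below the dataset recordsize, or all data ends up on the special vdev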

I’m not going to tell you what fio commands to run, but the issue is that hard drives suck at random I/O. They physically have to move the head to the right track, and then wait for the sector they want to spin under the heads…

So, in the modern world, you want to use flash for random I/O.

OK, so the reason for the change is this - you said “slower than expected” performance, but used unrealistic fio tests and tried to compare to optimal sequential numbers (which is what you expected). That’s why the change in params: you were not simulating the conditions under which you could get the optimal rated speed you wanted to compare to.

99.9% of the time, when someone uses fio and says “slower than expected,” it’s the test and the fio params. It’s not changing the test to make SCALE look good; it’s changing the test to try and match the optimal performance you defined as the rated speed. ZFS wouldn’t use a queue depth of 1 on your system, so why use it in fio, etc. The internet is full of “experts” who craft tests and then don’t explain precisely what they’re for, or worse, don’t know what they are doing.

The pool will perform exactly as expected. If you want to know how your pool will perform, read this:

No need to test.


Thanks again for all the helpful replies! A few weeks ago I didn’t know anything about ZFS. Then I read tons of documentation and articles and thought I had a handle on my options. But reading your suggestions as well as the guide listed above has made me re-evaluate my storage plan.

I was initially planning on the following pools:

  • FastIO: 2-wide mirror with 6TB drives
  • Media: 5-wide RaidZ2 with 20TB drives

But every response I’ve gotten has indicated that I need to be using SSDs for any pool that requires fast IOPS. I hadn’t gone that route because I’ve already blown way past my budget. One option would be to drop my “FastIO” storage from 6TB to 4TB and replace the rust drives with two 4TB 870 EVO SSDs, but that would cost $700, which is a hard sell on my already-blown budget.

But I might be able to follow @Stux’s suggestion and use a fusion pool with metadata SSDs, like this:

ComprehensivePool (1M recordsize):

  • 5-wide RaidZ2 20TB drives
  • 3-wide mirror of 500GB 870 EVO drives (metadata special vdev with a 512K small-block cutoff)

Theoretically, this should provide a good balance between fast IO for small files, good storage efficiency, and good fault tolerance.

The first thing everyone mentions when considering metadata SSDs is that if the metadata vdev dies, the pool is lost. But using the RAID reliability tool (linked below), I see that a 3-wide mirror is more resilient than the RaidZ2 vdev, so I think it should be safe. And as far as I can tell, 500GB is more than I need, but they’re only $55 each on Amazon right now; $150 is much easier to swallow than $700+. (This approach also has the added benefit of allowing me to use quotas to carve 6TB out of the ComprehensivePool, rather than reducing the FastIO pool to 4TB.)
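For the quota piece, I’m assuming a single property on a dataset would do it (dataset name made up):

zfs set quota=6T ComprehensivePool/fastio    # cap this dataset at 6TB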

As @sfatula pointed out, I can rely on documented pool performance without running fio tests. But the tests (and your assistance!) are helping me better understand how everything works, which will hopefully result in a well-thought-out file system that I don’t regret later. :wink:

With that in mind, I used a couple of SSDs I had at home to set up the following pool:

TestPool (1M recordsize):

  • 4-wide RaidZ2 20TB drives (this should have been 5, but one of my newly purchased drives was bad and is being replaced!)
  • (optional) 2-wide mirror of 128GB Lite-On SSDs (metadata special vdev with a 512K small-block cutoff)

I ran the fio tests (below) against TestPool, first without the metadata vdev and then again with it. I was surprised that the results with metadata were a mixed bag. I’m guessing this is the result of one of two things:

  1. The SSDs are old and slow, with 512-byte sectors (TestPool has ashift=12). I’m sure this results in a sub-optimal configuration. (I’ve noted a quick sector-size check after this list.)
  2. I may still be running the fio tests poorly. I ran both read and write, with both 128k and 1M block sizes. (I know that with a 1M recordsize the 128k tests will also be suboptimal, but so will the real-world use cases, so I thought it might still be a useful comparison.) I also bumped the size to 16g, as I’d read that larger file sizes help keep the OS and/or TrueNAS from caching the whole test and masking the results.
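Here’s the sector-size check mentioned in the first guess, in case it matters (device name is a placeholder):

smartctl -i /dev/sdX    # the output includes the drive’s logical/physical sector sizes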

Tests:

Random Read 128K
fio --name=random-read128k --ioengine=posixaio --rw=randread --bs=128k --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

Random Read 1M
fio --name=random-read1M --ioengine=posixaio --rw=randread --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

Random Write 128K
fio --name=random-write128 --ioengine=posixaio --rw=randwrite --bs=128k --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

Random Write 1M
fio --name=random-write1M --ioengine=posixaio --rw=randwrite --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

Test       | No Metadata                   | With Metadata
Read 128k  | READ: bw=1624MiB/s (1703MB/s) | READ: bw=33.3MiB/s (34.9MB/s)
Read 1M    | READ: bw=3974MiB/s (4167MB/s) | READ: bw=6995MiB/s (7335MB/s)
Write 128k | WRITE: bw=202MiB/s (212MB/s)  | WRITE: bw=38.8MiB/s (40.6MB/s)
Write 1M   | WRITE: bw=426MiB/s (446MB/s)  | WRITE: bw=38.8MiB/s (40.6MB/s)

For the most part, the results with metadata are significantly worse. Is this likely due to poorly performing SSDs or should I tweak fio for testing a pool with a metadata vdev?

Alternatively, I can just go ahead and buy the three SSDs, but I wanted to run the idea by the forum first for any opinions.

For reference:

Thanks again for your continued help!