As I previously commented, this statement really depends on what you are trying to measure.
When you use dd to read a large file, you have only a single thread requesting data, and it does so sequentially. This triggers sequential read-ahead, which reads a few blocks ahead of the one last requested. Throughput on a single-disk vDev will be limited by the slower of the disk's throughput and the single-core dd process itself (most likely the disk). In a mirror, however, you need to know how these reads take advantage of the mirror copies to run in parallel: will sequential read-ahead alternate between the disks or not?
The reason I suggested multiple read streams is to avoid needing to know the answers to those questions. With two or more dd streams running in parallel, you are much more likely to make disk read throughput the constraint and keep the disks at 100% utilisation, because the dd instances will run in parallel on multiple cores.
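For example, something along these lines would give you two independent sequential read streams against two separate files (the paths are purely illustrative):

```
# Two sequential readers in parallel, each against its own copy of a large file.
dd if=/mnt/hdd-pool/disktest/copy1.bin of=/dev/null bs=1M status=progress &
dd if=/mnt/hdd-pool/disktest/copy2.bin of=/dev/null bs=1M status=progress &
wait
```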
When you use dd to write a large file, then all writes are made to both drives, so parallelism is less important to get both drives running at 100%.
In both cases, I think you should look at the disk stats (utilisation, throughput) rather than the dd stats.
As I have said previously, my input is based on previous general performance testing experience and not on specific performance testing of Debian or TrueNAS.
But if you are currently running a single dd command reading a single file from a shell, then all you need to do is copy the file without block cloning so that you have a second copy, and then run a dd command against each of the two files from two shell windows.
You also need to use the Disk I/O reports to get the throughput measurements rather than relying on the dd throughput. If you start with one dd command and keep increasing the number of parallel dd commands until the graph stops increasing, you will likely have found the maximum throughput.
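As a rough sketch of the whole procedure (pool, dataset and file names are just examples): use dd rather than cp to make the copy, since cp on a recent ZFS may use block cloning, and a cloned copy would share the same on-disk blocks as the original and defeat the test.

```
# Rewrite the data with dd to get a genuinely separate on-disk copy (no block cloning).
dd if=/mnt/hdd-pool/media/bigfile.bin of=/mnt/hdd-pool/media/bigfile-copy.bin bs=1M

# Ideally export/import the pool or reboot first so neither file is still sitting in ARC.

# Then read both files in parallel (from two shells, or with '&' as below) and watch
# the Disk I/O reports / zpool iostat, adding more readers until throughput stops rising.
dd if=/mnt/hdd-pool/media/bigfile.bin of=/dev/null bs=1M &
dd if=/mnt/hdd-pool/media/bigfile-copy.bin of=/dev/null bs=1M &
wait
```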
Sorry… but I'm not following what you mean by "without block cloning" and running against the two files from two shells. If you can provide some clear steps it would help a lot!
Is this test representative of how TrueNAS would behave with a single SMB user requesting to read a large 195 GB file? Or is it testing how multiple users reading different files would scale (not my use case)?
@nvs if you are not so comfortable using shell commands, you can also use AJA System Test on a client. I use it a lot. It is important to disable compression, as AJA writes compressible data. Or, for fully real-world tests, write and read video files using DaVinci Resolve. Observe what your disks read and write with zpool iostat -v 1. Or you can start a scrub and watch the read speed with zpool iostat.
Another hint, and I really don't mean it in an offensive way: a lot of your questions on how to use the commands for benchmarking could be answered by ChatGPT. I use it a lot. It is just much faster than keeping a cheatsheet or googling, Stack Exchange, etc.
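For reference, the shell side of that is only a couple of commands (pool/dataset names are examples):

```
# Disable compression on the dataset AJA writes to, since AJA's test data is compressible.
zfs set compression=off hdd-pool/disktest

# Watch per-disk read/write throughput once per second while the test runs.
zpool iostat -v hdd-pool 1

# Or start a scrub (which reads all allocated data) and watch the read speed the same way.
zpool scrub hdd-pool
```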
No problem using the shell here, but being two weeks into testing I want to be absolutely clear that I am applying the correct steps/commands, so that my results are in line with other users' tests, are useful, and I don't have to do it all again. As said, I will only run this fio test tomorrow, then take the machine back into normal operation, and that will likely be the end of this adventure for me. So tomorrow is the last opportunity to do this correctly on this system.
I appreciate the hint on ChatGPT, but I have seen it produce incorrect output in the past (actually quite often) when it comes to things like this. And at the stage where I am, I want to be absolutely sure we are on the same page and running identical tests with the correct commands. I hope you get my point.
I just have a comment: why are you using numjobs=1 for fio? When iX says read performance will scale with vdevs, do you think they mean with only one process? I would think they mean a normally operating system with many processes doing many things, not one job.
I have not read the whole thread. My other comment is just to make sure that when you add a vdev you rebalance the pool, as otherwise everything is still on the one vdev that was there before the addition.
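One common way to rebalance is simply to rewrite the data after the new vdev is added, for example by replicating the dataset and swapping it in. A rough sketch only, with example dataset names; check free space and existing snapshots before trying this on a real pool:

```
# Replicate the dataset so its blocks are rewritten and spread across all vdevs.
zfs snapshot -r hdd-pool/data@rebalance
zfs send -R hdd-pool/data@rebalance | zfs receive hdd-pool/data-rebalanced

# After verifying the new copy, retire the old dataset and rename the replacement.
zfs destroy -r hdd-pool/data
zfs rename hdd-pool/data-rebalanced hdd-pool/data
```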
Hi @nvs or @simonj, can you please test the read speed with prefetch on and off? I remember a very old thread in another forum (I can't find it any more) where it had a huge influence on the read speed of a 2 vdev 4 drive mirror. I'm not a command line expert.
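For whoever runs it: as far as I know, prefetch can be toggled at runtime on Linux via the zfs_prefetch_disable module parameter (a sketch, run as root; please double-check on your system):

```
# Disable ZFS file-level prefetch (1 = disabled, 0 = enabled, which is the default).
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

# ...run the read test here...

# Re-enable prefetch afterwards.
echo 0 > /sys/module/zfs/parameters/zfs_prefetch_disable
```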
I am explicit about where the test files are put: different directories for different parallel jobs, and no filename, so that it can create 5 files for 5 jobs. You need to change the pool and directory names and create the directories it needs before running this.
EDIT: On reflection, /mnt/hdd-pool/disktest should probably be a separate dataset so that you can set dataset parameters such as compression to off.
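For reference, the shape of the command I am describing is roughly as follows; this is a sketch with example pool, dataset and size values, not the exact command above:

```
# Separate dataset with compression off so fio's data is not compressed away.
zfs create -o compression=off hdd-pool/disktest
mkdir -p /mnt/hdd-pool/disktest/job1 /mnt/hdd-pool/disktest/job2 /mnt/hdd-pool/disktest/job3 \
         /mnt/hdd-pool/disktest/job4 /mnt/hdd-pool/disktest/job5

# Five parallel sequential readers, distributed across the directories
# (no filename given, so fio creates one file per job).
fio --name=seqread --rw=read --bs=1M --size=20G --numjobs=5 --group_reporting \
    --directory=/mnt/hdd-pool/disktest/job1:/mnt/hdd-pool/disktest/job2:/mnt/hdd-pool/disktest/job3:/mnt/hdd-pool/disktest/job4:/mnt/hdd-pool/disktest/job5
```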
Thanks! Just for total clarity: these fio commands I should now run one after the other (not in parallel as suggested before), correct? And while each is running, towards the end of each fio test, note down the output of the iostat command.
I don't think prefetching makes sense with parallel jobs, but I'm not sure. @Protopia can shed some light on this by showing his test results. My expectation is that prefetching is efficient with a single job, but that with multiple jobs performance drops.
The numjobs=5 runs 5 tasks in parallel for you. So yes - in parallel as suggested before.
It is easy to do a matrix of runs, so I did it with prefetch as well. It seems to give a proportionately similar boost with 5 jobs as it did with 1. I am assuming that 5 jobs has worse performance because of seeks.
But I have no idea why prefetch is still beneficial when I do this.
Reviewing the netdata stats for ZFS ARC, there were a LOT of ARC hits during this period - either c. 50% or c. 75% depending on which graph you look at. Whilst this is much lower than the 99%+ I get in normal usage, it could simply be the metadata being reused repeatedly. However, I am not fully convinced that setting primarycache=metadata cleared the existing ARC cache or prevented new data from being cached.
EDIT: Later graphs separated out metadata and data ARC rates, and these clearly showed that data caching was off; you could also see when prefetch was on. So I now think the prefetch/caching commands worked as expected and that the measurements were therefore done correctly.
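For anyone repeating this, the commands involved are roughly the following (dataset name is an example):

```
# Cache only metadata in ARC for the test dataset, so data reads must come from the disks.
zfs set primarycache=metadata hdd-pool/disktest

# Setting the property does not evict data that is already cached; exporting and
# re-importing the pool (or rebooting) clears the ARC before a test run.
zpool export hdd-pool
zpool import hdd-pool

# Watch ARC hit/miss rates while the test runs (arcstat ships with OpenZFS).
arcstat 1
```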
But imagine a production environment with many things going on and multiple mirror vdevs (say 3), with a 1M block size. Imagine files that fit in one block, so no streaming. Some files will come from vdev 1, some from vdev 2, some from vdev 3; the reads do not all start from vdev 1, of course. And within each mirror, the least busy drive (it's not round robin) will handle the read. In that case you get far more performance benefit (measured) than from, say, a single simple media stream. It all depends on what exactly you are measuring. As for the comment about multiple vdevs giving twice the performance: you may or may not get twice, and I doubt it, but depending on settings, workload, etc., you could.
These tests here are more believable in my book, and you have already mentioned it:
He didn't get 2x, but again, it depends on the workload.
Prefetch is of course on by default, but does not prefetch everything, there is logic to it. I would leave it on in most cases.
If you are taking the iX document literally, then that is not what is going to happen, at least with the tests you are doing. Real-world tests show what you are seeing. I don't believe you have found anything wrong at all, just what is expected.
This is not a general performance issue we are investigating - it is a very specific issue with mirrored drives, where a mirror should perform reads better than a single drive and it doesn't.
To prove that there is an issue, we are trying to get to the simplest situation in which we can demonstrate it, and that is a single vDev consisting of either one drive or a mirror of 2 or more drives.
Please let's stay focused on proving this so that we can hand it off to a support team who can reproduce the issue and will then take it on to diagnose the cause and create a fix.
P.S. My own performance measurements were provided only incidentally - I don't have a mirror on my system, nor any spare drives, so I cannot test the specific issue. I was simply providing a script, and someone then asked to see my measurements.
Ok, I read an awful lot into what I was commenting on here, but I now know that you are on it. I do have a single mirrored vdev, so if you end up with any commands you want compared against one of the contributors having the issue, let me know; I'd be happy to test for comparison purposes.
Hi @sfatula, we had a long discussion here, but unfortunately we spent a lot of time with someone who has no way to check it on real hardware. Anyhow, he seems convinced that the problem exists.
We are discussing the read speed of a 2 vdev 4 drive mirror layout, and of mirrors generally. Please help!
From your posts, you are able to check this on real hardware, and additionally you do not believe the problem exists. So if you can reproduce it, we are one step further.