Bad SMB write/read performance with 4 drives in 2x mirror configuration (RAID10)

I know we have been discussing this, but the problem already occurs with a 1 vDev, 2-drive mirror configuration, and I really do think we should concentrate on the simplest configuration where the problem exists in order to maximise the chance of persuading the TrueNAS and/or ZFS and/or Debian and/or Linux kernel folks that there is a real problem.

Version: TrueNAS CORE 13.3
System: “High-spec” system (Ryzen 9 5950X, 128GB RAM)
Data drives: 4x 20TB Exos X20 (CMR, SATA 6 Gbit/s)
OS drive: 1x 256 GB SSD (SATA 6 Gbit/s)
Compression: Dataset/pool compression turned OFF
Dataset share type: SMB

Commands used:

zfs set primarycache=metadata nas_data1
zfs set prefetch=none nas_data1
mkdir /mnt/nas_data1/benchmark_test_pool/1
mkdir /mnt/nas_data1/benchmark_test_pool/5
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/1 --rw=read --bs=1M --size=50G --numjobs=1 --time_based --runtime=60
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/5 --rw=read --bs=1M --size=10G --numjobs=5 --time_based --runtime=60
zfs set prefetch=all nas_data1
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/1 --rw=read --bs=1M --size=50G --numjobs=1 --time_based --runtime=60
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/5 --rw=read --bs=1M --size=10G --numjobs=5 --time_based --runtime=60
zfs set primarycache=all nas_data1
rm -r /mnt/nas_data1/benchmark_test_pool/*

Test results

"2 disk MIRROR" config (aka 1 vdev with 2x drives in that vdev):
1 job with no prefetch: READ: bw=262MiB/s (275MB/s), 262MiB/s-262MiB/s (275MB/s-275MB/s), io=15.3GiB (16.5GB), run=60001-60001msec
5 jobs with no prefetch: READ: bw=234MiB/s (246MB/s), 38.1MiB/s-53.0MiB/s (40.0MB/s-55.6MB/s), io=13.7GiB (14.7GB), run=60001-60019msec

1 job with prefetch: READ: bw=278MiB/s (291MB/s), 278MiB/s-278MiB/s (291MB/s-291MB/s), io=16.3GiB (17.5GB), run=60001-60001msec
5 jobs with prefetch: READ: bw=391MiB/s (410MB/s), 69.1MiB/s-85.0MiB/s (72.4MB/s-89.1MB/s), io=22.9GiB (24.6GB), run=60010-60090msec


"RAID 10" config (aka 2 vdevs with 2x drives per vdev, aka striped mirrors):
1 job no prefetch: READ: bw=504MiB/s (529MB/s), 504MiB/s-504MiB/s (529MB/s-529MB/s), io=29.6GiB (31.7GB), run=60001-60001msec
5 jobs no prefetch: READ: bw=378MiB/s (397MB/s), 70.6MiB/s-80.1MiB/s (74.1MB/s-84.0MB/s), io=22.2GiB (23.8GB), run=60001-60016msec

1 job with prefetch: READ: bw=542MiB/s (568MB/s), 542MiB/s-542MiB/s (568MB/s-568MB/s), io=31.8GiB (34.1GB), run=60001-60001msec
5 jobs with prefetch: READ: bw=718MiB/s (753MB/s), 137MiB/s-152MiB/s (143MB/s-159MB/s), io=42.1GiB (45.2GB), run=60003-60047msec

The thread has a name, but I see no real reason to go away from it. If you have the possibility of doing a real test with the simplest configuration you mentioned, we can discuss it here with participants doing real tests. If not, it would be helpful if we stayed with the original. Your input is, for sure, appreciated if it helps to isolate the problem without raising the complexity or repeating tests more than necessary with a new topology. Most of the participants have spent a lot of time showing what’s going on. It’s not for entertainment; it’s to find a solution.

Sorry @nvs, I could not find the initial test with the default settings with 1 job for comparison.

With multiple jobs we see much higher bandwidth. Could prefetching, and how it works, be the key?

@nvs

This is pretty much what we would expect to see when you add a 2nd vDev, i.e. that you get double the throughput, though there is one oddity…

The 5-job no-prefetch result doesn’t double when you add the 2nd vDev, and it actually ends up below the 1-job figure. I am not sure this is explained by the additional seeks - it’s very odd TBH. But with prefetch the numbers jump up again, because the prefetch is using the bandwidth that went unused when the numbers were unusually low without it.

With hindsight, for the 2x vDev measurements I would have liked to also see an additional 10 jobs @ 10GB data measurement, which would be equivalent to the same I/O per vDev.

Unfortunately what we don’t have measurements for is a single vDev single disk, so we cannot compare what happens when you make a single disk into a mirror.
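
If it would help, that comparison could be run on a throw-away pool without touching the main one. A minimal sketch, assuming hypothetical spare devices sdX/sdY and a hypothetical pool name test_single:

zpool create test_single /dev/sdX             # 1 vDev, single disk
# ...run the fio tests above against /mnt/test_single, then...
zpool attach test_single /dev/sdX /dev/sdY    # same vDev, now a 2-way mirror (wait for the resilver to finish)
# ...re-run the same fio tests, then...
zpool destroy test_single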

The thread has a name, but as I have explained there is a real reason to go away from it. This subject has already morphed from a problem copying data across a network from Windows using SMB to a ZFS-based disk-oriented problem. Those who have put the time and effort into running benchmarks have demonstrated that there is a problem, and if we want it to be fixed we need to narrow down the cause. And that is why I am suggesting sticking to a single vDev and comparing single drive and mirrors.

As I have already stated, I was asked how to run multiple jobs, so I created and tested a script to do this. Then someone wanted to see my results, and when I did so I made it clear that those results were NOT relevant to the subject of this post because it wasn’t a mirror. I am absolutely NOT trying to introduce a new topology - I am suggesting that we narrow it down to the simplest possible topology in which we can reproduce the problem.

My script runs the same test with 1 job and then runs a similar test with 5 jobs. And it does these two tests first with prefetch off and then with prefetch on.
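
A minimal sketch of that sequence, assuming the same pool name and test directories as in the command list above (not necessarily the exact script), looks like this:

#!/bin/bash
# Sketch of the benchmark sequence described above: 1 job then 5 jobs, first with prefetch off, then on.
# Assumes dataset nas_data1 and test directories .../1 and .../5 already exist.
POOL=nas_data1
DIR=/mnt/${POOL}/benchmark_test_pool
zfs set primarycache=metadata ${POOL}          # keep file data out of the ARC so reads hit the disks
for PF in none all; do                         # first without prefetch, then with prefetch
    zfs set prefetch=${PF} ${POOL}
    fio --name TESTSeqWriteRead --eta-newline=5s --directory=${DIR}/1 --rw=read --bs=1M --size=50G --numjobs=1 --time_based --runtime=60
    fio --name TESTSeqWriteRead --eta-newline=5s --directory=${DIR}/5 --rw=read --bs=1M --size=10G --numjobs=5 --time_based --runtime=60
done
zfs set primarycache=all ${POOL}               # restore the default caching behaviour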

I would like to speculate a little: if the parallel tasks are, in reality, not really parallel (in distributed portions), then it’s possible that a later task reads something from cache and only a portion from disk, and so on. That would suggest higher disk read performance. Unfortunately, it would be an illusion.

Then multiple reads add an additional level of complexity.
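
One way to test that speculation, as a suggestion only, would be to watch the ARC and the physical disks while the 5-job run is in progress:

arcstat 5                      # ARC hit/miss rates every 5 seconds
zpool iostat -v nas_data1 5    # per-vDev and per-disk read bandwidth from the pool

If the per-disk read rates roughly add up to the fio bandwidth, the jobs really are being served from disk; if the ARC hit rate climbs instead, part of the reported throughput would indeed be the illusion described above.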

No doubt, your script is okay. It was only about what TrueNAS is doing by default, after install.

My “high-spec” machine is now back in normal operation again. I have now done the same test as previously on my “low-spec” system as well. In case this is interesting for anyone, see below.
Note that this is now on TrueNAS SCALE AND a less powerful machine, so lower speeds may be related to the HW and/or moving to SCALE (even though I wouldn’t expect a big difference normally).

Version: TrueNAS SCALE (Dragonfish-24.04.2.2)
System: “Low-spec” system (Intel Xeon CPU E5620 @ 2.40GHz, 64GB RAM, ECC)
Data drives: 4x 20TB Exos X20 (CMR, SATA 3 Gbit/s)
OS drive: 1x 256 GB SSD (SATA 3 Gbit/s)
Compression: Dataset/pool compression turned OFF
Dataset preset: SMB

Commands used (same as before):

sudo zfs set primarycache=metadata nas_data1
sudo zfs set prefetch=none nas_data1
mkdir /mnt/nas_data1/benchmark_test_pool/1
mkdir /mnt/nas_data1/benchmark_test_pool/5
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/1 --rw=read --bs=1M --size=50G --numjobs=1 --time_based --runtime=60
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/5 --rw=read --bs=1M --size=10G --numjobs=5 --time_based --runtime=60
sudo zfs set prefetch=all nas_data1
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/1 --rw=read --bs=1M --size=50G --numjobs=1 --time_based --runtime=60
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/5 --rw=read --bs=1M --size=10G --numjobs=5 --time_based --runtime=60
sudo zfs set primarycache=all nas_data1
rm -r /mnt/nas_data1/benchmark_test_pool/*

Test results

"2 disk MIRROR" config (aka 1 vdev with 2x drives in that vdev):
1 job with no prefetch: READ: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=6557MiB (6876MB), run=60003-60003msec
5 jobs with no prefetch: READ: bw=195MiB/s (205MB/s), 35.3MiB/s-40.7MiB/s (37.0MB/s-42.6MB/s), io=11.4GiB (12.3GB), run=60004-60024msec

1 job with prefetch: READ: bw=278MiB/s (292MB/s), 278MiB/s-278MiB/s (292MB/s-292MB/s), io=16.3GiB (17.5GB), run=60001-60001msec
5 jobs with prefetch: READ: bw=354MiB/s (372MB/s), 20.2MiB/s-111MiB/s (21.2MB/s-117MB/s), io=20.9GiB (22.4GB), run=60002-60339msec


"RAID 10" config (aka 2 vdevs with 2x drives per vdev, aka striped mirrors):
1 job no prefetch: READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=7699MiB (8073MB), run=60005-60005msec
5 jobs no prefetch: READ: bw=304MiB/s (318MB/s), 58.9MiB/s-62.0MiB/s (61.8MB/s-65.0MB/s), io=17.8GiB (19.1GB), run=60010-60043msec

1 job with prefetch: READ: bw=532MiB/s (557MB/s), 532MiB/s-532MiB/s (557MB/s-557MB/s), io=31.2GiB (33.4GB), run=60001-60001msec
5 jobs with prefetch: READ: bw=620MiB/s (650MB/s), 122MiB/s-127MiB/s (128MB/s-133MB/s), io=36.4GiB (39.0GB), run=60008-60093msec


And, as still requested by @Pretoria, the "Single disk" config (aka 1 vdev with 1x drive in it):
1 job no prefetch: READ: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=6412MiB (6723MB), run=60001-60001msec
5 jobs no prefetch: READ: bw=155MiB/s (163MB/s), 29.9MiB/s-31.3MiB/s (31.3MB/s-32.9MB/s), io=9310MiB (9762MB), run=60004-60021msec

1 job with prefetch: READ: bw=255MiB/s (267MB/s), 255MiB/s-255MiB/s (267MB/s-267MB/s), io=14.9GiB (16.0GB), run=60001-60001msec
5 jobs with prefetch: READ: bw=244MiB/s (256MB/s), 48.2MiB/s-49.3MiB/s (50.5MB/s-51.7MB/s), io=14.3GiB (15.4GB), run=60003-60150msec
1 Like

Thanks.

Comparing 1 vDev 1 disk with 1 vDev 2-disk mirror we can clearly see that we are NOT getting twice the throughput with a mirror under SCALE (and we need to see the same on CORE so we can compare the two):

1 disk,  no prefetch, 1 job: 107MiB/s
2 disks, no prefetch, 1 job: 109MiB/s - only +2% 

Not much more throughput - probably constrained by one job

1 disk,  no prefetch, 5 jobs: 155MiB/s
2 disks, no prefetch, 5 jobs: 195MiB/s - only +25%

I would have expected +100%, not just +25%. It is possible that we might need more than just 5 jobs, but with only 2 disks I would be surprised if 5 jobs didn’t keep them busy.

However, I think this and the equivalent measurements with prefetch should be enough to demonstrate that there is a problem with mirrors - and IMO most likely in ZFS.

1 disk,  prefetch, 1 job: 255MiB/s
2 disks, prefetch, 1 job: 278MiB/s - only +9%!!

Probably still constrained by a single job.

1 disk,  prefetch, 5 job: 244MiB/s
2 disks, prefetch, 5 job: 354MiB/s - +45% 

Again, I would have expected +100%. Again perhaps 5 jobs is not enough, but I would still be surprised.

Comparing 5 jobs with and without prefetch:

1 disk,  no prefetch, 5 jobs: 155MiB/s
1 disk,  prefetch, 5 job: 244MiB/s
2 disks, no prefetch, 5 jobs: 195MiB/s
2 disks, prefetch, 5 job: 354MiB/s

These show that we might not have been maxing out the disk bandwidth without pre-fetch - though I guess it is possible that pre-fetch I/Os are simply a lot more efficient than normal I/Os (because ZFS can do clever stuff for pre-fetches). We should perhaps try 10 jobs and see if that gets more throughput; however, I am sceptical that 10 jobs will do that much better.
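
For example, the 10-job run could look like this (the 5GB-per-job size is my assumption, chosen to keep the total data at 50GB as in the other tests):

mkdir /mnt/nas_data1/benchmark_test_pool/10
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/10 --rw=read --bs=1M --size=5G --numjobs=10 --time_based --runtime=60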

2 Likes
"2 disk MIRROR" config (aka 1 vdev with 2x drives in that vdev):
1 job with no prefetch: READ: bw=109MiB/s (115MB/s), 109MiB/s-109MiB/s (115MB/s-115MB/s), io=6557MiB (6876MB), run=60003-60003msec
"RAID 10" config (aka 2 vdevs with 2x drives per vdev, aka striped mirrors):
1 job no prefetch: READ: bw=128MiB/s (135MB/s), 128MiB/s-128MiB/s (135MB/s-135MB/s), io=7699MiB (8073MB), run=60005-60005msec
And, as still requested by @Pretoria, the "Single disk" config (aka 1 vdev with 1x drive in it):
1 job no prefetch: READ: bw=107MiB/s (112MB/s), 107MiB/s-107MiB/s (112MB/s-112MB/s), io=6412MiB (6723MB), run=60001-60001msec

If we compare this data, we see very similar read speeds, far away from the theoretical n-times read gain. Thanks, @nvs. This shows the problem we are discussing here, and it is not masked by prefetch, caching and so on, because what you have not yet read can’t be cached.
Now prefetching comes into play.
And again, how does it work? Why is it not using the full possible BW with a single read request?
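
On SCALE (Linux OpenZFS) one place to look for an answer, as a suggestion, is the prefetcher (“zfetch”) statistics and tunables, assuming the standard /proc/spl and /sys/module paths:

cat /proc/spl/kstat/zfs/zfetchstats                  # hits/misses/streams of the ZFS prefetcher
cat /sys/module/zfs/parameters/zfetch_max_distance   # how far ahead (in bytes) a single stream may prefetch

A single sequential reader gets a single prefetch stream, so the per-stream distance limit may be part of why one read request cannot use the full bandwidth of the vDev - but that is speculation on my part.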

1 Like

Isn’t that what I just said for mirror vs. non-mirror?

If we compare 1 vDev with 2 vDevs for 5 jobs no prefetch:

1 vDev,  mirrored, no prefetch, 5 jobs: 195MiB/s
2 vDevs, mirrored, no prefetch, 5 jobs: 304MiB/s - +56%

So we are seeing more I/Os with 2 vDevs but not 100% more as we might expect. This may be due to insufficient jobs to max out the bandwidth.

But also, we should probably have 10 jobs @ 10GB for the 2 vDevs vs. 5 jobs @ 10GB for 1 vDev in order to scale the data and the jobs with the number of vDevs.

Note: We are also assuming that there is no significant contention for other resources due to having more disks i.e. CPU or PCIe lanes or other internal buses. I don’t think this is the case, but it will depend on the specific hardware used.

Right, and with 1 job we are far away from what the HW is capable of.

Looking at the fio parameters again, we are using 1MB block sizes - the question is whether we need to match this to the dataset record size?
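
A sketch of that check (128k below is only the ZFS default recordsize; the actual value should be read from the dataset first):

zfs get recordsize nas_data1
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/nas_data1/benchmark_test_pool/1 --rw=read --bs=128k --size=50G --numjobs=1 --time_based --runtime=60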

I would not go around in circles any longer. For me, the problem is more than identified.

I cannot see anyone with more in-depth skills popping up to find a solution.

The problem exists; the guidelines regarding mirrors are wrong…

I have to say thank you to @nvs and @simonj, for all the patience and work to generate this clear picture out of the fog.

@protopia You are a very highly ranked member here; you have kept us more than busy for the last 2 weeks; hopefully it was not only entertainment. Please use your authority to find someone who is able to find a solution, or someone who has the authority to change the obviously wrong guidelines. For questions during the problem-solving process, I’m available.

Thank you all

1 Like

Or…

“The problem exists and it needs to be reported and fixed.”

And this thread started off saying that SCALE was slower than CORE. Now that we have narrowed it down, IMO we still need to establish whether this is the case, and possibly to narrow it down a little further.

Great - then you don’t need to participate any further. But this is NOT your thread - it is @nvs’ thread, and only he can say whether we have resolved his problem. I, for one, do not appreciate you trying to shut down other people’s efforts!!!

Personally I am less sanguine that we have narrowed it down sufficiently to report it, but I do think that we are close.

I think we need to get to the point where we can show that we have maxed out the disk reads in each category and checked that we haven’t done anything silly, e.g. with block sizes.
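
One way to show that, as a suggestion, is to capture per-disk utilisation during each run; if the drives sit near 100% busy, the disks themselves are the limit rather than ZFS. Assuming sysstat’s iostat is available on the SCALE box:

iostat -xm 5    # per-disk MB/s and %util every 5 seconds while fio is running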

:rofl: - I am not highly ranked at all :3rd_place_medal:, and I haven’t a clue where you got that idea. Nor do I have any clue as to why you think :brain: I was not taking this seriously :clown_face: and doing it for “only entertainment” :clap:.

If I am highly ranked for anything :1st_place_medal:, I guess it must be in bullshitting then :poop:. Or perhaps my 50 years IT :desktop_computer: experience in both technical and management roles has somehow resulted in a certain level of knowledge :bulb:.

:rofl: :rofl:

Authority? What authority?

I don’t work on any of the Linux, Debian, ZFS or TrueNAS support teams or organisations. I have some Linux experience (Ubuntu on an rPi 400) and some prior RAID experience; I have one TrueNAS SCALE system that I built last year but no CORE experience, and a heck of a lot of general IT infrastructure experience over several decades.

:rofl::rofl::rofl:

People can read back and see for themselves the essential (if not crucial) role you have had in this thread.

2 Likes

I will let Protopia handle it. You are right; I do not think the issue exists (other than the guideline issue). I could be wrong, but I have my reasons.

1 Like

I guess I understand.

Thanks