Bad SMB write/read performance with 4 drives in 2x mirror configuration (RAID10)

The read speed with 4 disks for RAID10 and a 4 disk stripe should theoretically be very similar, if I understand it right.

Sorry, that was a typo; it's edited now.

Indeed, that's also how I understand it: RAID10 should theoretically come close to the read performance of a 4 disk STRIPE / 4 disk MIRROR, i.e. 4x the read performance of a single drive.

For completeness, I've just run the same tests as before on a 4 disk MIRROR pool.

dd results:

# WRITE TEST:
dd if=/dev/zero of=/mnt/nas_data1/benchmark_test_pool/tmp.dat bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes (209 GB, 195 GiB) copied, 878.251 s, 238 MB/s

# READ TEST:
dd if=/mnt/nas_data1/benchmark_test_pool/tmp.dat of=/dev/null bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes (209 GB, 195 GiB) copied, 320.144 s, 654 MB/s

fio results (usual disclaimer as before):

# WRITE TEST:
fio --name TESTSeqWrite --eta-newline=5s --filename=fio-tempfile-WSeq.dat --rw=write --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
WRITE: bw=918MiB/s (963MB/s), 918MiB/s-918MiB/s (963MB/s-963MB/s), io=68.4GiB (73.4GB), run=76213-76213msec

# READ TEST:
fio --name TESTSeqRead --eta-newline=5s --filename=fio-tempfile-RSeq1.dat --rw=read --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
READ: bw=20.0GiB/s (21.5GB/s), 20.0GiB/s-20.0GiB/s (21.5GB/s-21.5GB/s), io=195GiB (209GB), run=9750-9750msec
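For what it's worth, a sequential read of 20.0 GiB/s is far beyond what four HDDs can deliver, so fio is almost certainly re-reading the 500 MB test file from ARC rather than from the disks. A hedged variant that should behave more like the dd test, by making the test file as large as the dd file so it cannot fit in ARC (the filename is just an example):

# Sketch only: size the fio file like the dd test file so re-reads cannot be served from ARC
# (fio will lay out the 195g file first, which takes a while)
fio --name TESTSeqReadBig --filename=fio-tempfile-RSeq-big.dat --rw=read --size=195g --io_size=195g --blocksize=1024k --iodepth=32 --direct=1 --numjobs=1 --runtime=300 --group_reporting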

Considering dd results:

  • Write speed of 238 MB/s looks OK I guess, considering some overhead (it could maybe be a bit higher).
  • Read speed is only 654 MB/s, which seems suspiciously low to me. I expected it to approach the ~1.1 GB/s we also saw with the 4 disk STRIPE; oddly, it is almost exactly half of that, as if a 2 disk mirror had been created rather than a 4 disk one. Can anyone chip in on whether this is expected performance, and if not, what may be causing it? This may be a very good lead on what is causing the low read performance we see in RAID10! One way to check this is sketched below.
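A sketch of that check (it assumes the pool is named nas_data1, going by the mount path above): run this in a second shell while the dd read test is running and watch whether all four disks show read activity, or only one disk per mirror vDev.

# Per-vDev / per-disk throughput, refreshed every second while the dd read test runs
zpool iostat -v nas_data1 1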

Is it the same problem? It has the Solved mark. :wink:

Mysterious read speed cap around 300MB/s on both pools (4xNVME stripe, 24xHDD striped mirrors) in beefy server :wink:

OK, one more with a 2 disk MIRROR, and then I think I've covered the most relevant pool configurations. I was curious what we could get here, given the weird read speed results in the previous 4 disk MIRROR test.

dd results:

# WRITE TEST:
dd if=/dev/zero of=/mnt/nas_data1/benchmark_test_pool/tmp.dat bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes (209 GB, 195 GiB) copied, 796.85 s, 263 MB/s

# READ TEST:
dd if=/mnt/nas_data1/benchmark_test_pool/tmp.dat of=/dev/null bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes (209 GB, 195 GiB) copied, 710.596 s, 295 MB/s

fio results (usual disclaimer as before):

# WRITE TEST:
fio --name TESTSeqWrite --eta-newline=5s --filename=fio-tempfile-WSeq.dat --rw=write --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
WRITE: bw=1297MiB/s (1360MB/s), 1297MiB/s-1297MiB/s (1360MB/s-1360MB/s), io=87.9GiB (94.4GB), run=69386-69386msec

# READ TEST:
fio --name TESTSeqRead --eta-newline=5s --filename=fio-tempfile-RSeq1.dat --rw=read --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
READ: bw=19.9GiB/s (21.4GB/s), 19.9GiB/s-19.9GiB/s (21.4GB/s-21.4GB/s), io=195GiB (209GB), run=9796-9796msec

Considering dd results:

  • Write speed looks fine at 263 MB/s.
  • Read speed of only 295 MB/s is also much lower than expected here! This is essentially single disk read speed instead of 2 disk speed. So, as with the 4 disk MIRROR read test before, this looks very unexpected to me. I would have expected ~550 MB/s for this test (maybe a bit less with overhead).

I guess the "solution" is rolling back to TrueNAS CORE 13.3? Let me do that. Will report back.


Now on TrueNAS CORE 13.3. RAID10 configuration, with compression disabled on the dataset, as during the previous tests.

dd results:

# WRITE TEST:
dd if=/dev/zero of=/mnt/nas_data1/benchmark_test_pool/tmp.dat bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 461.151668 secs (454036427 bytes/sec)
So that's ~454 MB/s.

# READ TEST:
dd if=/mnt/nas_data1/benchmark_test_pool/tmp.dat of=/dev/null bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 370.700229 secs (564822029 bytes/sec)
So that's ~565 MB/s.

In comparison, on TrueNAS SCALE STABLE we got: Write: 463 MB/s, Read: 566 MB/s. So very similar.

fio results (usual disclaimer as before):

# WRITE TEST:
fio --name TESTSeqWrite --eta-newline=5s --filename=fio-tempfile-WSeq.dat --rw=write --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
WRITE: bw=2407MiB/s (2524MB/s), 2407MiB/s-2407MiB/s (2524MB/s-2524MB/s), io=186GiB (199GB), run=78947-78947msec

# READ TEST:
fio --name TESTSeqRead --eta-newline=5s --filename=fio-tempfile-RSeq1.dat --rw=read --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
READ: bw=25.7GiB/s (27.5GB/s), 25.7GiB/s-25.7GiB/s (27.5GB/s-27.5GB/s), io=195GiB (209GB), run=7600-7600msec

Considering dd results:

  • Read and write speeds are almost identical to what was measured with dd on TrueNAS SCALE STABLE before.

So switching to TrueNAS CORE 13.3 does not seem to help, unfortunately.

Still on TrueNAS CORE 13.3, now testing 4 disk STRIPE. Same test settings as before.

dd results:

# WRITE TEST:
dd if=/dev/zero of=/mnt/nas_data1/benchmark_test_pool/tmp.dat bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 197.849916 secs (1058275179 bytes/sec)
So that's 1058 MB/s, i.e. ~1.1 GB/s.

# READ TEST:
dd if=/mnt/nas_data1/benchmark_test_pool/tmp.dat of=/dev/null bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 186.723820 secs (1121333400 bytes/sec)
So that's 1121 MB/s, i.e. ~1.1 GB/s.

In comparison, on TrueNAS SCALE STABLE we got: Write: 1.1 GB/s, Read: 1.1 GB/s. So identical speeds.

fio results (usual disclaimer as before):

# WRITE TEST:
fio --name TESTSeqWrite --eta-newline=5s --filename=fio-tempfile-WSeq.dat --rw=write --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
WRITE: bw=6530MiB/s (6847MB/s), 6530MiB/s-6530MiB/s (6847MB/s-6847MB/s), io=195GiB (209GB), run=30578-30578msec

# READ TEST:
fio --name TESTSeqRead --eta-newline=5s --filename=fio-tempfile-RSeq1.dat --rw=read --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
READ: bw=25.7GiB/s (27.6GB/s), 25.7GiB/s-25.7GiB/s (27.6GB/s-27.6GB/s), io=195GiB (209GB), run=7593-7593msec

Considering dd results:

  • Exactly the same read/write speeds as seen previously on TrueNAS SCALE STABLE.

One important side note for clarity: When I refer in this forum thread to TrueNAS CORE STABLE, I mean TrueNAS CORE 13.0-U6.2. When I refer to TrueNAS CORE 13.3 I mean the TrueNAS CORE 13.3-RELEASE.


Still on TrueNAS CORE 13.3, now testing 4 disk MIRROR. Same test settings as before.

dd results:

# WRITE TEST:
dd if=/dev/zero of=/mnt/nas_data1/benchmark_test_pool/tmp.dat bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 929.977578 secs (225144843 bytes/sec)
So that's 225 MB/s.

# READ TEST:
dd if=/mnt/nas_data1/benchmark_test_pool/tmp.dat of=/dev/null bs=1024k count=195k
199680+0 records in
199680+0 records out
209379655680 bytes transferred in 320.348527 secs (653599559 bytes/sec)
So that's 654 MB/s.

In comparison, on TrueNAS SCALE STABLE we got: Write: 238 MB/s, Read: 654 MB/s. Write speed similar, read speed identical.

fio results (usual disclaimer as before):

# WRITE TEST:
fio --name TESTSeqWrite --eta-newline=5s --filename=fio-tempfile-WSeq.dat --rw=write --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
WRITE: bw=1519MiB/s (1593MB/s), 1519MiB/s-1519MiB/s (1593MB/s-1593MB/s), io=117GiB (126GB), run=78978-78978msec

# READ TEST:
fio --name TESTSeqRead --eta-newline=5s --filename=fio-tempfile-RSeq1.dat --rw=read --size=500m --io_size=195g --blocksize=1024k --fsync=10000 --iodepth=32 --direct=1 --numjobs=1 --runtime=60 --group_reporting
READ: bw=28.5GiB/s (30.6GB/s), 28.5GiB/s-28.5GiB/s (30.6GB/s-30.6GB/s), io=195GiB (209GB), run=6842-6842msec

Considering dd results:

  • Again, very similar or identical speeds to those seen previously on TrueNAS SCALE STABLE.

I will stop testing here, as I can't expect anyone to read all this.

Anyway, what's the take on expected speeds? What do we expect to see for RAID10 in a real-world scenario, and what, for example, for a 2 and 4 disk MIRROR? I think the speeds are too low, but who knows, maybe I am just wrong…!


Thank you for spending so much time bringing light into the dark; the result is not what we expected. Anyhow, we see a big difference between theory and practice. For me, the consequence is moving away from mirrors, because the read performance is not really a benefit (with ZFS).


I think what we are seeing here is some kind of a bug.

The idea that mirrors provide better performance is apparently true on Core, so the theory is proven to be correct.

If it is not the case on Scale in practice, that definitely suggests to me that it is a bug - could be a code bug in a Debian driver, could be a configuration bug in the default Debian configuration settings (Debian, SAMBA, ZFS) or it could be a bug in a TrueNAS SCALE override of these default settings.

So mirrors not being a practical benefit on Scale right now will only last until the cause is tracked down and fixed - though of course that may be never, if these performance indications of a bug get no traction with the support teams that need to track it down and fix it.

That said, if this is reproducible on any SCALE systems where the measurements are attempted, then it is presumably having an impact on every SCALE system worldwide, and quite possibly on every Debian system worldwide. If that isn't enough to get some traction, I don't know what is.


Impressive work to get all of this data! Really educational for all of us!

It is likely something related to hard drive drivers (or something controlling the whole disk subsystem).

What occurred to me is the following idea: we usually think that disk drive speed is about 250 MB/s (and looking at your CORE stripe test results this appears to be true). But these speeds only actually happen on the outer rims of your drives; as you fill the disk and move further towards the spindle center, the speed drops all the way down to somewhere around 100 MB/s (or so).

Now imagine this (I have no way of knowing if this is true, probably not): what if CORE writes from the outside in, and SCALE from the center outwards? If that is the case, then your speeds on SCALE are roughly within expectations.

A possible way of proving this is creating a very large file on the empty pool (think a couple of terabytes), then re-testing, and seeing if the result moves up…
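If someone wants to probe that theory directly, one hedged option is to read from the raw disk near the outer edge and again near the inner tracks and compare the speeds (Linux/SCALE dd syntax; the device name and the skip offset are placeholders to adapt to an actual pool member and its size):

# Read ~4 GiB from the start of a member disk (outer tracks)
dd if=/dev/sda of=/dev/null bs=1M count=4096 iflag=direct
# Read ~4 GiB starting far into the disk (inner tracks); adjust skip to ~90% of the disk size in MiB
dd if=/dev/sda of=/dev/null bs=1M count=4096 skip=3400000 iflag=direct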


Block allocation is not done by FreeBSD or Debian, but by ZFS. I would expect both versions of ZFS to allocate blocks in the same way.

Also, if the same tests show differences on SSDs (which don't suffer from this inside/outside track difference), then that would rule out this theory.

Remember that when I tested TrueNAS SCALE STABLE against TrueNAS CORE 13.3, the read/write speed results with dd were almost identical for all of the tests I ran:

  • RAID10
  • 4 disk STRIPE
  • 4 disk MIRROR

So it seems that the unexpectedly low read speeds when mirroring is involved are "systematic" across both versions. There are other differences I have measured via SMB which strongly suggest, for example, that SCALE has worse write speeds than CORE via SMB, but I think we should "handle" those later. The systematically bad read performance for configurations involving a mirror (i.e. RAID10, 2/4 disk MIRROR) seems the most mind-boggling to me right now.

Remember that we got pretty much the same speed results for SCALE STABLE vs CORE 13.3 in dd. So they aren't different in that regard.
Picking up your theory, maybe MIRRORED vs STRIPE handles this differently, but I really doubt it. I expect that, no matter whether it's MIRRORED or STRIPED, the drives are filled from the same side inwards, in which case I would not expect any difference due to what you suggested. Also, the speeds are basically cut in half when a mirrored configuration is used, which seems almost too much of a coincidence. So I think there must be some interesting answer to that.
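One hedged lead that might be worth checking: OpenZFS has tunables that control how reads are spread across mirror members (zfs_vdev_mirror_rotating_inc and related parameters). Whether they have anything to do with the halved read speed is pure speculation, but comparing their values between SCALE and CORE costs almost nothing:

# SCALE (Linux): list the mirror read-distribution tunables
grep . /sys/module/zfs/parameters/zfs_vdev_mirror_*
# CORE (FreeBSD): the equivalents should show up via sysctl
sysctl -a | grep mirror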

I appreciate you guys' kind words, BTW. It was a lot of effort, but I hope this indeed helps the community and that we can get to the bottom of this once and for all! With more and more people moving to 10 GBit, I expect more people may run into these same issues.

I would be very interested to see if anyone else could reproduce these test results with a similar setup.

BTW, are there any iX Systems experts on this forum? It would be really nice to get some official comment on the low read speed when mirrors are involved.

I am really not absolutely clear what you mean by these three classifications:

  1. ZFS does not have a formal pool type called RAID10 - "RAID" plus anything other than RAIDZ1/2/3 normally suggests some sort of hardware-based RAID, which is a no-no. I know you aren't using it to refer to hardware RAID, but RAID10 normally refers to a stripe of mirrors, which is the usual way to define a mirrored pool with more than 2 mirrored drives, i.e. a stripe across 2 vDevs, each of which is a 2-way mirror, i.e. 2 copies of the data.[1]

  2. I am assuming that when you say a 4-disk stripe, you mean striped across 4 vDevs each of which consists of a single non-mirrored drive i.e. 1 copy of the data.

  3. I am assuming that when you say a 4-disk mirror, you mean a single vDev consisting of a 4-way mirror i.e. 4 copies of the data.

If you mean anything different then you need to be clearer. And I think we should try to use an unambiguous ZFS-centric terminology to define the configurations.

When performance testing reading data, there are a few things we need to consider:

  1. Sequential read-ahead - ZFS reads the first block and sends it to the requestor, and in the meantime it starts to read the next block (and it might have read it by the time the requestor has received and processed the first block and then requests the next one). If the next request is for the next piece of the file, then you will get a different performance result than if the next request is for a different part of the file or for a different file.

  2. Metadata access - reading random blocks from the same file will have different characteristics than reading random files due to more metadata reads being needed.

  3. ARC - obviously any blocks already in ARC before the test starts will impact the results - so ARC should ideally be tiny or cleared before the test starts (a quick way to check and clear it is sketched right after this list).

  4. Data placement across the vDevs - if data is actually located on a single vDev you will obviously get different results than if it is spread evenly across multiple vDevs.

  5. Parallelism - if you have a single stream of data, and you are not doing sequential read-ahead, then you will effectively only have one read at a time on a single vDev. You probably need many parallel streams to measure disk performance.

  6. Seek optimisation - For mechanical heads, ZFS may do its own optimisation of queued reads to reduce average seek time and improve throughput. Additionally, AHCI Native Command Queuing allows multiple operations to be optimised inside the drive on top of (or instead of) any optimisations done by ZFS. So it's complicated.

  7. And finally mirroring - For a read from a single stream, it should read the data only from one of the mirror drives and only go to the other one if the checksum fails, in order to try to fix it. With multiple streams, it should ideally split the I/Os across all the available mirror drives and thus get better performance. But how it does this may also be important. Assuming that a single drive can only do one operation at a time and that you don't do any sort of seek optimisation to resequence I/Os, queuing theory suggests that it should use a single-queue, multiple-server model. But ZFS may do something more complex if that works better.
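On the ARC point above, a quick hedged way to check the current ARC size and to flush the benchmark pool's cached data between runs (pool name assumed from the mount path used in the tests; exporting requires that nothing else is using the pool):

# SCALE (Linux): current ARC size in bytes
grep -w size /proc/spl/kstat/zfs/arcstats
# Exporting and re-importing the benchmark pool should drop its data from ARC between runs
zpool export nas_data1
zpool import nas_data1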

So, how do these apply to the performance tests we are running and these 3 ways of configuring a pool:

  1. I think we need to stop comparing different sizes of pool - and start to compare equal-sized pools, i.e. one vDev only, consisting of 1, 2, 3 or 4 drives in a mirror (yes, I know 1 drive isn't a mirror, but you get what I mean).

  2. We need to run multiple parallel streams and certainly at least as many streams as there are disks in the mirror - but I am unclear whether we should run the same number of streams regardless of the mirror width (so we get more optimisation of reads when there are fewer drives but greater consistency of requests), or whether we should scale the streams with the number of drives (in order to keep the queuing optimisation consistent).

  3. We should try to use a tiny ARC and random reads to a very large file to eliminate caching and read-ahead and get real reads-on-request from the drives themselves (a concrete sketch follows below).

  4. All other activity on the system should be quiesced.

  5. We should probably do this with SATA SSDs and HDDs just so we can see if seek optimisation makes a difference.

  6. We should do this using a local script or programme to eliminate network impacts.

I think if we do the above we will get a real understanding of the Disk performance without other factors coming into play.
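As a concrete starting point for points 2 and 3 above, something along these lines might do. This is only a sketch for SCALE/Linux: the 2 GiB ARC cap, the 200g file size and the dataset path are assumptions to adapt, and the fio options will likely need tuning:

# Temporarily cap ARC at 2 GiB so cached data cannot dominate the result (echo 0 restores the default)
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
# 4 parallel streams of random 1 MiB reads against a file far larger than the capped ARC
fio --name RandReadMulti --filename=/mnt/nas_data1/benchmark_test_pool/fio-large.dat --rw=randread --size=200g --blocksize=1024k --direct=1 --iodepth=8 --numjobs=4 --runtime=120 --time_based --group_reporting
echo 0 > /sys/module/zfs/parameters/zfs_arc_max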


[1] As opposed to a mirror of stripes - which I am not sure you can actually create in ZFS anyway.

Uff, this is quite something to process (ChatGPT?)… First of all, I think my classifications are generally pretty clear, no? I also explained in my first post exactly how I configured the RAID10 in TrueNAS. Please share your suggested "unambiguous ZFS-centric terminology to define the configurations" so it's clear what should be used instead, if it's not clear already.

I'm not sure what you mean, actually. I tested single disk, 2/4 disk MIRROR and RAID10 configs, as explained. I don't see how we would test 1/2/3/4 drives in a mirror config in a different way?

The suggested points on what to test next sound very extensive and like a lot of work; maybe that's more something for iX Systems themselves. I don't see it as feasible to do myself, also because I don't even understand half of the things.

I have asked several times previously for clear instructions on what tests to perform, have now spent a good week full time on this, and only now am I starting to get some feedback on which tests to perform. Don't get me wrong, but this kind of guidance on how to do benchmarking would have been more useful earlier in the process (and I also think that the benchmark tests done today were useful).

That said, maybe someone else would like to take on the challenge of trying all the things you suggested and continue from here; that could be interesting.

Written without AI by me.

There is no such thing as RAID10 in TrueNAS - you stripe across vDevs, and vDevs can be single disks, or mirrored or RAIDZ.

My own examples (and the above one) use ZFS-centric terminology of stripes and vDevs and Mirrors or RAIDZ1/2/3.

What I mean is that a stripe across 2 vDevs, each of which is a single disk, has twice the capacity of a single disk, whereas a single-vDev 2-drive mirror has the same capacity as a single disk. When you say a 4 disk mirror, you are actually NOT talking about a 4-disk mirror at all - you are talking about a stripe across 2 vDevs, each of which is a 2-disk mirror, giving twice the capacity of a single disk. (Hence the need to be clear about your configuration and to use terminology which is unambiguous. A 4-disk mirror is a single vDev with 4 drives in a mirror configuration, whereby you have the usable capacity of only a single drive, because every piece of data is written 4x to all 4 drives and you can lose 3 of the 4 disks and still not lose any data.)
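To pin these layouts down unambiguously, this is roughly how each would be created at the command line (pool and disk names are placeholders):

# Stripe of mirrors (what this thread calls RAID10): 2 vDevs, each a 2-way mirror
zpool create tank mirror da0 da1 mirror da2 da3
# 4-way mirror: a single vDev, every block written to all 4 disks
zpool create tank mirror da0 da1 da2 da3
# 4-disk stripe: 4 single-disk vDevs, no redundancy
zpool create tank da0 da1 da2 da3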

And what I am saying is that the comparison should be between disk configurations with the same usable capacity, i.e. a single vDev, mirrored or unmirrored. What is unclear about that?

I do appreciate how frustrating and time-consuming this is. And I can sympathise that you may wish we had got to this point without all the intermediate discussions, but that is not how these sorts of nebulous diagnostics work in practice: you start with a problem you notice (Windows Explorer network write performance), then you carry out more measurements to try to determine what the problem is and to narrow down the cause. With your system, time and effort we have made a lot of progress, knowing it is NOT a network or SMB issue, but also that it is even more of a read issue than the initial write issue you raised. All I can say is:

  1. We are doing our best to help you; and
  2. You are the person with the problem and a measurable one at that;
  3. You have said how important this performance is to your specific use case;
  4. You may have identified a major world-wide issue with Debian (and all its derivative distros);
  5. IMO if you try to hand it off in the current state, it will likely be closed without a proper investigation because we have not yet got anything definitive and narrowed down. But by all means log an issue with iX and see if they are prepared to look into it.
  6. Those of us trying to help you may not have environments where we can recreate the issue (I don't), and even if we have spare hardware we may not be able to reproduce the problem. We also have our own workloads and priorities. (I have a production TrueNAS system, and no lab hardware I can use. I have a busy life outside TrueNAS and IT, and giving advice on these forums using my decades of widespread corporate IT experience and my single year of TrueNAS experience is my way of giving something back in return for being able to use this free software. I am not paid TrueNAS support any more than you are.)

Ultimately it is up to you whether to do a few more tests or not. You have done some great work so far, and I think it would be a pity if you dropped this now - but it is your time and effort and so your choice as to whether you keep at this until we narrow it down further or whether you drop it and move on.


Sure, you can't select RAID10 in the dropdown in TrueNAS. That's why I made clear in my first post exactly how I configured the RAID10 (2 vDevs with 2 drives each).

No, when I say 4 disk MIRROR it's exactly the case you describe: a single vDev with 4 drives in a mirror. The other case you describe is what I explained in my first post / see the thread subject line (and what I call RAID10: 2 vDevs with 2 disks per vDev).

Please read my test results again. You seem to misunderstand. Those tests were done.
Single drive capacity: single drive, "2 disk mirror", "4 disk mirror"
Two drive capacity: "2 disk stripe"
Four drive capacity: "4 disk stripe"

About the rest: Don't get me wrong, I obviously very much appreciate the support of everyone who has contributed to this; please don't let that come across differently. What bothers me a bit, in line with what @B52 has flagged before, is that some of your replies get a little derailed, maybe talking about how the network/benchmarking is to blame, or sometimes making (somewhat strong) assumptions. Again, I believe you are acting in the best interest to help. But you have now raised the point of doing meaningful benchmark testing several times. When I asked several times for some "simple" benchmark commands to do such meaningful benchmark testing, I got no reply. But after days of doing my best to do such benchmarking, I do get a reply saying "we need more parallel streams, xyz" to do proper benchmarking.
Bottom line: some more hands-on, specific advice would go a long way. Again, I can't expect anything from anyone here, and I also don't want to. All help is appreciated, but please see where I am coming from when, given what I just explained, I read such a long reply with so much information again. Also, please bear in mind that my knowledge in this field is very limited, so please have some mercy if I make mistakes or don't understand something…

I might do some more tests myself, but as said, I cannot dedicate much more time to this issue, as I have limited time just like everyone else here. I am new to this forum, and I had hoped for a bit more feedback from experts on simple questions like what real-world speeds to expect. I don't think we have got a clear idea even about that yet.

Anyway, help is appreciated, and please let's move forward…
