Here are some more tests I ran on my 5x wide RAIDZ1 (not directly applicable to mirrors, but here I am more interested in seeing what impact changing some of the fio parameters has).
Parallelism
This time I ran sudo zpool iostat -l hdd-pool 10
in parallel with the test, first with 5 and 16 jobs and then with 8, 16 and 24 jobs, to see what difference the job count made.
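In case anyone wants to reproduce the monitoring, here is a minimal sketch (same pool name as above; the fio line is a trimmed-down version of the ones in the script at the end) that starts the iostat loop in the background and stops it when the test finishes:

sudo zpool iostat -l hdd-pool 10 &    # latency stats every 10 seconds, running in the background
IOSTAT_PID=$!
fio --name TESTSeqWriteRead --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=16 --time_based --runtime=30
kill $IOSTAT_PID    # stop the background iostat once fio completes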
IOSTATs
When generating the files, which is done with async writes, I could reach 550-600MB/s. I think this reflects the batching that ZFS does when writing asynchronously to disk.
When reading I could only hit c. 250MB/s with 5 processes, but c. 400MB/s with 16 processes. This suggests that we should try with significantly more processes than 5.
The stats with 8, 16 and 24 processes were 380, 395 and 401MB/s respectively.
I think we should therefore focus only on tests with "prefetch=off", run them starting with 8 processes, and then keep adding 8 processes until the throughput levels off (a sketch of such a ramp-up loop is below).
And we should do this on a single drive, a 2x mirror and a 2x stripe of 2x mirrors.
Since we have data caching off, the size of the individual datafiles seems less important providing that we have several tens of GB.
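Here is a minimal sketch of that ramp-up, assuming the same pool, test directory and fio options as the script at the end (the 64-job ceiling is just an arbitrary stopping point):

for JOBS in 8 16 24 32 40 48 56 64; do
    echo "=== numjobs=$JOBS ==="
    fio --name TESTSeqWriteRead --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=$JOBS --time_based --runtime=30 | grep "READ:"
done

The grep just pulls out the aggregate READ bandwidth line from each run, so the point where it levels off is easy to spot.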
Blocksize
I tried a blocksize of 1K instead of 1M and got c. 1/10 of the throughput. So clearly the blocksize is important to these tests.
With 128K (the default dataset record size) the throughput was down only c. 10%-20%.
Obviously with fio we can specify the blocksize for each test, but in real workloads the I/O size is normally governed by the dataset record size, so choosing a recordsize wisely for each dataset - one that reflects both the size of the files and the performance characteristics you want - is going to be key to getting the maximum throughput from your drives.
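As an illustration (the dataset name is hypothetical), a dataset holding large media files could be created with a bigger recordsize and then benchmarked with a matching fio block size:

zfs create -o recordsize=1M hdd-pool/media    # 1M records suit large sequential files
fio --name RecordsizeTest --directory=/mnt/hdd-pool/media --rw=read --bs=1M --size=4G --numjobs=8 --time_based --runtime=30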
NOTE: I don't think the block size is particularly relevant to whether a mirror performs twice as well as an unmirrored drive, or to whether vDevs scale linearly. But when you are looking to see whether you are using up all the disk bandwidth, knowing that your benchmark runs enough processes to max out the disks is important, and maxing out is a good way of determining whether you need more parallelism as you add more disks.
How this relates to the reported problem
The problem reported was that SCALE was not performing as well as CORE (over SMB).
- I wonder whether the dataset record size was different (a quick way to check is sketched after this list).
- Writing a single stream over SMB doesn't give us the parallelism we appear to need. But that should be true on both SCALE and CORE.
- Let's see what parallelism is needed to max out a mirror / dual vDev - because if we can max out the disks with enough parallelism but we think we shouldn't need that level of parallelism to achieve it, then that is an entirely different problem than not being able to get decent disk bandwidth at all.
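For the record-size question above, the property is quick to compare (dataset name is hypothetical - run the same command on the SCALE box and the CORE box and compare the output):

zfs get recordsize hdd-pool/smbshare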
I think we are making some good progress here. Let's try to keep going a little longer.
P.S. Here is the script I am currently using:
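# Turn off ARC data caching and disable prefetch so reads come straight from the disks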
zfs set primarycache=metadata hdd-pool
zfs set prefetch=none hdd-pool
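# Sequential read tests at bs=1M with 24, 16 and 8 parallel jobs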
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=24 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=16 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=8 --time_based --runtime=30
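# Repeat at bs=128K, the default dataset record size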
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=24 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=16 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=8 --time_based --runtime=30
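# Restore the normal caching and prefetch settings, then remove the test files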
zfs set prefetch=all hdd-pool
zfs set primarycache=all hdd-pool
rm -rd /mnt/hdd-pool/disktest/*