Bad SMB write/read performance with 4 drives in 2x mirror configuration (RAID10)

Some more tests I ran on my 5x wide RAIDZ1 (not directly applicable to mirrors, but I am more interested here in seeing what impact changing some of the fio parameters has).

Parallelism

This time I ran sudo zpool iostat -l hdd-pool 10 in parallel with the test, first with 5 and 16 jobs and then with 8, 16 and 24 jobs, to see what difference the number of jobs made.

IOSTATs
When generating the files, which is done with async writes, I could reach 550-600MB/s. I think this reflects the batching that ZFS does when writing asynchronously to disk.

When reading I could only hit c. 250MB/s with 5 processes but 400MB/s with 16 processes. This suggests that we should try with significantly more processes than 5.

The stats with 8, 16 and 24 processes were 380, 395 and 401MB/s respectively.

I think we should therefore focus only on tests with "prefetch=off", and run tests starting with 8 processes, then keep adding 8 processes until throughput levels off.

And we should do this on a single drive, a 2x mirror and a 2x stripe of 2x mirrors.

Since we have data caching off, the size of the individual datafiles seems less important providing that we have several tens of GB.
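A rough sketch of that 8-processes-at-a-time sweep (the job name and the 40-job upper bound are just placeholders; directory, block size and runtime match the script further down):

# Sweep numjobs in steps of 8 and note where read throughput levels off
for jobs in 8 16 24 32 40; do
    fio --name SweepSeqRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest \
        --rw=read --bs=1M --size=4G --numjobs=$jobs --time_based --runtime=30 --group_reporting
done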

Blocksize

I tried a blocksize of 1K instead of 1M and got c. 1/10 of the throughput. So clearly the blocksize is important to these tests.

With 128K (the default dataset record size) the throughput was down only c. 10%-20%.

Obviously with fio we can specify the blocksize for each test, but this would normally match the dataset record size, so choosing a recordsize for each dataset that reflects both the size of the files and the performance characteristics you want is going to be key to getting the maximum throughput from your drives.
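For example (the dataset name here is illustrative only, and a recordsize change only affects blocks written afterwards):

# Check the record size currently in effect for a dataset
zfs get recordsize hdd-pool/disktest
# Use a 1M record size for datasets that mostly hold large, sequentially read files
zfs set recordsize=1M hdd-pool/disktest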

NOTE: I don't think that the block size is particularly relevant to whether a mirror performs twice as well as an unmirrored drive, or to whether vDevs scale linearly. But when you look at whether you are using up all the disk bandwidth, knowing that your benchmark has enough processes to max out the disks is important, and maxing out is a good way of determining whether you need more parallelism as you add more disks.

How this relates to the reported problem

The problem reported was that SCALE was not performing as well as CORE (over SMB).

  1. I wonder whether the dataset record size was different.
  2. Writing a single stream over SMB doesn't give us the parallelism we appear to need. But that should be true on both SCALE and CORE.
  3. Let's see what parallelism is needed to max out a mirror / dual vDev - because if we can max out the disks with enough parallelism, but we think we shouldn't need that level of parallelism to achieve it, then that is an entirely different problem from not being able to get decent disk bandwidth at all.

I think we are making some good progress here. Let's try to keep going a little longer.

P.S. Here is the script I am currently using:

# Cache only metadata in ARC and disable prefetch so reads have to come from the disks
zfs set primarycache=metadata hdd-pool
zfs set prefetch=none hdd-pool

# Sequential reads at 1M and 128K block sizes, with 24 / 16 / 8 parallel jobs
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=24 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=16 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=1M --size=4G --numjobs=8 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=24 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=16 --time_based --runtime=30
fio --name TESTSeqWriteRead --eta-newline=5s --directory=/mnt/hdd-pool/disktest --rw=read --bs=128K --size=4G --numjobs=8 --time_based --runtime=30

# Re-enable caching and prefetch, then clean up the test files
zfs set prefetch=all hdd-pool
zfs set primarycache=all hdd-pool
rm -rd /mnt/hdd-pool/disktest/*

I have done some further research, and someone on Reddit has pointed me to a whole bunch of ZFS module parameters that decide how "sticky" (for lack of a better word) a device in a mirror vDev is.

  • zfs_vdev_mirror_rotating_inc
  • zfs_vdev_mirror_non_rotating_inc - default 0
  • zfs_vdev_mirror_rotating_seek_inc
  • zfs_vdev_mirror_rotating_seek_offset
  • zfs_vdev_mirror_non_rotating_seek_inc

The calculation for spinning rust is as follows:

  1. Start with zfs_vdev_mirror_rotating_inc - default 0
  2. See whether the next read's offset is within zfs_vdev_mirror_rotating_seek_offset of the last read's offset - default 1MB
  3. If it is within that amount, add half of zfs_vdev_mirror_rotating_seek_inc; if not, add the full zfs_vdev_mirror_rotating_seek_inc - default 5

So if the next read is within 1MB of the previous one, the factor will be 2; if not, 5. How this factor is combined with the disk load to decide which drive to use is unclear from the documentation - I would have to read the code.
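For anyone wanting to look at these on a SCALE box, they are OpenZFS module parameters, so something like this should show and (temporarily) change them - which values are actually worth trying is still an open question:

# Show the current mirror load-balancing tunables and their values
grep . /sys/module/zfs/parameters/zfs_vdev_mirror_*
# Example of a temporary change (reverts at reboot)
echo 10 | sudo tee /sys/module/zfs/parameters/zfs_vdev_mirror_rotating_seek_inc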

However, it seems that ZFS prefers to a significant extent to keep using a single drive of the mirror, especially if the seek distances are small - this might well cause reads of physically adjacent blocks to be limited to a single drive rather than being spread over two drives, particularly with a low number of processes / streams (or a single one), which is where we came in.

So we now have three potential factors to explore tweaking:

  1. The impact of parallelism vs. a single stream
  2. Block size
  3. Pool parameters

That said, I am beginning to think that what the OP was seeing is a single-stream phenomenon where both disks are idle between one read (plus prefetch) and the next request that requires a physical read, so it doesn't necessarily matter which disk reads the next block because both are idle anyway. (Note: this is absolutely more of an update needed to the guidelines, as @B52 suggested.)

When you have the same number of streams as physical disks, with each stream reading consecutive blocks, then we would probably want each stream to monopolise a disk, and with the parameters above this should happen - but a tweak to these parameters might help it happen more reliably.

I am also beginning to think we should test with prefetch=on now.

Just to note, going back it looks like in your fio test your size was only 500MB; your file size should be larger than your total RAM to avoid it being cached...

I am not sure which test you are looking at, but the test I started to use was 16GB and my revised tests are 4GB per job.

Sorry for necroposting, but today I've spent a few hours playing with the mysterious zfs_vdev_mirror_rotating_inc et al in my SCALE system. I have four [somewhat] mismatched HDDs in 2x mirror vdevs and I've suffered inconsistent/slow sequential big block reads (thruput perhaps 1.5x to 2x that of a single HDD).

Because I'm a huge nerd with no life I dedicate one of four displays to a continuous iostat and a continuous arcstat. One thing that's frustrated me is the %util column in iostat during these big block seq. reads.

IMHO at least one if not two disks ought to be pegged to 100% were the system truly extracting all possible performance from this humble pool. Watching them pull back from 100% for several seconds has me wondering just wtf my TrueNAS is so busy doing that it can't keep these lowly SATA HDDs fully saturated.

I can't answer that question, nor can I answer why rareq-sz is so often in the 128-256k range when recordsizes, blocksizes, and request sizes are cranked to the moon at every layer.

BUT I've found a tunable combination that seems to help. rareq-sz is now closer to 1M, at least one spindle is maintaining 100 %util during most iostat samples, and overall read throughput from the pool is perhaps 2.5-3x that of a single drive. It no longer "stumbles" with 1-3 lousy samples in a row.

  1. /sys/block/*/queue/scheduler to none. This helped cure the "stumbles" where zfs was seemingly backing down from holding the HDDs' feet to the fire. It didn't make the normal samples any faster.
  2. zfs_vdev_mirror_rotating_inc ≥ 1. I'm on 5 but the exact number doesn't seem to matter much. The important point is any non-zero value hurt my performance until...
  3. zfs_vdev_async_read_max_active = 1; zfs_vdev_async_read_min_active = 1; zfs_vdev_sync_read_max_active = 1; zfs_vdev_sync_read_min_active = 1. This came as a surprise and just seemed wrong until I spotted something in the zfs docs:

zfs_vdev_max_active
The maximum number of I/Os active to each device. Ideally, zfs_vdev_max_active >= the sum of each queue's max_active.

Once queued to the device, the ZFS I/O scheduler is no longer able to prioritize I/O operations. The underlying device drivers have their own scheduler and queue depth limits. Values larger than the device's maximum queue depth can have the effect of increased latency as the I/Os are queued in the intervening device driver layers.

So zfs_vdev_mirror_rotating_inc is a throttling mechanism of sorts, but with a tank of pending I/Os sitting between the algo and the devices it likely can't accomplish its task and just gets in the way. Smashing the zfs_vdev_*_read_*_active tunable quartet to 1 allows zfs_vdev_mirror_rotating_inc to get up close and personal with the drives and micro-manage their work. At least that's my theory.
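For reference, a sketch of those three steps as shell commands (sda-sdd are placeholders for the pool's HDDs; all of this reverts at reboot):

# 1. Set the Linux I/O scheduler to none on the pool's HDDs (device names assumed)
for disk in sda sdb sdc sdd; do
    echo none | sudo tee /sys/block/$disk/queue/scheduler
done
# 2. Non-zero rotating increment (5 in my case)
echo 5 | sudo tee /sys/module/zfs/parameters/zfs_vdev_mirror_rotating_inc
# 3. Limit each read queue to a single active I/O per device
for p in zfs_vdev_async_read_min_active zfs_vdev_async_read_max_active \
         zfs_vdev_sync_read_min_active zfs_vdev_sync_read_max_active; do
    echo 1 | sudo tee /sys/module/zfs/parameters/$p
done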

  4. zfs_vdev_read_gap_limit at the default (32K) wasn't nearly enough for my system. I'm doing sequential, in-order, back-to-back reads so this shouldn't matter, right? But it does. This one was pretty significant. I went with 2M plus a little extra to accommodate the 2M NTFS clusters in my iSCSI-accessed zvol. The thought is to allow the client to hopscotch over one cluster without breaking stride. I need to play with this one more.

  5. Minor players: zfs_vdev_mirror_rotating_seek_offset and zfs_vdev_aggregation_limit. Set to the same value as the read gap limit above; needs more testing.

  6. Not useful in my case: zfetch_min_distance, zfetch_max_distance, and zfs_vdev_max_active.

I may have murdered my 4k random performance, and hopefully I didn't break the sensitive-to-free-space-write-balancing-across-vdevs mechanism. Whatever it's called... And who knows how it would perform with flash media while tuned like this. But my spinning rust reads are noticeably faster. "It became necessary to destroy the town to save it"

I'm also tuning my Windows iSCSI client, as it was ~25% slower than similar fio reads run locally on the SCALE host. Bumping the transfer sizes in the registry to 1MB seems to be helping - needs more testing.

Hope this is helpful or at least interesting to members of this thread.


Definitely interested in this. No time today to look at the details, but I will try to come back at some point soon and review.

That said, my starting point is generally that the zfs experts know more than I do, and their defaults are normally a good balance across the majority of environments.

However, ZFS has a lot of optimisation, with different optimisations for HDDs (with seek delays) and SSDs (without), and I agree with you that it doesn't seem to make sense to let the disks try to optimise further by queuing more than the currently executing I/O and the immediately following one (to avoid even the short gap between completing one I/O and the O/S having sent the next one).

Any additional experimentation you can do with the various I/O schedulers and ZFS would be interesting.

If you can recall what the parameters were before you changed them that would also be useful.


Agree on the defaults. That said, a small pool of slow mirror vdevs is a rather common scenario and this isn't a new problem.

After posting the above I surfed some Google hits on zfs_vdev_max_active and it seems I'm not the first person to notice spinny rust mirror vdev read throughput picks up after a max_actives smashdown.

One of those hits was from the old version of this very forum - a post from 2014 where a spoilsport really polished his keycaps telling my predecessor to knock it off with the tunables because reasons. Reminds me of the lawyer in Jurassic Park when the kid grabbed night optics from the back of the SUV:

"Is it heavy? Then it's expensive. Put it back!"

One minute later that lawyer was eaten by a tyrannosaurus, BTW. Glory!

max_actives pinned to 1 murders flash throughput. I discovered this later when a string of heavenly L2ARC hits did little to help the read operation. I reckon with slow HDDs you don't notice the latency inherent to doing one-by-one I/O because it's swamped by the drive's own latency.

Trouble is I don't see a way to provide different max_actives for multiple devices. Right now it's looking like the rust has to suck or the flash has to suck. Still chipping away at this...

ZFS is a cruel mistress.


The HDD in-drive optimisation is based on the fact that the physical head seek time and rotational delay are a lot longer than the I/O transfer time - so it makes sense, for overall average response time and throughput, to resequence a queue of I/Os so that the head sweeps back and forth doing I/Os as it goes rather than leaping about due to un-optimised I/Os. Obviously this optimisation makes zero sense for SSDs (et al) where neither physical seeks nor rotational delays apply, and the I/O queues in SSDs are there to avoid unused gaps in I/O capacity whilst waiting for the CPU to recognise that one I/O has completed and send the next one. I suspect that it may be even more complex for NVMe and Optane drives, which may well have multiple PCIe lanes and be able to do multiple I/Os simultaneously.

I would have thought that max_actives = 2 was optimum for both SATA HDDs (where ZFS is doing the optimisations) and SSDs, but NVMe and Optanes might need something significantly higher.

Are these tuneables system wide or are they per pool or per vDev?

(And apologies for not yet having had the time to review your detailed tuneable experiments.)


The tunables appear to be system-wide.

So the reads from spinning rust - at least my reads - appear to be mostly async thanks to the disk read-ahead gnomes. L2ARC reads appear to be mostly sync. I'm looking for an iostat-type utility where I can observe this in real time. On my system zpool iostat -r seems oblivious to CACHE device reads, otherwise it'd be perfect.
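Something that might get close, assuming the flags behave the same on SCALE's OpenZFS build, is the queue view of zpool iostat, which breaks out the sync and async read queues per device when combined with -v (pool name borrowed from earlier in the thread):

zpool iostat -v -q hdd-pool 1

Whether the CACHE device rows show anything useful in that view is something I still need to check.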

With this in mind I've restored zfs_vdev_sync_read_max_active to defaults whilst keeping zfs_vdev_async_read_[*]_active set to 1. Thus far I seem to be having my cake and eating it too.

Still playing with / testing over here.


I'm back with something slightly better for my HDD reads. It doesn't screw up SSD reads like my first attempt did.

/sys/block/*/queue/scheduler still goes to none

zfetch_min_distance to 4194304 (default is 2MB)
zfetch_max_distance to 67108864 (default is 8MB)

zfs_vdev_async_read_min_active to 1 (back to what I think is the default)
zfs_vdev_async_read_max_active to 3 (also restoring the default)

zfs_vdev_sync_read_min_active to 10 (ZFS defaults listed as 10 but my TrueNAS appears to default to 2 - did ix tweak this?)
zfs_vdev_sync_read_max_active to 10 (likely the default)

zfs_vdev_mirror_rotating_inc to 0 (back to the default)

This one is crucial:
zfs_vdev_read_gap_limit to 1114112 (default is 32k)

Less crucial but still fairly impactful:
zfs_vdev_aggregation_limit to 8454144 (default is 1MB)

This is the breakthrough that allowed me to lift max_actives above 1, thereby restoring performance for my SSD SLOG and L2ARC devices:
zfs_vdev_mirror_rotating_seek_inc to 0 (default is 5)
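Pulled together as a script (the sysfs paths are the standard OpenZFS-on-Linux ones, the sd? glob is assumed to match only the pool's HDDs, and none of this survives a reboot without an init script or similar):

# HDD read tuning from this post, applied at runtime
for disk in /sys/block/sd?; do
    echo none | sudo tee $disk/queue/scheduler
done
P=/sys/module/zfs/parameters
echo 4194304  | sudo tee $P/zfetch_min_distance
echo 67108864 | sudo tee $P/zfetch_max_distance
echo 1        | sudo tee $P/zfs_vdev_async_read_min_active
echo 3        | sudo tee $P/zfs_vdev_async_read_max_active
echo 10       | sudo tee $P/zfs_vdev_sync_read_min_active
echo 10       | sudo tee $P/zfs_vdev_sync_read_max_active
echo 0        | sudo tee $P/zfs_vdev_mirror_rotating_inc
echo 1114112  | sudo tee $P/zfs_vdev_read_gap_limit        # the crucial one
echo 8454144  | sudo tee $P/zfs_vdev_aggregation_limit
echo 0        | sudo tee $P/zfs_vdev_mirror_rotating_seek_inc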

I'm leaning toward something not quite right about the load-balancing heuristics in ZFS vdev_mirror.c. Maybe it's great when vdev components are perfectly matched, symmetric, parallel, congruent, and harmonized. With my junk-drawer anything-goes zpool it has to be neutered via seek_inc at zero.

End result: reading six-point-something GB files from HDD with a Windows iSCSI client, which used to take around 30 secs, is now an 18-20 sec process. It was 22-24 secs with the settings I'd posted here a few weeks ago.

As a side note, I briefly had a second Windows iSCSI client and noticed the same large uplift in throughput after setting transfer sizes in the registry to 1MB (see #125 above).
