ZFS RAIDZ2 read performance issues

Hello,

I have a pool consisting of a single RAIDZ2 vdev with 8 hard drives connected via a LSI 9211-8i HBA in IT mode.

These are 4x Seagate Exos X20 20TB SATA drives and 4x WD Ultrastar DC HC560 20TB SATA drives. On paper they are pretty similar specs-wise, with the WD drives ever so slightly better performing. All the drives have been “burnt in” with extended read/write tests demonstrating consistent individual drive performance and no errors. All drives have no SMART errors, no errors in the controller logs and no ZFS errors.

For some reason my sequential read speeds seem to be capped at around ~240MB/s. It’s a solid and consistent 240MB/s as well; it doesn’t dip or increase. Write speeds are fine, fluctuating at around 1GB/s.

From iostat and zpool iostat there appears to be a bottleneck with the on-paper faster WD Ultrastar drives. If I attempt to sequentially read a large file from the pool, it gets pegged at around 240MB/s, showing 100% util on the 4 Ultrastar drives and around 20% on the Exos drives.

zpool iostat

                                            capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                      alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
xxxxx                                     45.0T   100T    934      0   235M      0   91ms      -   14ms      -  226ms      -   52ms      -      -      -      -
  raidz2-0                                45.0T   100T    934      0   235M      0   91ms      -   14ms      -  226ms      -   52ms      -      -      -      -
    se1                                       -      -    173      0  28.8M      0    1ms      -    1ms      -      -      -   63us      -      -      -      -
    se2                                       -      -    172      0  28.7M      0    1ms      -    1ms      -      -      -   60us      -      -      -      -
    se3                                       -      -    172      0  28.8M      0    1ms      -    1ms      -      -      -   64us      -      -      -      -
    se4                                       -      -    172      0  29.0M      0    1ms      -  963us      -      -      -   29us      -      -      -      -
    wd1                                       -      -     62      0  30.4M      0  335ms      -   50ms      -  201ms      -  204ms      -      -      -      -
    wd2                                       -      -     61      0  29.9M      0  340ms      -   50ms      -  268ms      -  201ms      -      -      -      -
    wd3                                       -      -     59      0  29.4M      0  315ms      -   50ms      -      -      -  204ms      -      -      -      -
    wd4                                       -      -     60      0  29.9M      0  386ms      -   50ms      -  201ms      -  204ms      -      -      -      -

iostat

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
loop0            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
nvme0n1          0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
wd1             60.00     30.00     0.00   0.00   49.93   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.00 100.00
wd2             60.00     30.00     0.00   0.00   49.90   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.00  99.60
wd3             60.00     30.00     0.00   0.00   49.95   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.00 100.00
wd4             60.00     30.00     0.00   0.00   49.95   512.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    3.00 100.00
se1            181.00     30.34     0.00   0.00    1.30   171.62    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.24  10.80
se2            177.00     30.16     0.00   0.00    2.27   174.49    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.40  15.60
se3            177.00     30.34     0.00   0.00    2.55   175.50    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.45  20.00
igr              0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
se4            175.00     30.17     0.00   0.00    2.47   176.53    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.43  18.80

I can’t figure out what is happening. For some reason the WD drives seem to get far fewer, but much larger, read requests which cause the drives to have much higher latency? The Seagate drives get many more but smaller requests, yet all are reading the same amount of data. Why would there be this difference, or am I misinterpreting the output? It’s pretty consistent across all the WD drives so I don’t think the drives are bad, unless all 4 of them are (and all very consistent in their failure symptoms haha)

If anyone can offer any ideas or suggestions I’d really appreciate it.

Thanks!

Even though the LSI 9211-8i is only 6 Gb/s, that bandwidth should be more than sufficient for 8 HDDs.
(If the HBA is not installed in a server chassis with proper airflow, but for example in a standard PC case, it could suffer from heat-related issues and should be actively cooled.)

How are the HDDs connected to the HBA?

for example:
HBA Port A: SFF-8087 → 4× SATA (Seagate)
HBA Port B: SFF-8087 → 4× SATA (WD)

Or via a backplane?
If the burn-in was performed in this same configuration (possibly even with all drives simultaneously), and the WD drives showed their expected performance, then the HBA and cables/backplane can probably be ruled out as the cause.

Otherwise, it could still be the cables, since SATA has a maximum recommended length of ~1 m, and inexpensive or low-quality cables can occasionally cause issues.

A useful test could be to swap the HBA ports for the WD and Seagate drives and see if the problem moves with the drives.

It’s in an HP ProLiant Gen9 server chassis. The HBA connects to two 4-bay backplanes via 2x SFF-8087 → SFF-8087 cables which came with the server.

Just tried swapping the ports over. I’ve tried moving all the WD drives to the other port and I’ve also tried mixing 2x WD and 2x Seagate on one backplane and the rest on another. It doesn’t seem to make any difference, the WDs still have the high latency and 100% util when reading.

They were in this server and all tested using badblocks in parallel across all the drives simultaneously, and there were no bottlenecks during this; the read/write performance of the drives individually matched the expected performance of the drives.
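For anyone curious, a parallel badblocks burn-in along these lines can be sketched as follows. This is a sketch, not the poster’s exact commands: the device names are placeholders, and the `-w` write-mode test is destructive, so only run it on drives with no data.

```shell
# DESTRUCTIVE: badblocks -w overwrites the entire drive.
# Device names are placeholders, not the actual devices used here.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
  badblocks -b 4096 -ws -o "badblocks-${dev##*/}.log" "$dev" &
done
wait   # all drives are exercised simultaneously, as described above
```

Running the drives simultaneously is the point: it stresses the HBA, cabling, and backplane along with the drives, which is why a clean parallel burn-in makes those components less likely suspects.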

:confused:

Is this exclusively over smb or replicated between local pools too?

It’s really odd, and I’m not sure it’s a hardware problem. When I’m read testing I’m just literally reading some large 50GB+ video files; I wonder if there is a problem with how these have been written?

A scrub on the pool completes with an average speed of 1.2GB/s. If I look at one of the WD drives during the scrub you can see the kind of read speed I’d expect to see:

Reading files from the pool (plus a quick write test) the WD drive has a really low read speed:

For testing I’m just dd’ing a large file for ease, but I get the same speed if I copy a file to a local machine via SMB (this is how I noticed the issue: the speed of copying a file over the network) or if I cp the file locally to another pool (the boot pool, which is on an NVMe SSD)
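For anyone wanting to reproduce this kind of test, a minimal sketch of the dd read benchmark looks like this (the path is hypothetical; note that drop_caches does not empty the ZFS ARC, but reading a 50GB+ file that’s far larger than the cache sidesteps most caching effects anyway):

```shell
# Flush the Linux page cache first (this does NOT flush the ZFS ARC,
# but a 50GB+ file is too large to be fully cached regardless).
echo 3 > /proc/sys/vm/drop_caches

# Sequential read of a large file, discarding the data.
# Path and block size are examples, not the exact command used here.
dd if=/mnt/tank/media/bigfile.mkv of=/dev/null bs=1M status=progress
```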

Huh… wonder if it is some crazy fragmentation going on? Any chance you got encryption turned on? That’s about the only other thing I could think of… I don’t think they make SMR drives in those sizes :stuck_out_tongue:

Oh yeah, I didn’t think about fragmentation. Yes, the pool is encrypted. The server is fairly old and has an Intel Xeon E5-2650 v4, but none of the cores get above 10% when copying at the moment, so I don’t think encryption should be a bottleneck?

With fragmentation I tried copying some of the files (using rsync) to a new dataset but the performance was the same.

I did notice something strange though. Since my dataset mainly contains large files I have the recordsize at 1M, so for a test I created another dataset with recordsize 128K and the sequential read performance… increased? Only up to 350MB/s though, exhibiting the same behaviour as before just with a higher cap. This really confuses me; why would decreasing the recordsize increase the sequential read speed?
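The recordsize experiment above would look something like this (pool and dataset names are made up; recordsize only applies to newly written blocks, so the test file has to be copied in after the property is set):

```shell
# Create a test dataset with a smaller recordsize (names are hypothetical).
zfs create -o recordsize=128K tank/rs128k-test

# recordsize only affects blocks written after it is set,
# so copy the large file into the new dataset for the comparison.
cp /mnt/tank/media/bigfile.mkv /mnt/tank/rs128k-test/
dd if=/mnt/tank/rs128k-test/bigfile.mkv of=/dev/null bs=1M status=progress
```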

Beats me - but at least encryption & fragmentation kinda, somewhat, sorta explain the weird request differences?

No clue what to actually do about it to help fix the issue, but hopefully this is the right direction?

Thanks yeah, it’s making me think there is something odd about how these files are stored, but I don’t understand why newly written files have the same problem, and why only the WD drives seem to complain about this.

I’ve managed to dramatically increase the speed though: by setting zfs_vdev_read_gap_limit to 1M and zfs_vdev_aggregation_limit to 2M I can get 700MB/s read via dd, and SMB is at around 550MB/s (I haven’t researched why there is now this difference; it might be an SMB client problem, so I’ll use the dd figures as the benchmark for now)
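For anyone following along, these are OpenZFS module parameters that can be changed at runtime under /sys/module/zfs/parameters. Values are in bytes and revert on reboot; a modprobe.d entry persists them. A sketch:

```shell
# Runtime tuning; values are in bytes and revert at reboot.
echo 1048576 > /sys/module/zfs/parameters/zfs_vdev_read_gap_limit      # 1M
echo 2097152 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit   # 2M

# To persist across reboots (e.g. in /etc/modprobe.d/zfs.conf):
#   options zfs zfs_vdev_read_gap_limit=1048576 zfs_vdev_aggregation_limit=2097152
```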

It’s better but not optimal, plus it’s really bugging me why there is a huge difference in IO stats for the WD/Seagate drives when reading.

Had a bit more spare time to play around with this and I’ve discovered some more things. I tried disabling NCQ on the WD drives and the performance absolutely tanked to rock bottom, so don’t do that. But that got me thinking: if disabling queueing had such a negative impact on performance, what about if we could queue more?
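(For reference, one common way to toggle this per drive is via the SCSI queue_depth attribute; sdX below is a placeholder, and a depth of 1 effectively disables NCQ:)

```shell
# sdX is a placeholder for the actual device.
cat /sys/block/sdX/device/queue_depth         # typically 32 with NCQ active
echo 1  > /sys/block/sdX/device/queue_depth   # depth 1 effectively disables NCQ
echo 32 > /sys/block/sdX/device/queue_depth   # restore full queueing
```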

In my iostat output you can see the WD drives are stuck at a queue size of 3, which is far less than the device’s queue depth of 32. I’m assuming this is limited by zfs_vdev_async_read_max_active, which is set to 3. So I increased this to 12 and reads are now 0.9-1GB/s, and the iostat output no longer shows the WD drives at 100% util; util, requests per second and latency are pretty much even across all the drives, both WD and Seagate, with the WDs now marginally faster.
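The tunable change itself is a one-liner (runtime only; write the default of 3 back to revert):

```shell
# Allow more concurrent async reads per vdev (OpenZFS default is 3; reverts at reboot).
echo 12 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_read_max_active   # verify
```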

I’m wondering if these WD drives have some weird firmware quirk? Perhaps the way they internally reorder reads expects higher queue saturation or something?


No clue on my end; only person on the forums that gets deep into benchmarking is @NickF1227 , maybe he can give some directions


In this case, it sounds like @mrtachyon is on to something already. Albeit, messing with the I/O scheduler isn’t something I’ve spent a whole lot of time doing. Thar be dragons!

The iostat shared from (presumably) before tuning is interesting. It is reading a lot more data per request from the WD drives than the Seagate ones.

Curious variable I should ask… was this pool created with the WD drives, and the Seagate drives added later with RAIDZ expansion?

Can you revert back to the default zfs_vdev_async_read_max_active value, run zpool iostat -yr 15 and reproduce the same test using the same file you’d used previously?


Thanks, no, this is a fresh pool created with all the drives in one go. The Seagate drives have been used previously though; the WD ones are brand new. The output from this after reverting that param is:

xxx              sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size         ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
-------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K               0      0      0      0  1.62K      0      0      0      0      0      0      0      0      0
256K               0      0      0      0      0     27      0      0      0      0      0      0      0      0
512K               0      0      0      0      0     48      0      0      0      0      0      0      0      0
1M                 0      0      0      0      0    280      0      0      0      0      0      0      0      0
2M                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                 0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                0      0      0      0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------------------------------

Although even after changing zfs_vdev_async_read_max_active back to my modified value of 12, this output doesn’t change. As an aside, after some tweaking I found 16 to offer a slight improvement, but any more than this didn’t seem to do anything.

This also led me down another rabbit hole, as ZFS is issuing 1M IOs but iostat never reports anything above 512K. From some reading it looks like the kernel will almost always split IOs that exceed /sys/block/&lt;dev&gt;/queue/max_segments * page size (4KiB). My disks report they “prefer” 16M IOs for sequential ops (/sys/block/&lt;dev&gt;/queue/optimal_io_size). max_segments on my system is 128, which results in a limit of 512K (128 * 4KiB).

After some more digging I found that max_segments is set by the HBA driver, which has a default value of 128 in mpt3sas, whereas the default for a generic SCSI driver (and Broadcom’s newer HBA driver) is 2048. I don’t really understand the detail at this point, but it’s something to do with DMA operations, and it looks like 128 is a safer value as some old architectures don’t support more and increasing it consumes more memory. However, I don’t think this applies to modern x86/x64 systems, and it appears the mpt3sas driver will let you increase it to 2048 by setting the max_sgl_entries parameter. Just to make it more complicated, apparently you’ll sometimes get IOs bigger than 512K anyway if the memory allocated for the buffer happens to be physically contiguous.
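The limits described above can be inspected, and the mpt3sas scatter-gather limit raised, roughly like this (sdX is a placeholder; max_sgl_entries is read at module load, so it needs a reboot or driver reload to take effect):

```shell
# Inspect the splitting limits for one disk (sdX is a placeholder).
cat /sys/block/sdX/queue/max_segments   # 128 with mpt3sas defaults
getconf PAGESIZE                        # 4096 on x86-64

# Largest guaranteed-unsplit I/O: 128 segments * 4096 bytes = 524288 = 512K

# Raise the mpt3sas scatter-gather list size (applies at module load,
# so reboot or reload the driver afterwards), e.g. in
# /etc/modprobe.d/mpt3sas.conf:
#   options mpt3sas max_sgl_entries=2048
```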

This all might be irrelevant and totally off track but when I get some more free time I’ll have a play around with this and see if it has any impact whatsoever.

Okay. This output basically was just to prove that there was not something unexpected with the blocks you had written to disk, as you’d suspected. Since we are reading 128K blocks with some larger req_size in aggregations, it looks normal.

Now, with the change still reverted, can you also check

zpool iostat -yvr 15

This will print the same chart, but for each individual disk. It should tell us if ZFS is for some reason treating the disks differently.

Here’s the output from that. I’ve renamed the partition UUIDs to match the type of device so you can better see which is which:

raidz2-0                                    sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      4      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0  1.46K      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0     22      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0     49      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0    249      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------


se1                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      1      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0    309      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      3      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      6      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     13      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

se2                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0    320      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      2      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      5      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     12      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

se3                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0    301      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      2      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      5      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     16      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

se4                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0    317      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      3      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      6      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     11      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

wd1                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0     65      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      2      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      6      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     47      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

wd2                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0     44      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      2      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      5      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     51      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

wd3                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0     50      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      2      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      5      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     51      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

wd4                                         sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size                                    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8K                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
32K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
128K                                          0      0      0      0     80      0      0      0      0      0      0      0      0      0
256K                                          0      0      0      0      0      3      0      0      0      0      0      0      0      0
512K                                          0      0      0      0      0      7      0      0      0      0      0      0      0      0
1M                                            0      0      0      0      0     44      0      0      0      0      0      0      0      0
2M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M                                            0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M                                           0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------------------------------------

This is almost certainly caused by some pretty different firmware implementations between these two drive models. I’ve never seen such an extreme difference when comparing two spinning drives with similar specs.

I am reading this output as “the WD drives are responding faster, so give them more work”. That matches what we see: 100% util on those drives in iostat, and much better aggregation behaviour for them in zpool iostat.

If you run lsblk -o NAME,RQ-SIZE, do they all have the same RQ-SIZE? What about queue depth? (cat /sys/block/sdXX/device/queue_depth)

I would expect these to be the same for both drive types if they are connected through the same HBA, but it’s worth checking.
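Something like this would show both values side by side for every member disk (sda..sdh are placeholders; substitute your actual device names — RQ-SIZE in lsblk corresponds to queue/nr_requests in sysfs):

```shell
# Print request queue size and SCSI queue depth for each pool member.
# Missing sysfs entries (e.g. on non-SCSI devices) print "n/a".
for d in sda sdb sdc sdd sde sdf sdg sdh; do
    printf '%s rq-size=%s depth=%s\n' "$d" \
        "$(cat /sys/block/$d/queue/nr_requests 2>/dev/null || echo n/a)" \
        "$(cat /sys/block/$d/device/queue_depth 2>/dev/null || echo n/a)"
done
```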

Do you see the same issues with writes?

They all have the same RQ-SIZE (256) and queue_depth (32). I did spend some time comparing the two types of drive and they do have some differences:

  • WD Supports ATA ACS-5 vs Seagate ATA ACS-4
  • WD Supports NCQ Priority, NCQ Streaming and NCQ Non-Data Commands whereas the Seagate doesn’t
  • WD has 512MB cache vs Seagate 256MB
  • WD has 9 platters vs Seagate 10 platters
  • WD has a 64GB NAND chip for internal use (not data)

I’ve also tried connecting the drives to the onboard RAID controller in AHCI mode just to test but it experiences exactly the same issue.

After resetting all the ZFS parameters to defaults, I can get 1-1.2GB/s reads by setting only zfs_vdev_async_read_max_active to 32 and changing nothing else. However, if I reset it to the default of 3, the speeds drop back down to 240MB/s.
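For anyone wanting to reproduce this: OpenZFS module parameters can be changed at runtime under /sys/module/zfs/parameters (requires root; to persist across reboots, set it in /etc/modprobe.d instead). A sketch of the toggle described above:

```shell
# Current value (OpenZFS ships with a default of 3):
cat /sys/module/zfs/parameters/zfs_vdev_async_read_max_active

# Raise the async read queue depth per vdev, then re-run the read test:
echo 32 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active

# Restore the default:
echo 3 > /sys/module/zfs/parameters/zfs_vdev_async_read_max_active
```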

Interestingly, I get suspiciously neat throughput figures depending on the value of zfs_vdev_async_read_max_active:

zfs_vdev_async_read_max_active    Individual WD read speed
1                                 10 MB/s
2                                 20 MB/s
3                                 30 MB/s
4                                 40 MB/s
5                                 50 MB/s

etc.

The total read throughput on the pool is then 8x the individual WD read speed. These neat values continue until around 10-12, when the read speeds start fluctuating. It’s almost as if something is throttling each individual IO request to prevent a single request from monopolising the drive.

This is only on reads; I don’t have the same issue with writes - they are fine.

I also can’t reproduce this behaviour with fio. If I read-test one of the WD drives directly using fio, I easily get 270MB/s+ on an individual drive fully sequential, even with iodepth=1 and 128K blocks. Adjusting the iodepth and block size, I get expected performance across the range. Even if I try to simulate a fragmented disk by doing short sequential reads at random offsets across the disk, I still get 140MB/s, again with iodepth=1. I’ve tried various sequential/random reads and I just can’t reproduce the behaviour of the disk when it’s in the vdev. The closest I got was fully random 128K reads with iodepth=1, which hovered around 10MB/s, but it wasn’t as consistent and didn’t scale with iodepth (maxing out at 22MB/s).
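For reference, the direct-drive baseline test described above looks something like this (/dev/sdX is a placeholder for the drive under test; --readonly guards against accidental writes to a pool member):

```shell
# Sequential 128K reads at queue depth 1 straight off one member disk,
# bypassing the page cache with O_DIRECT:
fio --name=seqread --filename=/dev/sdX --readonly \
    --rw=read --bs=128k --iodepth=1 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based
```

Swapping --rw=read for --rw=randread (and varying --bs / --iodepth) covers the random and mixed cases mentioned above.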

I ran the exact same fio tests on a Seagate Exos drive and got very similar results, only fractionally below the WD.

It seems like it’s something specific to either how ZFS is issuing the reads or the pattern of reads that is causing the problem.