Large media files have slow read performance. Would adding a special metadata VDEV improve this?

What you’re seeing here is not a “slowdown” of the pool, but rather that for a period of time you’re reading the data out of ARC instead of off the disks. That’s why you see demand_data and prefetch_data jump up just after demand_metadata.
This is then followed by the ARC hit ratio dropping and the performance decreasing.
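
If you want to watch those counters directly rather than through the reporting graphs, they’re exposed as sysctls on CORE. A rough sketch (names assume a FreeBSD-based build of OpenZFS):

# cumulative hit/miss counters behind the demand/prefetch graphs
sysctl kstat.zfs.misc.arcstats | grep -E 'demand_data|demand_metadata|prefetch_data'
# overall hits vs. misses, for a quick hit-ratio estimate
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses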

Looking at your output from zpool iostat, I’m a bit puzzled. It looks more like you are WRITING to the pool, not reading from it. Are you moving a file on the TrueNAS and writing it back to the TrueNAS, or something?

To be honest, I’m still confused by the ARC hit ratio decreasing while it’s reading data. And I’m not quite sure what you mean.

From what I understand, the ARC (RAM) is generally used for write caching. Since I’m not writing any files but instead reading a 100GB file from the server, it shouldn’t be interacting with the RAM, especially since the file doesn’t fit into it. Or could this be related to metadata? Would using an L2ARC for metadata help in this situation? Others have mentioned that it wouldn’t affect a one-time read of large files, and I tend to agree.

I also don’t understand why the system can sustain 10GbE speeds for about a minute before dropping off. As far as I know, the file is being read directly from the disk pool without an intermediary cache. (If only ZFS supported this.)
This would imply the disk pool is capable of supplying the data at 10GbE speeds.

The iostat output is what I got when I entered your command in the shell. If you meant for me to run it while copying the 100GB file or if I need a different command, please let me know.

In other words, some of the data of the file in question was already in RAM. Some more of that data was prefetched because ZFS tried to intelligently help. Then, at some point in the file transfer, it could not prefetch fast enough and the data was read from disk. That’s when your slowdown occurs.

ARC is exclusively a read cache.
Performance tuning — openzfs latest documentation

ZFS 101—Understanding ZFS storage and performance | Ars Technica

Because you are reading from RAM in the ARC cache.

I meant for you to run that command while you were copying a large 100GB file from the TrueNAS to your local desktop.
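
Something along these lines, sampled while the transfer is running (a sketch of the kind of invocation meant here; the per-vdev and latency columns in the output further down suggest -v and -l):

# -v per-vdev breakdown, -l latency columns, -y skip the since-boot summary, sample every 5 seconds
zpool iostat -vly LAYER01 5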

It’s being read from ARC first, then hitting the disks. ARC is that intermediary cache, and ZFS does natively support it.

No - RAM is not used for write caching per se. By default, TrueNAS will buffer up to 5 seconds of writes (up to a GB-scale limit - not sure how much) in memory as a transaction group, which is then flushed to disk - and that is the extent of “write caching”. The rest of ARC (Adaptive Read Cache) is used for caching frequently used data.
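
If you want to see those limits on a CORE box, they’re visible as sysctls (a sketch; names assume FreeBSD-style OpenZFS):

# how often a transaction group is forced out to disk, in seconds
sysctl vfs.zfs.txg.timeout
# cap on dirty (not-yet-flushed) data held in RAM
sysctl vfs.zfs.dirty_data_max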

Adaptive Replacement Cache, and as noted is not exclusively a read cache.

Incoming data enters ARC before it hits the disk.

Thank you, quite an eye-opener.
I reran your command while picking a ‘fresh’ large file so TrueNAS would not have known to cache it.

This is the output, and it’s extremely disappointing to say the least.
HDDs should be capable of 10 times that; I’d settle for 4 times.

                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
LAYER01                                         83.0T  91.6T  2.75K    122   289M  12.2M    1ms    7ms    1ms    1ms   48us  927ns  424us    6ms      -      -
  raidz2-0                                      83.0T  91.6T  2.75K    122   289M  12.2M    1ms    7ms    1ms    1ms   48us  927ns  424us    6ms      -      -
    da9p2                                           -      -    242      9  25.0M  1.01M    1ms    6ms    1ms    1ms   61us    7us  458us    5ms      -      -
    gptid/426115c5-f294-11ee-9f4e-a0369fbb5fdc      -      -    229      9  23.6M  1.01M    2ms    6ms    1ms    1ms   58us  456ns  475us    5ms      -      -
    gptid/41d0772f-f294-11ee-9f4e-a0369fbb5fdc      -      -    231     10  23.8M  1.01M    1ms    6ms    1ms    1ms   47us  289ns  437us    5ms      -      -
    gptid/4246b2f6-f294-11ee-9f4e-a0369fbb5fdc      -      -    241     10  24.9M  1.01M    1ms    7ms    1ms    1ms   35us  271ns  311us    6ms      -      -
    gptid/4241a174-f294-11ee-9f4e-a0369fbb5fdc      -      -    229     10  23.6M  1.01M    1ms    8ms    1ms    1ms   55us  433ns  482us    6ms      -      -
    gptid/423d2151-f294-11ee-9f4e-a0369fbb5fdc      -      -    231     10  23.8M  1.01M    1ms    6ms    1ms    1ms   44us  264ns  404us    5ms      -      -
    gptid/423fd4a2-f294-11ee-9f4e-a0369fbb5fdc      -      -    242     10  24.9M  1.01M    1ms    7ms    1ms    1ms   39us  337ns  332us    6ms      -      -
    gptid/4236038d-f294-11ee-9f4e-a0369fbb5fdc      -      -    229     10  23.6M  1.01M    2ms    8ms    1ms    1ms   63us  453ns  422us    7ms      -      -
    gptid/425a4c99-f294-11ee-9f4e-a0369fbb5fdc      -      -    231     10  23.8M  1.01M    1ms    7ms    1ms    1ms   44us  301ns  394us    6ms      -      -
    gptid/423800ba-f294-11ee-9f4e-a0369fbb5fdc      -      -    241     10  24.9M  1.01M    1ms    7ms    1ms    1ms   37us  399ns  350us    6ms      -      -
    gptid/426374db-f294-11ee-9f4e-a0369fbb5fdc      -      -    229     10  23.6M  1.02M    1ms    7ms    1ms    1ms   51us  333ns  586us    6ms      -      -
    gptid/425e5d34-f294-11ee-9f4e-a0369fbb5fdc      -      -    231     10  23.8M  1.02M    1ms    7ms    1ms    1ms   44us  342ns  446us    6ms      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                       3.04G   213G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
  ada0p2                                        3.04G   213G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

It looks to me like the bottleneck here is IOPS.
Each of your disks is showing over 200 IOPS, with an aggregate a bit shy of 3,000. This is about the most that can be expected from a hard drive.

There’s a fundamental difference between IOPS and throughput. You have an IOPS bottleneck, which manifests as reduced throughput compared to what would otherwise be possible.

Spinning hard drives have to spin around at 7200 RPM to find a sector to read back. There’s a latency cost.
IOPS - Wikipedia

Having 2 VDEVs would double your IOPS potential, and therefore your throughput would increase.
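
A rough back-of-the-envelope from the numbers above (assuming a 1M recordsize, which you confirm further down):

  289 MB/s ÷ ~2,816 read IOPS ≈ 103 KB per read
  ~24 MB/s ÷ ~230 IOPS per disk ≈ 104 KB per disk I/O
  1 MiB record ÷ 10 data disks in a 12-wide RAIDZ2 ≈ 105 KB per disk

So every disk is already doing about as many seeks per second as a 7200 RPM drive can; the small per-I/O size, not the platter transfer rate, is what caps the throughput.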

Oops

Thank you for your help in identifying the issue.

A bit jarring, seeing as it’s one large sequential file and I need 12 disks in RAIDZ2 just to match one standalone drive.

Time to add 36 more disks I guess… :smile:

A VDEV width >8 on Z2 is really not very good, and you only have 1 VDEV.

With the same number of disks you have, but configured as 2 six-wide RAIDZ2 VDEVs, you’d not only have MORE performance but also far more RELIABILITY.
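
Roughly, with the same 12 disks:

  1 × 12-wide RAIDZ2: 10 data + 2 parity, the IOPS of a single VDEV, tolerates any 2 disk failures in the pool
  2 × 6-wide RAIDZ2:  8 data + 4 parity, the IOPS of two VDEVs, tolerates 2 disk failures per VDEV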


R2-C2 (jro.io)

I’ve taken a look at that website and it’s very helpful.
In the future I might go for a 4x6 or 3x8 setup, both with 2-drive parity per VDEV.

4 drives of parity (33%) is more than I’m willing to sacrifice at this moment.
The data is not that sensitive, just a major inconvenience if lost.

It bothered me that, with the same amount of I/O, I could write at nearly 10GbE speeds but could only read at a fraction of that.

However, after adding a few tunables, the issue seems to be resolved. I’m now getting read speeds of 700-1000MB/s.

Glad I could resolve this. :grin:

                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
LAYER01                                         84.2T  90.4T  6.25K    102   870M  10.7M   10ms   10ms  935us    2ms  107ms    1us    9ms    8ms      -      -
  raidz2-0                                      84.2T  90.4T  6.25K    102   870M  10.7M   10ms   10ms  935us    2ms  107ms    1us    9ms    8ms      -      -
    da9p2                                           -      -    566      8  78.5M   911K    7ms   10ms  857us    2ms  172ms  716ns    6ms    8ms      -      -
    gptid/426115c5-f294-11ee-9f4e-a0369fbb5fdc      -      -    503      8  68.1M   909K   14ms    9ms  953us    2ms   98ms  424ns   13ms    7ms      -      -
    gptid/41d0772f-f294-11ee-9f4e-a0369fbb5fdc      -      -    516      8  70.9M   907K    9ms    9ms  961us    2ms   67ms  392ns    8ms    7ms      -      -
    gptid/4246b2f6-f294-11ee-9f4e-a0369fbb5fdc      -      -    569      8  78.5M   909K    8ms   10ms  944us    2ms  155ms  364ns    7ms    9ms      -      -
    gptid/4241a174-f294-11ee-9f4e-a0369fbb5fdc      -      -    507      8  68.1M   910K   13ms   10ms  991us    2ms   95ms  356ns   12ms    8ms      -      -
    gptid/423d2151-f294-11ee-9f4e-a0369fbb5fdc      -      -    532      8  70.9M   910K    8ms   11ms  890us    2ms   81ms  356ns    7ms    9ms      -      -
    gptid/423fd4a2-f294-11ee-9f4e-a0369fbb5fdc      -      -    581      8  78.5M   911K    7ms   11ms  860us    2ms   94ms  352ns    6ms    9ms      -      -
    gptid/4236038d-f294-11ee-9f4e-a0369fbb5fdc      -      -    512      8  68.2M   909K   10ms   10ms    1ms    2ms  105ms  324ns    9ms    9ms      -      -
    gptid/425a4c99-f294-11ee-9f4e-a0369fbb5fdc      -      -    524      8  71.0M   909K   10ms    9ms  960us    2ms   60ms  368ns    9ms    7ms      -      -
    gptid/423800ba-f294-11ee-9f4e-a0369fbb5fdc      -      -    556      8  78.5M   911K   10ms   11ms  958us    2ms   97ms   18us    9ms    9ms      -      -
    gptid/426374db-f294-11ee-9f4e-a0369fbb5fdc      -      -    504      8  68.1M   912K   15ms   10ms  955us    2ms  102ms  352ns   14ms    9ms      -      -
    gptid/425e5d34-f294-11ee-9f4e-a0369fbb5fdc      -      -    524      8  70.9M   911K    9ms   11ms  894us    2ms  136ms  356ns    8ms    9ms      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                       3.04G   213G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
  ada0p2                                        3.04G   213G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

Tunables:

vfs.zfs.vdev.sync_write_min_active=10
vfs.zfs.vdev.sync_write_max_active=10
vfs.zfs.vdev.sync_read_min_active=10
vfs.zfs.vdev.sync_read_max_active=10
vfs.zfs.vdev.async_read_min_active=1
vfs.zfs.vdev.async_read_max_active=1
vfs.zfs.vdev.async_write_min_active=1
vfs.zfs.vdev.async_write_max_active=10
vfs.zfs.zfetch.max_distance=2147483648
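
If anyone wants to try the same, the values can be tested live from a shell first and then added as sysctl-type Tunables so they persist across reboots (a sketch; test against your own workload before committing):

sysctl vfs.zfs.zfetch.max_distance=2147483648
sysctl vfs.zfs.vdev.async_write_max_active=10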

Good. Do you have documentation links to share about these tunables?

The ones changed from the CORE defaults are:

vfs.zfs.vdev.async_read_max_active=1 (default 3)
vfs.zfs.vdev.async_write_max_active=10 (default 5)
vfs.zfs.zfetch.max_distance=2147483648 (default 67108864)

Reducing the number of async read I/Os is unusual, but perhaps it is helping the prefetcher. Increasing max_distance, though, allows each read stream to prefetch much further ahead (2G vs 64M), so that’s probably the pivotal piece. However, you might see a significant impact on more random I/O here: if your disk is busy with 2G of speculative prefetch, a non-sequential I/O may end up “behind that” in the queue. :stuck_out_tongue:
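
If you want to sanity-check whether the prefetcher is keeping up after a change like this, its hit/miss counters are exposed as well (a sketch; names assume CORE/FreeBSD sysctls):

# speculative prefetch hits vs. misses since boot
sysctl kstat.zfs.misc.zfetchstats.hits kstat.zfs.misc.zfetchstats.misses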

The “documentation” came from posts on Reddit and other forums where people had the same problem I did and used these settings.
https://old.reddit.com/r/zfs/comments/cjck6g/slow_read_but_fast_write_performance/

Can’t find the max_distance reference right now, but it’s out there.

I’m fine with random I/O being slower. This server mainly handles media files over 10GB in size; the smaller files are rarely accessed and are there just for backup purposes.

It would be great if TrueNAS could add some documentation for optimizing servers that handle large media files, as these are becoming more popular.
Or have auto-tunables detect this and optimize accordingly.
Or provide a setup option where you’re asked how the server will be used - large media, VMs, databases, etc.

Just out of interest, what’s your block size on the dataset? If it’s small and you mostly deal with large files, create a test dataset with 1MB and test - does performance get better? It’s just about doing more with the limited IOPS you have. Latency will go up, of course, but that’s generally not a concern for large file streams.
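
Something like this would do for the test (a sketch; the dataset and file names are just placeholders):

# test dataset forcing a 1M record size
zfs create -o recordsize=1M LAYER01/rstest
# copy one of the large media files in, then time a sequential read back
# (pick a file that hasn't been read recently, so ARC doesn't already hold it)
dd if=/mnt/LAYER01/rstest/bigfile.mkv of=/dev/null bs=1M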

1MB blocksize
LZ4 compression
sync disabled
atime off
dedup off
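
For anyone wanting to check the same properties on their own dataset, something like this would show them (the dataset name is a placeholder):

zfs get recordsize,compression,sync,atime,dedup LAYER01/media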

I’m satisfied with the performance since setting the tunables.
I’d imagine setting the max_distance to 10GB+ could help, but I believe the current setting is the maximum TrueNAS allows.