Large media files have slow read performance. Would adding a special metadata VDEV improve this?

When transferring large video files (10-100 GB) over a 10GbE connection, the speed tends to slow down, fluctuating between 200 and 1000 MB/s.

I’m considering whether a dedicated metadata vdev might improve performance in my situation. I’m not concerned with the performance of small files; my main focus is achieving sustained read speeds for larger files. These files will only be accessed once, so using an L2ARC won’t be beneficial, as multiple reads won’t occur to build it up.

Below is an image of the ZFS filesystem during a 100GB file transfer. As shown, the ARC hit rate sometimes drops to 60%.

I’m running a 12-drive RAIDZ2 pool in a single VDEV configuration on a Xeon system with 96GB of ECC RAM, with only one user accessing it.

The only real way to speed up your access would be to add VDEVs. 12-wide is pretty wide, and the performance penalty you are seeing is likely due to the width of that single VDEV. I’d suggest 2 VDEVs of 6 drives would improve your performance, but it would come at a capacity cost. However, a metadata VDEV would not help here.

I opted for a single VDEV setup to ensure more redundancy in case two drives fail. Having two VDEVs isn’t really feasible for me.

The screenshot indicates a high peak in demand_metadata. From my understanding, a special metadata VDEV would allow metadata to be accessed ‘instantly,’ eliminating metadata latency when reading blocks from the HDDs.

Given this, I’m curious why a special VDEV wouldn’t help in this situation. Am I misunderstanding the role of metadata, where it’s only read once at the start of accessing a file? Would this mean a special VDEV is only beneficial for handling many small files and not for a single large file?

A special vdev is pool critical and cannot be removed. What might work better, and be cheaper in resources for your purposes, is an L2ARC set as metadata only.

It’s not pool critical and can be removed if it doesn’t achieve your objectives.
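Setting it up (and backing it out again) is only a couple of commands; a rough sketch below, with the pool, dataset, and device names as placeholders for your own:

# add the SSD as an L2ARC (cache) device - "nvd0" is a placeholder for your SSD
zpool add poolname cache nvd0
# keep only metadata from this dataset in the L2ARC
zfs set secondarycache=metadata poolname/dataset
# if it doesn't help, take it back out without affecting the pool
zpool remove poolname nvd0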

However, I tend to agree that in the circumstances posted, neither solution will achieve a speed increase.

How does it look when you re-read the same file from cache?

A 12-wide raidz2 should have enough IOPS to read one file.

Also remember that any special vdev is pool critical, so losing it means your data is toast.
Therefore it should have the same redundancy as your pool, so you would need a 3-way mirror.
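If you did go that way, it would be something along these lines (just a sketch; the pool and device names are placeholders):

# special vdev as a 3-way mirror, matching the two-disk fault tolerance of raidz2
zpool add poolname special mirror ssd0 ssd1 ssd2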

Thank you. I’m going to explore using L2ARC for metadata first.

I’m aware of the special vdev needing redundancy and was prepared to go the 3- or 4-way mirror route. I’d prefer the L2ARC option though, since importing/transferring the HDDs to another system would be less stressful.

Currently, the first 60-70 GB of a file transfer is quite fast (750+ MB/s), but then the speed drops to 200-300 MB/s. It’s not a huge issue, but it doesn’t make sense to me, and I would like to resolve it.

Right now it’s in the middle of something so in a few days I hope to give an update.

How full is your pool?

You will need to populate the L2ARC first. This does NOT happen automatically.
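A crude way to warm it, assuming the pool is mounted under /mnt, is to walk the directory tree a few times so the metadata actually gets read and becomes eligible for the L2ARC:

# repeat a few times; adjust the path to your pool's mountpoint
find /mnt/poolname -ls > /dev/null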

Also - your speeds - is that read or write to the pool?


The sVDEV consists of 2 partitions: 75% of its space is allocated to small files by default, and another 25% to metadata (this can be changed based on your use case).

Speeding up metadata helps with directory traversal, finding files, and similar tasks. For example, I found a metadata-only, persistent L2ARC as well as a sVDEV very helpful for speeding up rsync tasks. Neither has a realistic impact on the transfer rate of large files.

The small files aspect of a sVDEV may have an impact on your read / write speed in case said pool intersperses such files with the larger files you’re also reading/writing. Since sVDEVs are traditionally built using SSDs, small file transactions happen a lot faster. That can have a big impact if there are a lot of small files to deal with - either via the recordsize settings and/or the small file cutoff setting for each dataset.

That is the reason why App pools, virtual machines, etc. used to be housed in separate SSD pools from the main data repositories consisting of HDDs. With a sVDEV, you can tailor each dataset - and some of them can reside 100% in the sVDEV. This ‘fusion’ approach is a great way to make better use of all drives in a NAS, with the caveat that the sVDEV is essential to pool health and has to be at least as redundant as the data drives. I run a 4-wide sVDEV mirror consisting of Intel S3610 SSDs with ridiculous limits re: writes-per-day.

You can see if the L2ARC set to metadata-only and persistent has any impact, provided your NAS has at least 64GB of RAM. Pretty much any decent SSD will do. Since the L2ARC is redundant, a loss of the SSD will not affect pool health (just performance). I found that I needed at least three complete directory traversals before all metadata found its way into the L2ARC (that has to do with the rate with which L2ARC is filled with metadata - the system limits writing ‘missed’ metadata to the L2ARC).
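If you want to see how the warm-up is progressing, you can check how much of the cache device has filled between traversals; something like this (pool name is a placeholder):

# the cache device is listed at the bottom with its own alloc/free columns
zpool list -v poolname
# L2ARC size and hit statistics
arc_summary | grep -i -A 20 l2arc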

See the sVDEV planning resource for more info.


The pool is currently at 48% capacity, with fragmentation at 3%.
The issues are read-related. The receiving end (NVMe) is more than capable, as the VDEV with 12 (10) HC550 drives should be.

I know I need to populate the L2ARC using a command at the start, but my understanding is that this only caches a file index and some small files depending on the settings. It doesn’t cache the location of all the data blocks? This might be useful in some scenarios, but not in mine.

What I essentially need is for all block locations of every file to be cached. Otherwise, how would this approach improve a sequential file transfer that requires faster metadata access, especially in the middle or near the end?

Seeing the hit ratio dip during a long single file transfer while no one else is using the system is very strange to me.

Large file transfers won’t benefit from caching the metadata.

That would only help if you’re transferring or crawling many small files.

Any idea why the hit ratio would drop during a single file transfer, when nothing else is requesting I/O?

I think you’re out of luck with this one.

That easily surpasses your physical RAM, and you even said these specific files would not be “repeat” transfers. So you’re at the mercy of the speed of your vdev.

Technical details here!

sVDEVs aren’t partitioned in the actual “disk partition” sense, but rather there’s a preconfigured 25% “buffer space” for metadata - prior to a write, ZFS checks for the allocated space and ensures that it’s below the limit.

This also means that in a scenario where you have 25% of your drive full of metadata already and you enable special_small_blocks, it’s going to cut you off when 50% of the drive size has been allocated for small files (less any new metadata). Only metadata is allowed to eat into that 25% buffer.
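Since the small-file cutoff is a per-dataset property, tuning it is just a zfs set; an illustrative example (names and threshold are placeholders, not a recommendation):

# send blocks of 64K or smaller from this dataset to the special vdev
zfs set special_small_blocks=64K poolname/dataset
# review the current cutoff and recordsize
zfs get special_small_blocks,recordsize poolname/dataset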

@Videodrome an L2ARC set to metadata-only might help here, if that’s truly the bottleneck point, but you might also be hitting a case where prefetch is somehow not keeping up. A single thread of copies should be an ideal workload for it though. :thinking:

I saw you mention a “Xeon with 96GB of RAM” - can you provide some more information about the system (board/storage controller?) and the version of TrueNAS you’re running?


Thank you for the correction and I’ll incorporate that into the sVDEV guide. I hope the rest was ok.


Intel(R) Xeon(R) CPU E5-2699 v3
Asrock X99WS
LSI 9300
TrueNAS-13.0-U6.2
Intel X540-T2 nic

I’ll add a 500GB NVMe L2ARC in a few days as a test to see if it improves performance. I might upgrade to 13.3 to utilize the zfs_arc_meta_balance setting.
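If I do go to 13.3, I assume that tunable gets set as a sysctl, something like the line below, though I haven’t verified the exact OID name on CORE (on Linux the module parameter is zfs_arc_meta_balance, default 500, where higher values favor keeping metadata in the ARC):

# assumed FreeBSD sysctl name for zfs_arc_meta_balance - verify before using
sysctl vfs.zfs.arc.meta_balance=4000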

No high hopes though.

They run very hot. Are you using a server case or an added fan?

Yes, they do.
However, temperature isn’t an issue.
I have five fans in the case, with an additional fan directed at this card and the NIC. The HDDs and backplane are in a separate enclosure with their own fans.

Just so that I’m clear, I’m not discouraging the use of RAIDZ2, I’m just saying that having a relatively large number of disks in a single VDEV is going to limit performance.

Can you please run zpool iostat -vvyl 360 1
You can adjust the 360 to be closer to the expected amount of time it would take to copy a file; I’m assuming about 6 minutes here, and the value is in seconds.

I believe you’re hitting the saturation point of your disks, and given the topology I’d expect to see some pretty high disk_wait times.

Here’s an example so you can see what the output looks like

root@rawht[~]# zpool iostat -vvyl 60 1
                                            capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                                      alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                 7.92G  46.1G      0     14  7.73K   138K  138us  259us  135us   45us    1us    1us  768ns  218us      -      -      -
  nvme4n1p3                               7.92G  46.1G      0     14  7.73K   138K  138us  259us  135us   45us    1us    1us  768ns  218us      -      -      -
----------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fire                                      1.08T   779G     24  2.30K   261K   153M    1ms    3ms    1ms  180us    1us    1us  459us    3ms      -      -      -
  raidz1-0                                1.08T   779G     24  2.30K   261K   153M    1ms    3ms    1ms  180us    1us    1us  459us    3ms      -      -      -
    1ae05125-1e1c-4ae8-a968-81ecccbfff1b      -      -      6    583  68.1K  38.1M    1ms    5ms    1ms  180us    1us    1us  534us    5ms      -      -      -
    bb2d8647-af62-4f29-a6bc-57eb94fc83dc      -      -      5    589  63.1K  38.1M    1ms    2ms    1ms  179us    1us    2us  675us    2ms      -      -      -
    2afcbef7-d767-48b6-b383-3a7210afc4f6      -      -      6    593  68.5K  38.1M    1ms    2ms    1ms  179us    1us    1us  508us    2ms      -      -      -
    396795ce-4af6-41bb-aa5f-0d591864db4d      -      -      5    588  61.7K  38.1M    1ms    2ms    1ms  182us    1us    1us   35us    2ms      -      -      -


This is the output for 3 minutes:

                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
LAYER01                                         84.8T  89.8T      0    116    454  11.7M   20ms    6ms   20ms    1ms  460ns  330ns      -    5ms      -      -
  raidz2-0                                      84.8T  89.8T      0    116    454  11.7M   20ms    6ms   20ms    1ms  460ns  330ns      -    5ms      -      -
    da9p2                                           -      -      0      9     45  1000K   25ms    6ms   25ms    1ms  384ns  810ns      -    5ms      -      -
    gptid/426115c5-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1001K   12ms    6ms   12ms    1ms  768ns  458ns      -    5ms      -      -
    gptid/41d0772f-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1001K   25ms    5ms   25ms    1ms  384ns  274ns      -    5ms      -      -
    gptid/4246b2f6-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1000K   25ms    5ms   25ms    1ms  384ns  241ns      -    5ms      -      -
    gptid/4241a174-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1001K   25ms    6ms   25ms    1ms  768ns  230ns      -    5ms      -      -
    gptid/423d2151-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9      0  1002K      -    6ms      -    1ms      -  237ns      -    5ms      -      -
    gptid/423fd4a2-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9      0  1001K      -    6ms      -    1ms      -  249ns      -    5ms      -      -
    gptid/4236038d-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1002K   12ms    6ms   12ms    1ms  384ns  266ns      -    5ms      -      -
    gptid/425a4c99-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1001K   12ms    6ms   12ms    1ms  384ns  348ns      -    5ms      -      -
    gptid/423800ba-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1000K   25ms    6ms   25ms    1ms  384ns  272ns      -    5ms      -      -
    gptid/426374db-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1000K   25ms    6ms   25ms    1ms  384ns  257ns      -    6ms      -      -
    gptid/425e5d34-f294-11ee-9f4e-a0369fbb5fdc      -      -      0      9     45  1000K   12ms    6ms   12ms    1ms  384ns  314ns      -    5ms      -      -

Before setting up this server, I used several small 2.5" HDDs to benchmark some configurations. Although the two-VDEV setup had better read performance, none of the tests showed any slowdowns over time.

Now, I’m starting to worry about temperature, even though I thought I had that covered. The slowdown after one minute of transferring data makes it plausible that things could be heating up to a critical point. I’m planning to add three temperature sensors to the HBA, NIC, and backplane heatsinks to monitor what’s happening.