L2ARC (Metadata Only) Disk Replacement

Does that work? There’s an issue on the OpenZFS GitHub repo requesting basically this as a feature: L2ARC metadata refresh by zpool scrub · Issue #16416 · openzfs/zfs · GitHub

Now I’m curious to understand scrub a bit better myself.

Only a portion of the data that’s eligible for eviction from the ARC is considered for the L2ARC. With a very large ARC, nothing hot may ever become eligible for eviction, so those subsequent reads just get served from the ARC and never make it to the L2ARC.

There are a couple of tunables that can help move more data to the L2ARC (l2arc_noprefetch, l2arc_headroom, l2arc_write_boost, and friends), but with enough RAM a big ARC may not see much eviction pressure. L2ARC metadata caching partially broken · Issue #15201 · openzfs/zfs · GitHub
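For anyone who wants to poke at these: on Linux they are OpenZFS module parameters, while FreeBSD/TrueNAS CORE exposes them as vfs.zfs.* sysctls, and exact names vary a little by version. A minimal sketch of inspecting and loosening them, with the values shown purely as illustrations rather than recommendations:

```
# Show the current L2ARC feed tunables (Linux module-parameter paths)
grep . /sys/module/zfs/parameters/l2arc_noprefetch \
       /sys/module/zfs/parameters/l2arc_headroom \
       /sys/module/zfs/parameters/l2arc_write_boost

# Let prefetched buffers into L2ARC, widen the scan headroom, and raise the
# post-boot write boost (128 MiB here, purely as an example value)
echo 0         > /sys/module/zfs/parameters/l2arc_noprefetch
echo 8         > /sys/module/zfs/parameters/l2arc_headroom
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
```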

This can cause significant read amplification - are you confident it’s improving things without side effects? A nice explanation here: metadata caching does not work as expected - repeatedly getting lost in arc and l2arc with primary|secondarycache=metadata · Issue #12028 · openzfs/zfs · GitHub

I’m just giving you my lived experience. The time it took to crawl my iTunes directories and files was very much a function of how many times the L2ARC had the opportunity to note a metadata “miss” and then keep that metadata handy for the next crawl.

I tried setting some of those tunables but they didn’t help much. Ultimately iXsystems came to the rescue when they allowed the L2ARC to become persistent.
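(For what it’s worth, persistent L2ARC is governed by the l2arc_rebuild_enabled module parameter, which defaults to on in OpenZFS 2.0 and later; a quick way to confirm it, Linux path shown for illustration:)

```
# 1 = rebuild L2ARC contents from the cache device headers on pool import
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
```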

2 Likes

How? I want it to read from the drives. (SSDs, in my case.)

The point is that it should always read from the drives[1].


Let’s say my “seeding” dataset is on an SSD-only pool, and it contains 1 TiB of data.

Constantly seeding with a torrent application will have random reads from anywhere in this 1-TiB slab of data. Not only are the reads random, but they are “one-and-done”. Once a block is served to a peer, it’s very unlikely to be served again, since there are hundreds of peers (sometimes more) all requesting different chunks of data from different torrents.

How on earth could my 32 GiB of RAM effectively manage 1 TiB of random “one-and-done” chunks in the ARC?

Now consider that I want my ARC to be used for things that actually matter, such as for spinning drives containing data that is accessed more “repeatedly” and predictably, in which I want to lessen the number of reads from the HDDs themselves.


  1. With the exception of qBittorrent’s own working cache. ↩︎

2 Likes

Oh! I love L2ARC for metadata, and my experience matches yours. I’m just saying there’s no perfect and reliable way to warm it up. Getting all of the metadata into ARC is the first step, but there’s no promise it will all then move to L2ARC.

I agree - enable persistent L2ARC, increase the warm-up sysctls, exercise it a bit. And then wait and let it finish populating “eventually”.
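A minimal sketch of that warm-up routine, assuming an illustrative dataset path and the stock OpenZFS monitoring tools (names and output differ slightly by platform and version): the find walk just stats everything so the metadata lands in ARC and can then trickle into L2ARC.

```
# Walk the dataset so every file and directory gets stat'ed (path is illustrative)
find /mnt/tank/media -ls > /dev/null

# Watch the L2ARC fill and the hit rates settle over time
arcstat 10
arc_summary | grep -i l2arc
```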

If OP isn’t seeing many L2ARC hits during regular use, but ARC hits are high … it’s possible L2ARC is only helping him shortly after reboot, before the metadata has gotten into ARC.

I think BitTorrent does most I/O in 16 KiB blocks, so this is partly a question of app behavior as much as anything - qBittorrent, I assume?

If qBittorrent is fetching a 1 MiB piece from disk, it issues 64 reads (64 × 16 KiB).
With primarycache=metadata, each of those reads gets amplified up to the whole ZFS recordsize.

I have a 1MB recordsize on my big torrent volume. If I change to primarycache=metadata my read request rate goes absolutely through the roof. It immediately saturates my devices. It’s quite visible!
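If anyone wants to reproduce that observation, something like this is enough (pool and dataset names are illustrative):

```
zfs get primarycache tank/torrents
zpool iostat -v tank 10              # note baseline read ops and bandwidth

zfs set primarycache=metadata tank/torrents
zpool iostat -v tank 10              # read ops/bandwidth jump: every small read
                                     # now pulls a full record from disk
```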

All this work to “warm up” an L2ARC seems like a LOT of effort. I can only presume that the performance benefits are worth it (compared, say, to buying a tranche more memory to increase the ARC).

2 Likes

We might be “talking past each other”. :wink:

I don’t mind any of that. Let it read more than it should so qBittorrent can “one-and-done” send a chunk to a peer. Just don’t ever cache those data blocks in the ARC, whether it’s a 16 KiB block or a 1 MiB block.

Knock yourself out with as many “read requests” as you want! It’s an SSD-only pool meant to remove the I/O from my main (and important) HDD-spinner pool. In fact, everything being “seeded” already lives on the main pool and in subsequent backups. (So to lose this SSD pool isn’t a big loss.)

I just don’t want any blocks from this massive data slab of torrents (Linux ISOs) to compete for a home in the ARC.

I can word this another way: If my “seed” dataset could live on a separate machine with a non-ZFS filesystem, then I would do so. (It would “leave my ARC alone”, since it’s on a completely different system.) But that would be wasteful and cost more money, as the data can be served from the very same TrueNAS box that holds my other pools.

EDIT: In fact, I could probably just set primarycache=none for my “seed” dataset. Let it “read” the metadata straight from the SSD for all I care.
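(A sketch of that, with a hypothetical dataset name; note that primarycache=none keeps both data and metadata out of the ARC for that dataset, and secondarycache can be set the same way if you also want it out of any L2ARC:)

```
zfs set primarycache=none ssdpool/seeds
zfs set secondarycache=none ssdpool/seeds   # optional: keep it out of L2ARC too
```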

1 Like

Maybe!

I’m saying that to get each one-and-done piece out the door with primarycache=metadata, the data is read from disk on the order of (ZFS recordsize) / (16 KiB torrent block size) times over - a potentially huge factor. With primarycache=metadata, the whole ZFS record gets brought in from disk for each little read!

If zfs recordsize is the default 128KiB, that’s still putting 8x as many I/O requests and bandwidth on the device and bus. If the zfs recordsize is 1MB that’s 64x!

It’s quite possible that this will saturate even a fast device and bus. You might be being overly pessimistic about the ARC and optimizing the “wrong” thing.

Did you configure this because of bad behavior in the past? The ZFS ARC MFU/MRU & metadata changes have been very good to me.

How is this different from primarycache=none or primarycache=all ?

This sounds like a problem of 1-MiB recordsize for seeding torrents (with supposedly 16-KiB “blocks”).


Speaking of 16 KiB, I’m not sure how relevant that is anymore? Large torrents usually have “piece sizes” larger than that. Even the torrent for “LibreOffice” (which is only 340 MiB) uses a 256-KiB “piece size”. The same is true for the Ubuntu ISO torrent.

Each piece (sometimes referred to as “chunk”) has its own checksum.


Maybe I’m dense, but how is this related to the ARC? What if I just set primarycache=none for this one dataset? How would anything fundamentally change? qBittorrent will still have to pull a 1-MiB block from the SSD for every “piece” requested by a peer. A piece likely being 256 KiB in size. Okay, so it had to read an extra 768 KiB? Yes, that consumes RAM, but qBittorrent is limited to how much RAM it can use, and these “wasteful reads” are not contributing to pressure in the ARC whatsoever.

Running (seeding) 24/7, qBittorrent never exceeds 1 GiB of RAM in its usage.

Am I wrong to think that “metadata” refers to non-file data? (Path, size, timestamps, filename, location on filesystem, etc).

As far as data blocks go, the first two behave the same -
primarycache=none = data not eligible for ARC
primarycache=metadata = data not eligible for ARC

primarycache=all = data eligible for ARC

(The difference between none and metadata is that none also keeps the dataset’s metadata out of the ARC.)

If data isn’t eligible for ARC, any app performing multiple smaller-than-recordsize reads will see this amplification effect.

Yeah, or any app, really. 1MiB recordsize is a pretty good compromise for big files, which are usually read semi-sequentially, assuming the ARC is enabled for data. But it’s a bad idea if the ARC is disabled for data.

That’s the size of torrent IO operations, both between clients and to disk. The torrent checksumming is done in the app at the larger 256KiB/512KiB/1MiB piece size.

That’s what I’m saying, it’s way worse than that!

Let’s say it’s that LibreOffice torrent with a 256KiB piece size.
Somebody asks for a piece.
qBittorrent will issue 16 * 16KiB read requests to get the 256KiB piece.

Let’s say the filesystem has a 1MiB recordsize.

With primarycache=none or primarycache=metadata, every single one of those 16KiB read requests will be expanded out to the 1MiB recordsize, and will be satisfied directly from disk.

So getting that 256KiB torrent piece from disk requires 16 read operations, and generates 16MiB of I/O.

With primarycache=all, the first 16KiB read request is still expanded out to the 1MiB recordsize. But the second read request, and all subsequent requests for that piece, are satisfied from ARC. ZFS only generates one read operation and only reads 1MiB from disk.
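The arithmetic, spelled out as a quick shell sketch using the sizes from the example above:

```
# One 256 KiB piece, served as 16 KiB reads, from a dataset with a 1 MiB recordsize
recordsize_kib=1024; piece_kib=256; block_kib=16

reads=$(( piece_kib / block_kib ))                   # 16 read requests
no_cache_mib=$(( reads * recordsize_kib / 1024 ))    # 16 MiB from disk (primarycache=none/metadata)
with_arc_mib=$(( recordsize_kib / 1024 ))            # 1 MiB from disk (primarycache=all)

echo "$reads reads: ${no_cache_mib} MiB without data in ARC vs ${with_arc_mib} MiB with it"
```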

I don’t think this would change qBittorrent’s RAM usage either way. This would be invisible to qBittorrent.

That’s how I think of it too.

1 Like

So far so good, and as intended.


What amplification? Against the SSDs? This is the part I’m confused about. I have no qualms whatsoever about my SSD-only pool being “hammered”, even if it’s “sacrificial”.


See my above comment. My SSD-only pool (whose greatest “activity” comes from qBittorrent) is almost “sacrificial” in its purpose. As long as it does not involve the ARC, it can “hammer away” at the SSD for all I care. (Keep in mind, it’s almost purely “read” operations, and rarely “write”. Most of the writing is due to the “System Dataset”.)

My main pool (HDD spinners) cares not of the fate of my SSD-pool.


But that 1 MiB in ARC is multiplied by the hundreds of peers requesting random pieces from a massive “slab” of data.

I would rather the SSDs get hammered (and “over”-read), than even temporarily use the ARC. (I want the ARC to be almost-exclusively used and “pressured” by my HDD pool. Especially metadata from my HDD pool.)


I haven’t “felt” this read amplification. It might even be possible that libtorrent-rasterbar (the library used by qBittorrent) has been improved so that it no longer reads 16 KiB “subpieces” one by one, but instead grabs the entire piece into RAM.

The only mentions of 16 KiB subpieces (and their requests by peers) that I can find are fairly dated (from 2005 - 2014).

Considering how much more RAM we have since then (and how it’s not “costly” to just read and hold an entire piece in the application’s working memory), it’s possible that this behavior has since changed in certain torrent applications. So maybe a request for a 16 KiB subpiece actually loads the entire 256 KiB piece into (non-ARC) working memory, so that further immediate requests for subpieces within it are served entirely from RAM?

(I’m not a code reader, so I wouldn’t know where to look in the source code.) :wink:

2 Likes

For one, some motherboards may not allow upgrading beyond a certain point. Or a motherboard may be so finicky about memory that upgrading it past stock is simply unrealistic.

One such motherboard was the mini-ITX board in the iXsystems Mini, Mini XL, etc. (aka the Asrock Rack C2750D4I), which came with 32 GB of RAM from the factory and could theoretically support up to 64 GB of unbuffered ECC UDIMM RAM - but only if you knew the secret handshake and were willing to troll eBay for months or pay $$$ for basically unobtanium ECC RAM sticks off the QVL.

Contrast that with the relative ease of retrofitting a metadata L2ARC and watching in glee as rsync tasks for file-intensive loads like an iTunes folder executed 3-16x faster with the metadata L2ARC vs. stock. It really adds up.

Let’s just say that adding an L2ARC (metadata only, persistent) was a lot easier than finding the relevant RAM sticks for the C2750D4I, and a lot more reliable too. Not every use case will be similar, your mileage will vary, etc.
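For anyone curious, retrofitting a metadata-only L2ARC is only a couple of commands (pool and device names are illustrative, and secondarycache can also be set per dataset instead of pool-wide):

```
zpool add tank cache /dev/nvme0n1        # attach the cache device
zfs set secondarycache=metadata tank     # only metadata is eligible for L2ARC
```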

2 Likes

This would be consuming bandwidth on the SSD and its subsystems - bandwidth the ARC is supposed to reduce.

And yes, I suspect that the “1 MiB” block in ARC would get read a few dozen times before the client moves on.

I see the iops and bandwidth instantly in zpool iostat 10 if I set primarycache=metadata on a torrent pool. It’s pretty easy to saturate a SATA SSD with the amplification, even with very modest demand.

… why not let ARC have it? :kiss:

Do you actually see the ARC getting destroyed by torrent traffic? I don’t.

These are NVMe SSDs. Not sure if that makes much of a difference.


It’s not a matter of the ARC getting destroyed. I’ve already experienced aggressive metadata eviction from ARC (due to data pressure) prior to OpenZFS 2.2.x. I just want to minimize any possible pressure on my metadata in the ARC.

When metadata gets evicted, I do in fact “feel” it.

To allow any blocks of data from my torrent “slab” in the ARC, even temporarily, is just not worth it for me. Go ahead and pepper my SSDs with “amplified” reads. My NAS performance hasn’t been hit.

However, don’t you dare touch my precious metadata in the ARC, because that effect on performance is felt. :disappointed_relieved:

Yes, since the introduction of OpenZFS 2.2.x, the issue of “ARC pressure” against the metadata might in fact be fully resolved. It might not put any pressure on my metadata, even as it holds blocks (from the torrent “slab”) in my ARC.

I might entertain re-enabling primarycache=all to my torrent dataset. But so far, I haven’t felt a reason to do so.

1 Like

I think it might be. I’m so much happier lately. zfs_arc_meta_balance is good stuff.

1 Like

I set mine to 2000 (default is 500), and I’m also quite happy. :grin:
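For reference, on Linux zfs_arc_meta_balance (OpenZFS 2.2+) is a module parameter, while FreeBSD/TrueNAS CORE exposes it as a vfs.zfs sysctl; the path and the 2000 value below are just to illustrate:

```
cat /sys/module/zfs/parameters/zfs_arc_meta_balance      # default is 500
echo 2000 > /sys/module/zfs/parameters/zfs_arc_meta_balance
```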

To throw a little humor in here: The real “read amplification” is from TrueNAS Core 13.3 itself. :sweat_smile:

Ever since upgrading to 13.3, my boot-pool is being scrubbed every single day.

1 Like

That’s still not as bad as the read amplification I’m talking about! That’s only 6 more reads than there should be. :laughing: :kissing_heart:

1 Like

Scrubs use brushes.

Bittorrent uses tender loving care from the community.

There’s a difference.

1 Like