L2ARC tuning guide and common misconceptions

I’ve been using secondarycache=metadata on crappy systems with pretty crappy L2ARC devices for a decade. It’s a huge improvement to the feel of a system. Persistent L2ARC was a really nice addition for this kind of duct-tape setup.
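In case anyone wants to replicate that kind of setup, here’s a minimal sketch, assuming Linux/OpenZFS 2.0+, root privileges, and a hypothetical pool name "tank" (adjust to your own layout):

```python
# Minimal sketch of the setup described above. Assumes Linux/OpenZFS 2.0+,
# root privileges, and a hypothetical pool/dataset named "tank".
import subprocess
from pathlib import Path

# Keep only metadata from this dataset in L2ARC.
subprocess.run(["zfs", "set", "secondarycache=metadata", "tank"], check=True)

# Persistent L2ARC is governed by this module parameter
# (1 = rebuild L2ARC contents after reboot/import; defaults to 1 on OpenZFS 2.0+).
print(Path("/sys/module/zfs/parameters/l2arc_rebuild_enabled").read_text().strip())
```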

Nearly any SSD is better than a USB stick, but even a USB stick can be an improvement. Even if the transfer rate is pretty low, getting lower seek latency than HDDs, and taking a few read operations off the spinning HDDs, is a big win. It’s easy for even a small device to hold all of the metadata, and I’m not worried about endurance.


I still say 64GB of RAM before adding an L2ARC is a good rule, as RAM (and thus ARC) is faster than any NVMe or SSD added as L2ARC. RAM added to a system has other uses for the OS, while an L2ARC device is single-purpose. I think it comes down to money and overall performance: use the money to upgrade to 16GB, 32GB or 64GB of RAM, or to add an L2ARC device? Look at how SCALE users load their systems up with apps and VMs.

Apps and VMs are usually run on mirrored SSDs or NVMe. If they are on NVMe, is an L2ARC of the same speed any benefit?


What would make 64GB some kind of special limit or threshold?

L2ARC hit rates are consistently lower than most users would expect, all other things being equal.

Oh sure, no argument from me that the heroes here are the devs and they need more help.

I’m not opposed to this advice, or to setting the value of l2arc_write_max higher. I just wanted to elaborate on the other side of the argument that’s often unheard. The default is the default for a reason. Making it “stronger bigger faster” is fine as long as the user understands their specific hardware’s limitations.
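For reference, the feed-rate tunables are easy to inspect on Linux; a quick sketch, assuming the usual OpenZFS sysfs layout:

```python
# Quick look at the L2ARC feed-rate tunables on Linux (OpenZFS exposes them
# under /sys/module/zfs/parameters). l2arc_write_max is the number of bytes
# written to L2ARC per feed interval; l2arc_write_boost adds extra headroom
# while the device is still warming up. Defaults are deliberately conservative.
from pathlib import Path

params = Path("/sys/module/zfs/parameters")
for name in ("l2arc_write_max", "l2arc_write_boost", "l2arc_feed_secs"):
    print(name, "=", (params / name).read_text().strip())
```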

Ok here comes a table. If I got it right - I think I did - it was a fairly easy hack in Excel.

L2ARC size in GiB as a function of average block size and RAM overhead

How to read: RAM overhead down the rows, average block size across the columns. E.g. with 16 KiB records and 1 GiB of RAM to spare/invest, you want to keep your L2ARC size at 171 GiB or below. Or the other way around: if you’re planning to install a 683 GiB L2ARC (odd size, I know) and your average record size is 32 KiB, then your L2ARC RAM overhead will be 2 GiB.

Caveat 1: compression will play a part too. Multiply the RAM requirement by your compression ratio, e.g. 1.2, because 1.2 times as many records will fit in L2ARC (and create correspondingly larger L2ARC record RAM overhead).

Caveat 2: this is based on data, not metadata. I don’t know how metadata works wrt L2ARC and RAM overhead.

The table is based on 96 bytes per record.

| L2ARC RAM overhead (GiB) \ Average block size (KiB) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 21 | 43 | 85 | 171 | 341 | 683 | 1,365 | 2,731 | 5,461 |
| 1 | 43 | 85 | 171 | 341 | 683 | 1,365 | 2,731 | 5,461 | 10,923 |
| 2 | 85 | 171 | 341 | 683 | 1,365 | 2,731 | 5,461 | 10,923 | 21,845 |
| 4 | 171 | 341 | 683 | 1,365 | 2,731 | 5,461 | 10,923 | 21,845 | 43,691 |
| 8 | 341 | 683 | 1,365 | 2,731 | 5,461 | 10,923 | 21,845 | 43,691 | 87,381 |
| 16 | 683 | 1,365 | 2,731 | 5,461 | 10,923 | 21,845 | 43,691 | 87,381 | 174,763 |
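If you’d rather compute a cell than read it off the table, the arithmetic is just this; a sketch assuming the 96 bytes per record mentioned above, with the optional compression factor from caveat 1:

```python
# Sketch of the arithmetic behind the table: how much L2ARC a given amount
# of RAM overhead can index, assuming 96 bytes of header per cached record.
HEADER_BYTES = 96

def max_l2arc_gib(ram_overhead_gib, avg_block_kib, compress_ratio=1.0):
    """L2ARC device size (GiB) whose record headers fit in the given RAM budget."""
    records = ram_overhead_gib * 2**30 / HEADER_BYTES             # headers that fit in RAM
    on_disk_per_record = avg_block_kib * 2**10 / compress_ratio   # compressed footprint per record
    return records * on_disk_per_record / 2**30

print(round(max_l2arc_gib(1, 16)))        # 171 -> matches the 1 GiB / 16 KiB cell
print(round(max_l2arc_gib(2, 32)))        # 683 -> matches the 2 GiB / 32 KiB cell
print(round(max_l2arc_gib(1, 16, 1.2)))   # ~142 -> caveat 1: compression eats into it
```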

Maybe use the Insert Table option for formatting? Little Star / Settings wheel then Insert Table on upper right of reply box.

Yeah that’s how I did it. And then copied in the data from Excel.

Did you have spaces or something else as separators? ‘1 000’ ‘1,000’ or ‘1.000’

It just behaves funny.

Yes, had spaces, now replaced with commas.

With l2arc_mfuonly=2, L2ARC stores all metadata but only MFU data, which seems so obvious that I don’t know why they didn’t immediately make it the default. Maybe in future versions. It prevents cheap but useful metadata from being churned out by read-once, fly-by data. It should reduce unnecessary disk overhead and wear, and in addition improve eviction quality, which should drive up hit rates. Seems more sensible in every way.
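If anyone wants to try it, a rough sketch for Linux (assuming an OpenZFS version recent enough to accept value 2; needs root, and it won’t survive a reboot unless you persist it via modprobe.d or your platform’s tunables UI):

```python
# Rough sketch: check and set l2arc_mfuonly on Linux. 0 = cache MRU and MFU,
# 1 = MFU only, 2 = MFU data plus metadata from both lists (as described above).
from pathlib import Path

param = Path("/sys/module/zfs/parameters/l2arc_mfuonly")
print("current:", param.read_text().strip())
param.write_text("2\n")   # takes effect immediately, not persistent across reboots
```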


If you hold this to be universally true, then by that logic why doesn’t ZFS just dump ARC in favor of an MFU cache?

Because something doesn’t become MFU until it’s… MFU. All data goes through the MRU cache during the read process (if it isn’t already in MFU or MRU). If a block is read and was already in the MRU cache, it gets moved to the MFU cache, where it can be sorted according to access hits and eventually evicted if it slips down the list. L2ARC, conversely, is a dumb ring buffer, so the quality of its content is completely at the mercy of the eviction logic from ARC.

Completely different mechanics.
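A deliberately over-simplified toy model, nothing like the real implementation (no ghost lists, byte targets or evictions), just to make the first-hit/second-hit distinction concrete:

```python
# Toy model of the MRU -> MFU promotion described above: a block's first
# access lands it in MRU; a second access promotes it to MFU, where blocks
# are ranked by hit count. The real ARC also tracks ghost lists, byte-based
# targets and evictions, none of which are modelled here.
from collections import Counter, OrderedDict

mru = OrderedDict()   # recency-ordered, cheap to maintain
mfu = Counter()       # frequency-ranked

def read_block(block_id):
    if block_id in mfu:
        mfu[block_id] += 1        # already frequently used: bump its score
    elif block_id in mru:
        del mru[block_id]
        mfu[block_id] = 2         # second hit: promote MRU -> MFU
    else:
        mru[block_id] = True      # first hit: MRU only

for b in ["a", "b", "a", "c", "a", "b"]:
    read_block(b)
print(dict(mru), dict(mfu))       # {'c': True} {'a': 3, 'b': 2}
```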


Right. We’re on the same page here. But ZFS ARC intentionally keeps MRU in a separate cache.

        Anonymous data size:                          < 0.1 %    5.9 MiB
        Anonymous metadata size:                      < 0.1 %   30.1 MiB
        MFU data target:                               38.6 %   31.8 GiB
        MFU data size:                                 46.2 %   38.0 GiB
        MFU evictable data size:                       44.4 %   36.5 GiB
        MFU ghost data size:                                    22.1 GiB
        MFU metadata target:                            5.6 %    4.6 GiB
        MFU metadata size:                              5.5 %    4.6 GiB
        MFU evictable metadata size:                    2.9 %    2.4 GiB
        MFU ghost metadata size:                                 4.0 GiB
        MRU data target:                               50.1 %   41.2 GiB
        MRU data size:                                 42.6 %   35.1 GiB
        MRU evictable data size:                       40.6 %   33.4 GiB
        MRU ghost data size:                                    20.9 GiB
        MRU metadata target:                            5.7 %    4.7 GiB
        MRU metadata size:                              5.6 %    4.6 GiB
        MRU evictable metadata size:                    2.6 %    2.2 GiB
        MRU ghost metadata size:                                16.8 GiB
        Uncached data size:                             0.0 %    0 Bytes
        Uncached metadata size:                         0.0 %    0 Bytes

In this example from my pool, the primary ARC MAINTAINS a pretty even split of MRU/MFU data. It would not have to do this in a theoretical MFU-only world. Tracking the blocks and keeping them in cache don’t have to be interdependent.

Let’s narrow down a little further. Notice that the MRU/MFU ghost data, i.e. data that was evicted from the cache, are also relatively equal. This means that from this pool’s perspective, the value of MRU and MFU is about equal. The joining of the two is the magic of ARC.

        MFU data size:                                 46.2 %   38.0 GiB
        MFU ghost data size:                                    22.1 GiB
        MRU data size:                                 42.6 %   35.1 GiB
        MRU ghost data size:                                    20.9 GiB

If we look at the target values in relation to what’s actually in ARC, ARC is actively trying to cache more MRU data than MFU data. This is the exact opposite of what the tunable is doing for you in your L2ARC.

MFU data target:                               38.6 %
MRU data target:                               50.1 %

Tossed this together: L2ARC Footprint Calculator

This looks at the L2ARC footprint question from the other direction – if we know our block size and we know how much RAM we’ve got, how much L2ARC can we install before L2ARC metadata consumes ~10% of total RAM? And 25% of total RAM? For large blocks, it’ll be way more L2ARC than you think.

I’m going to add some more options, but hopefully this is helpful in its current state.
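Roughly the same arithmetic as the table earlier, just inverted; this is my own sketch of that direction of the question, not necessarily how the linked calculator does it, and it assumes the same ~96-byte header per L2ARC record:

```python
# Sketch of the "other direction": given total RAM and an average block size,
# how big an L2ARC can you attach before its headers eat a chosen fraction of
# RAM? Assumes ~96 bytes of header per L2ARC record, as in the table above.
HEADER_BYTES = 96

def l2arc_ceiling_gib(total_ram_gib, avg_block_kib, ram_fraction):
    header_budget = total_ram_gib * ram_fraction * 2**30   # bytes of RAM we allow for headers
    records = header_budget / HEADER_BYTES
    return records * avg_block_kib * 2**10 / 2**30

# Hypothetical example: 64 GiB of RAM, 128 KiB average blocks.
print(round(l2arc_ceiling_gib(64, 128, 0.10)))   # ~8,738 GiB before headers hit 10% of RAM
print(round(l2arc_ceiling_gib(64, 128, 0.25)))   # ~21,845 GiB at 25%
```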


You’re conflating topics here, and I don’t see you explaining anywhere how L2ARC could/would be better off being fed with read-once data (MRU) as opposed to data which has by definition been accessed multiple times (MFU).

As for the ARC split between MRU and MFU that you refer to, it’s a totally different scenario. ARC has to estimate how many of the recently read bytes to keep in cache (MRU) in the hope/presumption they will be read again, in which case they are moved to MFU. MRU is low overhead and very fast. MFU is more intelligent but more expensive to update (relatively speaking) and does not consider recency, so it may accumulate data you are no longer using. No magic involved ;-), the principles of combining the two were first proposed here: https://www.usenix.org/legacy/event/fast03/tech/full_papers/megiddo/megiddo.pdf


Nice work! A dynamic summary version of some of the above.

I’m not sure how I’m conflating topics at all. L2ARC is, in your own words, “a dumb ring buffer”. When data is evicted from ARC (see ghost data), ZFS decides whether that data is eligible for promotion to L2ARC or should simply be evicted. The tunable in question just tells ZFS not to feed MRU data into L2ARC at all, eliminating its eligibility.

The real-life example I provided proves that in some situations ZFS will actively keep a significant amount of data in both the MRU and MFU caches, in expensive RAM, intentionally.

Right. Not sure I’ve said anything to the contrary tho?

It’s worth understanding a couple of additional things beyond @jro’s quick work with the calculator.

recordsize is an upper bound.
Smaller files are stored in smaller records, so don’t think that putting recordsize=1M on a dataset will mean that a hundred 10K files are suddenly going to pack themselves into a single large record (see the sketch after these points).

“Overhead” is only part of the reason for the general recommendation that you need sufficient RAM to really benefit from L2ARC.
If your primary ARC is too small, then data can get pushed out of the tail end before it’s really able to settle in and be “properly sorted”, in a sense, by the MFU and MRU algorithms. If you’ve only got 16G of RAM but your active data set (ADS) is 128G, then you may, after several rounds of MFUG (ghost) hits, get your truly most important ~16G or so of data (if it earns a hypothetical “MFU index = 4”) to stick, but the rest will probably keep getting pushed out of ARC, and there won’t be enough time to sort out what deserves to be “MFU index = 3” versus “MFU index = 2” and push the actually valuable data to L2ARC.

l2arc_mfuonly is a likely valuable tunable that needs even more testing.
Having L2ARC not even investigate the MRU tail of ARC means that you’re less likely to have things like a backup job’s verify pass blow out both ARC and L2ARC. With L2ARC being a ring buffer as mentioned before, this is extra-penalizing because it’s harder/slower to populate it.
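Back to the recordsize point above, a toy sketch of why small files don’t benefit from a large recordsize (simplified: it ignores compression, embedded blocks and sector rounding):

```python
# Toy model of "recordsize is an upper bound": a file smaller than recordsize
# gets a single block roughly its own size; only files larger than recordsize
# are made up of full recordsize blocks. (Simplified: ignores compression,
# embedded blocks and sector rounding.)
def block_layout(file_bytes, recordsize=1 << 20):
    if file_bytes <= recordsize:
        return [file_bytes]                            # one small block
    full, tail = divmod(file_bytes, recordsize)
    return [recordsize] * (full + (1 if tail else 0))  # all blocks are recordsize

print(block_layout(10 * 1024))              # [10240]: a 10K file stays a 10K record
print(block_layout(int(2.5 * (1 << 20))))   # three full 1M records, not 2.5
```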

This is done with the ghost lists. Hit on something in the ghost list on the MFU side and it’ll re-read it and expand the balance towards MFU. Hit on an MRU ghost and it’ll do the same the other way.

I’ve recently rebooted my personal “home prod” unit so it’s still sitting at 50:50 MFU/MRU but I expect it to lean more strongly MFU once it’s full, because, well:

ARC states hits of all accesses:
        Most frequently used (MFU):                    99.5 %       1.2G
        Most recently used (MRU):                       0.5 %       6.0M
        Most frequently used (MFU) ghost:               0.0 %          0
        Most recently used (MRU) ghost:                 0.0 %          0

Other L2ARC tuning like secondarycache=metadata on datasets will of course also adjust the “minimum requirements” to make it usable. (An sVDEV would be faster but that’s not the purpose of this thread, and adds its own caveats.)

I believe I mentioned this in the last podcast, but:

I know one little company with a vested interest in OpenZFS becoming even more awesome that might be investigating this. :wink:


No offence, but I guess my problem is that you pull in various directions and throw in weird conclusions which imply that you don’t really understand how this works. As mentioned, the way these concepts are used in ZFS, if there is no MRU there cannot be any MFU either. Another way of implementing / thinking about it would have been to use MFU only and weight the low-scoring blocks by time in order to give recent hits a chance to climb, reserving relatively more memory for them than they deserve strictly by number of hits, while giving up on and flushing out the older ones. Using two separate data structures to achieve the same thing, of which one is a simple (and very fast) linked list, is a simple and very efficient way of doing it.

Not to be compared with L2ARC which is altogether different.

I think the disconnect here is that I am stating that ARC actively keeps the actual MRU data blocks themselves, in RAM, by design. It does not have to store the actual data in RAM in order to algorithmically determine which blocks are Most Frequently Used, but it does anyway.