L2ARC tuning guide and common misconceptions

Arwen · February 4, 2025, 4:54am

We have had lively discussions of ZFS’ L2ARC, as L2ARC has both changed over the years, and how to tune the L2ARC has improved. However, this discussion was taking up much of the responses in this thread:

Here are common misconceptions:

Size should be about 5 times RAM and no more than 10 times RAM. This is no longer true.
L2ARC should not be added until RAM is maxed out. While more RAM is better / faster, using L2ARC on even 16GB RAM configurations is practical.
L2ARC takes up lots of RAM, reducing ARC. The individual record overhead for L2ARC in RAM was reduced from 180 bytes to 96 bytes.
ZFS Cache / L2ARC adding as reduced size | TrueNAS Community
L2ARC needs to be populated after each pool import. We now have persistent L2ARC, which improves improves it’s usability. Note that it can take time to re-read in the L2ARC record pointers into RAM at pool import.
Maintaining original compressed records in L2ARC has improved how much can be stored in L2ARC.

These are current recommendations:

Monitor your ARC and L2ARC usage with the ZFS supplied script:
arc_summary
Or if the output of arc_summary is too long / complex, you can use something like this;
arcstat -f time,size,hit%,l2size,l2hit% 10 10
L2ARC size verses RAM header overhead can be computed with this handy calculator:
L2ARC Footprint Calculator
If you get 20% or more hit rate on L2ARC, it is doing a good job. (This is unlike ARC, which should be getting close to 90% to be considered doing a good job.)
When using quite small block / recordsize, this can increase RAM usage because L2ARC headers are per block / record. So a huge L2ARC is less useful on low RAM servers in those configurations.
Using a L2ARC as a test for performance improvement can help determine if you would benefit from a Special Allocation device / vDev. They are not the same, but if a L2ARC does not help at all, then it is possible your work load may not reap much benefit from a Special Allocation device / vDev.

Are their any items to add?
Please keep the discussion on topic to L2ARC.

richardm · February 4, 2025, 5:44am

@NickF1227 it’s fairly new: Enable L2 cache of all (MRU+MFU) metadata but MFU data only by shodanshok · Pull Request #16402 · openzfs/zfs · GitHub

It has no impact on ARC; for only L2ARC does it exclude MRU data.

etorix · February 4, 2025, 7:35am

What would be the updated size guidance then?
I assume that “any size will do” is NOT valid advice.

SmallBarky · February 4, 2025, 7:55am

I think the current guidelines in the Documentation - Scale Hardware Guide are still good

"The most important quality to look for in an L2ARC device is random read performance. The device must support more IOPS than the primary storage media it caches. For example, using a single SSD as an L2ARC is ineffective in front of a pool of 40 SSDs, as the 40 SSDs can handle far more IOPS than the single L2ARC drive. As for capacity, 5x to 20x more than the RAM size is a good guideline. High-end TrueNAS systems can have NVMe-based L2ARC in double-digit terabyte sizes.

Remember that for every data block in the L2ARC, the primary ARC needs an 88-byte entry. Poorly-designed systems can cause an unexpected fill-up in the ARC and reduce performance. For example, a 480 GB L2ARC filled with 4KiB blocks needs more than 10GiB of metadata storage in the primary ARC."

I think the Documentation and Resources just need to be consistent in the advice given for general purpose use.

rungekutta · February 4, 2025, 7:56am

It is completely dependent on record size. ZFS default is 128kB, the iX l2arc guidance is based on 4k, whereas at the same they time recommend 16k records as the minimum… 128 vs 4k is a 32x difference… so my thinking is any recommendation that doesn’t factor in record size is either super inaccurate or super conservative or both.

Arwen · February 4, 2025, 8:26am

@rungekutta’s comment about block size making a significant difference precludes a hard and fast recommendation.

Theoretically we can show a straight forward table with memory size and block size being the 2 factors giving a rough L2ARC size. But, that is too much math for me right now…

rungekutta · February 4, 2025, 12:48pm

Yeah, something like that, for the actual sizing.

iX recommends 8GB RAM as ”baseline” for system, services and basic operation with 8 drives (TrueNAS Hardware Guide | TrueNAS Documentation Hub).

Building on that, if ignoring apps/virtualization, anything >8GB could be considered candidate for ARC + L2ARC overhead, or at least up to 80% of total RAM. So with 16GB total: 0.8x16-8=4.8GB or with 32GB, 0.8x32-8=17.6GB. Etcetera.

Let’s play with the idea that at least 90% of this should be available for ARC and the remaining a candidate for L2ARC overhead. That would afford 480MB for L2ARC overhead with 16GB RAM, and 1.8GB with 32GB. Etcetera.

Then comes your table… as to what this in turn means in practice in terms of total L2ARC size, as function of record size.

rungekutta · February 4, 2025, 12:51pm

And other considerations for an overall tuning guide could include hardware considerations, parameters, how to measure and understand performance.

etorix · February 4, 2025, 7:14pm

This. Any day for a general purpose guidance.
We should not be dealing with fine tuning here—and not really discussing what might eventually become a “Resource”.

Stux · February 4, 2025, 9:38pm

I’m concerned about the L2ARC fill rate setting (whatever it’s called)

Previously it was very low. Is it still? Should it be increased?

It seems to me that a multi TB L2ARC, which seems like a good idea on a system with a decent amount of RAM, would need to fill faster than the L2ARC fill rate has traditionally been.

(I’m sorry I haven’t looked into the current settings)

NickF1227 · February 4, 2025, 10:21pm

l2arc_write_boost and l2arc_write_max both still default to 8MiB in upstream OpenZFS. In years gone by, these two tunables were often recommended to be tuned higher. In a world with persistent L2ARC, I’m not sure l2arc_write_boost holds as much value as it once did.

https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-write-boost

Klara (a la Jim Salter) has a blog post on L2ARC Tuning, although its a few years old now. He warns about making the l2arc_write_max value too large, so as to basically not nuke your SSD by consuming all of the write endurance.

l2arc_write_max is the standard L2ARC feed throttle, and defaults to 8MiB/sec—an absurdly low value, on the surface of it, for a modern SSD. But let’s do some back-of-the-napkin math, and figure out what it means to continually feed a CACHE vdev at 8MiB/sec:

8MiB/sec * 60sec/min * 60min/hr * 24hr/day == 691,200MiB/day == 675GiB/day

I’m certainly not opposed to suggesting folks increase this value, but there’s always a trade-off that’s workload dependant. Allowing a couple of days for your 1TiB L2ARC to “heat up” does not seem too insane though.

NickF1227 · February 5, 2025, 1:24am

@Arwen arc_summary may be a bit overwhelming for the lay person to parse. I think starting with arcstat may be a better first step.

root@testbox[~]# arcstat -f time,size,hit%,l2size,l2hit% 10
    time   size  hit%  l2size  l2hit%
20:22:39   123G   100   54.7G     0.0
20:22:49   123G  91.8   54.9G     0.0
20:22:59   124G  88.5   55.0G     0.0
20:23:09   124G  98.8   55.1G     0.1
20:23:19   124G  87.2   55.2G     0.6
20:23:29   124G  89.9   55.3G     0.0
20:23:39   125G  88.0   55.5G     0.0
20:23:49   124G  93.6   55.8G     0.0
20:23:59   124G  89.9   56.3G     0.0

Far more columns are available if you do arcstat -v, I just chose what I think is most straightforward for alot of people. In the above I am choosing to sample 10 seconds of data an infinite quantity of times, but that can also be adjusted.

FWIW I also just added the L2ARC to this pool, and it’s already 60 GiB of data.

Arwen · February 5, 2025, 2:53am

Good point, I will update the first post with your suggestion. Though, I will limit it to 10 times instead of infinite, as some newish users may find that less manageable.

NickF1227 · February 5, 2025, 4:16am

Ahh I see. It seems the docs on the ZFS site may not have been updated to include the information in that more-recent change (making “2” a valid choice)
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Module%20Parameters.html#l2arc-mfuonly

In either case, while I am sure there are usecases where setting this tunable to “2” can makse sense, I would struggle to suggest making this a general recommendation.

This paper is rather old. But it does a good job of describing the concept of ARC by its original creators. I’m not sure if there are newer/better resources out there, but this historical gem highlights that there are way smarter people than you or I

ZFS ARC ( and L2ARC) intentionally caches both Most Recently Used and Most Frequently Used blocks. To me, it seems almost sacrilege to suggest that a user set l2arc_mfuonly=2.
It’s like saying we should abandon all hope that ARC will intelligently do the best job.

Can you post the contents of arc_summary on your system and describe your workload? I found this discussion just now and seems interesting. To l2arc_mfuonly or not to l2arc_mfuonly? - OpenZFS - Practical ZFS

richardm · February 5, 2025, 9:16am

Again, the tunable impacts only L2ARC. It’s doesn’t adjust core ZFS caching (i.e. ARC). L2ARC is often regarded as a simple ring buffer and the mechanism by which it’s “fed” is not ideal – OpenZFS developers have admitted as much on multiple occasions spanning more than a decade. L2ARC does not have the “brains” described in that fascinating IBM whitepaper. It sits in a doghouse and soaks-up table scraps.

I’m 24 hours into my Beat L2ARC Like a Rented Mule experiment. My host has two L2ARC devices totaling 690GB. I’ve cut its RAM from the usual 96GB down to 16GB of RAM. The usual advice for L2ARC to not exceed RAM by >10x is thoroughly violated here (43x).

I’m looking to disprove two things:

L2ARC is not useful/helpful.
L2ARC is particularly unhelpful on devices with constrained RAM.

My test case is not ideal as my host is used almost entirely for block storage via iSCSI. I gather OpenZFS in a home or small business is generally used for NAS. But after thinking about it two my five iSCSI volumes on ZFS could become SMB shares. Maybe. The candidates:

380k files with 335k sized <64kB. Currently formatted NTFS w/ 4k clusters.
7600 PNG images around 1.5MB each. 64kB NTFS.

#1 will be interesting because moving 380k files onto “native” ZFS will trigger an explosion of metadata. I’ll have to try it with multiple recordsizes and check both space efficiency and cache efficacy. If conventional wisdom is correct my [undersized] ARC will be overwhelmed with excessive L2ARC header overhead and cache %hit will tank.

Needs to remain iSCSI/block:

4100 files, 1400 under 64kB, but 210 are 0.5GB or larger.
Veeam backup volume formatted as ReFS.
VMware VMFS datastore.

Regarding #3: SMB was unusable due to client-side caching behavior. It was fine with any files held open. The client closing a file then returning a few minutes later would trigger another transfer across the LAN and it slowed things down when rotating across the 210 large files. I tried every tunable I could find (both client and server-side) – no help.

L2ARC numbers thus far:

ARC:

ARC efficacy (I’ve worked with only a limited dataset since the last server reboot 28hrs ago):

rungekutta · February 5, 2025, 12:09pm

Some fine, data-driven debunking going on there. Will be interesting to see what you conclude from it.

I saw an (old) example from a mailing list from a freebsd developer who ran l2arc across two usb sticks on his low-ram laptop to offload the crappy built-in spinning hdd. Swore that with persistent l2arc, boot times reduced considerably.

Not necessarily for general recommendation… but obviously while the argument against l2arc on lower-ram systems is that the memory overhead puts even more pressure on an already struggling arc, the counter argument could be that the benefit at the same time could be so much greater, exactly BECAUSE the arc is not keeping up. There’s a trade-off there somewhere but I am very certain that it doesn’t boil down to ”with less than 64GB ram don’t even think about it”.

NickF1227 · February 5, 2025, 2:29pm

I never said anything to suggest otherwise?

Sure. My only point is that ARC contains MRU entries, and for good reason. This tunable presents more of a “hack” than it is addressing root cause of a problem.

For this usecase, and as discussed in the thread I posted earlier, I can see how this tunable may result in better than default behavior.

Please keep us updated as things progress. I’ve also set it on my NAS for science here and I’m not seeing very good results. But, like yours, it’s only been a few hours so likely doesn’t represent much.

L2ARC size (adaptive):                                         375.5 GiB
        Compressed:                                    96.3 %  361.6 GiB
        Header size:                                  < 0.1 %   42.9 MiB
        MFU allocated size:                            81.7 %  295.3 GiB
        MRU allocated size:                            18.2 %   65.8 GiB
        Prefetch allocated size:                        0.1 %  518.5 MiB
        Data (buffer content) allocated size:          99.2 %  358.8 GiB
        Metadata (buffer content) allocated size:       0.8 %    2.8 GiB

L2ARC breakdown:                                                    4.6M
        Hit ratio:                                      3.7 %     169.2k
        Miss ratio:                                    96.3 %       4.4M

richardm · February 5, 2025, 3:52pm

I believe it. The trade-off boils down to:

reduced ARC cache hits with some of the misses intercepted by L2ARC.
maximal ARC cache hits with any and all misses hammering the HDD pool.

I’d prefer the former, generally.

richardm · February 5, 2025, 4:06pm

This tunable presents more of a “hack” than it is addressing root cause of a problem.

Just to make sure we’re in agreement, how do you define the problem?

Yes, the tunable is a hack for L2ARC being less intelligent and capable vs ARC. IMHO the true fix (and how I define the problem) is for someone to join the overworked OpenZFS devs and show some love to arc.c l2arc functions.

nabsltd · February 5, 2025, 5:56pm

For consumer drives, yeah, this is a serious issue. But for even a 0.8 DWPD (about the lowest you can get) enterprise drive of any significant size, even that worst case isn’t a big deal. I’ve got 1.6TB 3DWPD NVMe drives that I use for similar cache (not on a TrueNAS install, but still the same as L2ARC) which cost less than $70 each. Those can sustain 7x the listed write.

NVMe with 17 DWPD at 800GB and 60DWPD at 375GB are all within each reach ($150 or less…sometimes far less). Those can absorb 20-30x the writes of that that absolute worst case “back-of-the-napkin” calculation.

My advice is to buy a decent enterprise drive and not worry about the worst case, because you won’t come close to approaching it, and even if you do, it’s still likely far less than the drive can handle.