L2ARC tuning guide and common misconceptions

Worth emphasising why average record size is an important qualifier. (How does one see that in ZFS?)

Yes but… in that specific scenario wouldn’t it be appealing to add a 256GB L2ARC and effectively fit the active data set there? Maybe also relatively more useful than going from 16 to 32, or 32 to 64GB RAM.

Obviously >128GB RAM would be even better still, but maybe not the practical option.

Actually seeing the breakdown by recordsize requires a full walk of the (meta)data on your pool:

zdb -LbbA -U /data/zfs/zpool.cache tank

This will put a significant amount of I/O on your pool as it works through the metadata, and it can take a very, very long time:

estimated time remaining: 20hr 30min 19sec

So, you’re better off taking a conservative estimate.

It’s massively faster on all-SSD pools but what you eventually get is something like this:

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  6.85K  3.42M  3.42M  6.85K  3.42M  3.42M      0      0      0
     1K:  13.4K  16.0M  19.5M  13.4K  16.0M  19.5M      0      0      0
     2K:  13.4K  34.2M  53.7M  13.4K  34.2M  53.7M      0      0      0
     4K:  34.2K   138M   192M  9.70K  49.7M   103M      0      0      0
     8K:  17.0K   164M   355M  6.85K  78.4M   182M  43.8K   351M   351M
    16K:  16.4K   372M   727M  17.2K   349M   531M  42.2K   716M  1.04G
    32K:  25.6K  1.13G  1.84G  20.3K   748M  1.25G  25.2K  1.03G  2.07G
    64K:  37.2K  3.07G  4.92G  6.49K   583M  1.82G  39.9K  3.61G  5.68G
   128K:   337K  42.1G  47.0G   406K  50.8G  52.6G   348K  59.5G  65.1G
   256K:      0      0  47.0G      0      0  52.6G    929   298M  65.4G
   512K:      0      0  47.0G      0      0  52.6G      0      0  65.4G
     1M:      0      0  47.0G      0      0  52.6G      0      0  65.4G
     2M:      0      0  47.0G      0      0  52.6G      0      0  65.4G
     4M:      0      0  47.0G      0      0  52.6G      0      0  65.4G
     8M:      0      0  47.0G      0      0  52.6G      0      0  65.4G
    16M:      0      0  47.0G      0      0  52.6G      0      0  65.4G

From which you can calculate the breakdown of your data - but it’s likely better to just estimate that you’re going to use twice as much as you think you will. :slight_smile:
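
If you want to reduce that table to the single “average block size” figure the calculator asks for, one rough way (using the bottom cumulative psize value and the summed block counts read off the table above) is:

# back-of-envelope: total cumulative psize divided by total block count
# (~47 GiB spread over roughly 501K blocks in the histogram above)
awk 'BEGIN { printf "average block: %.0f KiB\n", (47 * 2^20) / 501000 }'

which lands at roughly 98 KiB here - unsurprising, since nearly all of the data sits in the 128K bucket.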

The challenge is that your ARC will churn so rapidly that nothing stays in there long enough to be hit significantly more frequently than data that’s just passing through. Ghost lists will do some work, and while the ARC will eventually start to pass useful data in there, you’re likely to still have a lot of low-value “one-hit wonder” ARC records pushing potentially good data out of the L2ARC ring-buffer. Not that it won’t work, but it’ll be less effective than if you had a larger initial ARC to hold and sort the data.

Again - the recommendations we make and defaults we have are somewhat generalized to apply to the vast majority of cases. The knobs are there for tuning, and we’re working on making more guidance for when and which way to turn them - but it’s very much dependent on each configuration and use-case.

5 Likes

Thank you. I’ve added the link to the first post.

In the discussions that followed, the ZFS dataset “recordsize” has been re-affirmed as being a maximum, not a fixed, size. So perhaps mentioning that in the L2ARC calculator will assist people in understanding the output. And yes, you’ve listed “average block size”, not “recordsize”… but users may overlook that initially (I did).

1 Like

Very useful, thanks.

Yes, but…

  1. l2arc_mfuonly=2 should work to reduce the churn of “one-hit wonders” once things settle in. Also, still staying with your example, if the active working set is really 128GB, then even with a smallish ARC the majority of activity will still be there (by definition), and the majority of L2 “churn” would still be within your active working set. A 256GB L2 adds another 100% of capacity outside the active working set for miscellaneous data.
  2. Still staying with your example… I guess it comes down to average block size again. If you’re at the most extreme end of the scale with 4k blocks then the odds will work against you, but in the vast majority of cases the RAM overhead for 256GB of L2 would still be in the order of 1GB or less (rough numbers in the sketch below), which even with 16GB RAM would seem a good trade-off (6% of RAM) for what it could bring - maybe particularly in that case, as ARC itself won’t be anywhere near able to contain the active working set. As would be the case with 32GB too.
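
To put rough numbers on point 2 (assuming roughly 96 bytes of RAM per L2ARC header, in line with the “90ish bytes” figure mentioned later in this thread; treat these as back-of-envelope estimates):

# RAM needed for L2ARC headers = (L2ARC size / average block size) * header size
awk 'BEGIN { l2 = 256 * 2^30; hdr = 96;
             printf "128K average blocks: %4.1f GiB of headers\n", l2 / (128 * 2^10) * hdr / 2^30;
             printf "  4K average blocks: %4.1f GiB of headers\n", l2 / (  4 * 2^10) * hdr / 2^30 }'

That is about 0.2 GiB in the common case versus about 6 GiB at the 4K extreme, which is where the “odds will work against you” caveat comes from.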

Anyway, we’re going around in circles a little bit. At the end of the day each case will of course look different. I understand why you’re erring on the conservative side, and ultimately, for advanced users, the most important part is probably to understand the concepts and trade-offs, what data to base the decisions on, and where to find it.

1 Like

About a day-and-a-half after adding a new L2ARC, setting
l2arc_write_max to 64M and l2arc_mfuonly to 2.
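
(Aside, for anyone wanting to replicate this: on SCALE/Linux these are OpenZFS module parameters, so something along these lines should do it. The values are simply the ones used for this experiment, and they won’t survive a reboot unless made persistent elsewhere.)

echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max   # 64M, in bytes
echo 2 > /sys/module/zfs/parameters/l2arc_mfuonly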

        Compressed:                                    98.3 %  365.5 GiB
        Header size:                                  < 0.1 %   37.2 MiB
        MFU allocated size:                            99.5 %  363.7 GiB
        MRU allocated size:                             0.3 %    1.2 GiB
        Prefetch allocated size:                        0.2 %  624.2 MiB
        Data (buffer content) allocated size:          99.6 %  363.9 GiB
        Metadata (buffer content) allocated size:       0.5 %    1.7 GiB

L2ARC breakdown:                                                   13.4M
        Hit ratio:                                      3.9 %     520.3k
        Miss ratio:                                    96.1 %      12.9M

Also FWIW, in ARC my MFU/MRU ratio for ARC hits looks very different from @HoneyBadger’s:

        Most frequently used (MFU):                    85.7 %       1.6G
        Most recently used (MRU):                      11.1 %     213.3M
        Most frequently used (MFU) ghost:               0.1 %       2.1M
        Most recently used (MRU) ghost:                 0.1 %       1.3M
        Uncached:                                       0.0 %          0

And for total I/O from L2ARC, I see 1.6TiB of writes already. I have two 200GB SAS SSDs for cache. That’s like ~3 DWPD assuming 1.5 days (arithmetic below). Not negligible. For these drives I’m not at all worried, but if you raise l2arc_write_max, proceed with caution. @nabsltd

L2ARC I/O:
        Reads:                                      270.3 GiB     520.3k
        Writes:                                       1.6 TiB     148.7k
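
The ~3 DWPD figure works out roughly like this (assuming the writes are spread evenly across the two 200GB cache devices):

# (total L2ARC writes / number of devices) / device capacity / days elapsed
awk 'BEGIN { gb = 1.6 * 2^40 / 1e9;   # 1.6 TiB of L2ARC writes, expressed in decimal GB
             printf "%.1f DWPD\n", (gb / 2) / 200 / 1.5 }'

which comes out at about 2.9 drive writes per day per device.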

I have the EMC OEM version of this, I think. So a 9PiB endurance each?

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HITACHI
Product:              HUSMH842 CLAR200
Revision:             C250
Compliance:           SPC-4
User Capacity:        200,049,647,616 bytes [200 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x5000cca04a74162c
Serial number:        0LX1V4NA
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb  6 14:26:47 2025 EST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Percentage used endurance indicator: 1%
Current Drive Temperature:     30 C
Drive Trip Temperature:        70 C

Accumulated power on time, hours:minutes 39573:03
Manufactured in week 47 of year 2015
Specified cycle count over device lifetime:  0
Accumulated start-stop cycles:  0
Specified load-unload count over device lifetime:  0
Accumulated load-unload cycles:  0
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 6939075

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0     175333.389           0
write:         0        0         0         0          0     120971.663           0
verify:        0        0         0         0          0      13998.341           0

I’m seeing this first-hand in my constrained-RAM experiment. My l2size has been stuck in the low 30s of GB since last night. Feeds are sporadic and fairly small. You mention an “MFU index” above and I’m not sure if it was meant as a hypothetical or it’s an actual constant in arc.c… But this seems like a good candidate to become a tunable if the latter…

I recall these two stats being mostly equal before this experiment began:

Unfortunately I didn’t screenshot it.

All that said, I’ve just realized l2arc_headroom is 8 for some reason. I’m trying zero now.
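
For reference, checking and changing that at runtime (parameter name as in OpenZFS; the zero is just what I’m trying, not a recommendation):

cat /sys/module/zfs/parameters/l2arc_headroom
echo 0 > /sys/module/zfs/parameters/l2arc_headroom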

This is amazing! Thanks for this! :smiling_face_with_three_hearts:

I’ve plugged in “do not dare put L2ARC on that” numbers with a dinky recordsize and limited RAM to show a sort-of worst case:

A recycled 120GB fleaBay SSD would be perfect here.

I’m tempted to “build” this in a VM using virtual disks just to see the arc_summary numbers after a couple days of simulated use.

1 Like

Despite your ominous warnings I couldn’t resist running this command on my own HDD pool (“must touch hot stove…” :upside_down_face:), and to my great surprise it only took between 5 and 10 minutes to complete (see sig for the details on my pool):

Block Size Histogram for pool 'elephant'
  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  37.0K  18.5M  18.5M  37.0K  18.5M  18.5M      0      0      0
     1K:  16.7K  20.1M  38.6M  16.7K  20.1M  38.6M      0      0      0
     2K:  17.9K  45.3M  83.8M  17.9K  45.3M  83.8M      0      0      0
     4K:   137K   554M   638M  16.5K  91.8M   176M      0      0      0
     8K:  55.7K   552M  1.16G  18.7K   219M   394M  77.4K   929M   929M
    16K:   114K  2.07G  3.23G  46.9K   856M  1.22G   153K  3.58G  4.48G
    32K:  80.5K  3.62G  6.85G   133K  4.41G  5.63G  91.4K  3.80G  8.28G
    64K:   168K  14.9G  21.8G  15.3K  1.23G  6.86G   126K  10.3G  18.6G
   128K:  2.59M   333G   355G  2.85M   364G   371G   179K  32.3G  51.0G
   256K:  90.4K  35.6G   390G  5.55K  1.98G   373G  2.58M   686G   737G
   512K:   134K  88.6G   479G  11.1K  8.39G   382G  91.2K  72.2G   809G
     1M:   800K   866G  1.31T   960K   960G  1.31T   133K   178G   987G
     2M:  4.94M  9.88T  11.2T  5.04M  10.1T  11.4T   800K  1.69T  2.66T
     4M:      0      0  11.2T      0      0  11.4T  4.94M  19.8T  22.5T
     8M:      0      0  11.2T      0      0  11.4T      0      0  22.5T
    16M:      0      0  11.2T      0      0  11.4T      0      0  22.5T

Not knowing any better I suspect that this might be for two reasons:

  • My average blocksize is much higher (default recordsize is 1M, 2M for media files, and for my one sparse zvol the volblocksize is 128k)
  • The pool is less than two weeks old, so I assume that fragmentation is near zero, which might speed up the traversing of the pool.

Still, perhaps this means that running this command to get the stats for one’s own pool is feasible in more situations than you would expect. And that would make it easier to recommend sizing based on that data, rather than on general guidelines.

1 Like

Worth noting, the time it takes is only half of the equation. Running this command on a production system with other real workloads will have potentially large performance ramifications as well. For a lab system, go ahead! But on a system where real work is being done, making your userbase angry isn’t ideal.

Oh, absolutely! Not a command you’d run casually (even though I think it’s a bit of a shame that the command doesn’t provide a “low priority” option). But wouldn’t most systems have some idle hours per day or during a weekend to at least give this a try?

Congratulations, your average block size is in the order of 1MB. IF the same distribution were reflected in L2ARC, you’d be able to fit several terabytes with only 0.5GB of RAM overhead…

Perhaps more meaningful would be to L2 cache the more random read/write stuff but even 128k blocks should give you plenty to work with.

Special vdevs laugh in the face of your warning. Time taken to traverse a 10-wide RAIDZ2 pool of 16TB HDDs, with a three-wide mirror of 2TB M.2 NVMe drives as a special vdev: less than Usain Bolt needs to travel 100m (on foot, anyway).

For 41.2T :slight_smile: I offer this as an example: is it worth using a tuned L2ARC with a pool with a special vdev? I expect not, but it’s a point worth considering.

Block Size Histogram for pool Network_Storage

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   141K  70.4M  70.4M  24.8K  12.4M  12.4M      0      0      0
     1K:  65.0K  71.8M   142M  92.6K  93.4M   106M      0      0      0
     2K:  13.0K  36.0M   178M  1.64K  4.22M   110M      0      0      0
     4K:  76.4K   326M   504M  1.26K  6.84M   117M  3.03K  12.1M  12.1M
     8K:  64.2K   681M  1.16G  2.45K  30.0M   147M   272K  2.16G  2.17G
    16K:  6.38K   147M  1.30G  33.2K   553M   700M  88.0K  1.94G  4.12G
    32K:  9.86K   467M  1.76G   206K  6.54G  7.22G  10.0K   457M  4.56G
    64K:  18.6K  1.67G  3.43G  9.92K   927M  8.12G  16.0K  1.44G  6.00G
   128K:  23.6K  4.33G  7.76G  26.0K  4.05G  12.2G  25.3K  4.62G  10.6G
   256K:  33.8K  12.4G  20.2G  13.5K  4.77G  16.9G  36.9K  13.6G  24.2G
   512K:  53.1K  39.5G  59.6G  13.2K  9.74G  26.7G  53.1K  39.4G  63.6G
     1M:   151K   199G   259G  68.7K  73.7G   100G   111K   153G   216G
     2M:  16.4M  32.8T  33.1T  16.6M  33.1T  33.2T  16.4M  41.2T  41.4T
     4M:      0      0  33.1T      0      0  33.2T      0      0  41.4T
     8M:      0      0  33.1T      0      0  33.2T      0      0  41.4T
    16M:      0      0  33.1T      0      0  33.2T      0      0  41.4T
1 Like

Can we even derive such a conclusion from the block size histogram alone?

My understanding (right or wrong) is that so far we’ve been at the “First do no harm” stage - that is, to find a way to determine an upper limit for L2ARC size so that it won’t impact the effectiveness of the ARC by taking too much memory away from it.

Just glancing at your histogram it appears that your average block size is well above 1MiB, so using the calculator for the system in your sig, you could use a 200 TB L2ARC cache and consume less than 10% of your RAM. Or for a 40TB L2ARC cache you’d use less than 4GB of RAM.
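
For the 40TB case, under the same rough assumptions used earlier in the thread (about 96 bytes of RAM per header, 1MiB average block size):

awk 'BEGIN { printf "%.1f GiB of headers\n", 40e12 / 2^20 * 96 / 2^30 }'

which is about 3.4 GiB - comfortably under the 4GB mentioned above.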

So an L2ARC cache of any conceivable size appears to be no danger to your pool’s performance (but perhaps a danger to your wallet’s performance :wink:).

Projecting the potential benefit (e.g. in percentage of requests served from L2ARC) would appear to be a different beast altogether though…

3 Likes

If I had the money for a 40TB L2ARC I’d just put everything onto consumer NVMe drives!

My understanding of the questions raised in this thread boils down to: how much cheaper, safer and easier than a special vdev can L2ARC be, whilst reaching 80+% of the benefit? Is this wrong (asking as someone who doesn’t work in tech)? It’s just that the figures generated do need the context, and it feels like the thread will travel in the direction of trying to answer the following question: at what point does a special vdev make sense?

Thanks for the tool.
But, on second thought, maybe we’d need an extra parameter box to account for write cache / networking speed. Base recommendations are for 1 GbE; switching to 10 GbE takes up to two txgs’ worth of RAM for the write cache, reducing available ARC size by a not-unsubstantial amount, and then it seems that “L2ARC footprint at 10/25% RAM” does not fully capture what’s going on…
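
As a very rough illustration of that point (assuming ~1.2 GB/s of sustained 10 GbE ingest and the default 5-second transaction group interval; in practice the dirty-data limits will cap this):

awk 'BEGIN { printf "%.0f GB across two in-flight txgs\n", 1.2 * 5 * 2 }'

which is a meaningful bite out of a 16 or 32GB system before ARC gets its share.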

You need to understand what a Special Allocation device / vDev can do:

  • Metadata & the indirect blocks for user data (but not the user data itself)
  • Small blocks from files (the size that defines a “small block” is user-adjustable; see the example after this list)
  • DeDup table
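
For the small-block point, the size threshold is the special_small_blocks dataset property; for example (the pool/dataset name and the 32K threshold are placeholders, not recommendations):

zfs set special_small_blocks=32K tank/mydataset

Blocks at or below that size then land on the special vdev rather than the main data vdevs.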

But, I take your point.

One key difference is that you can lose an L2ARC device without losing your pool. If you lose a Special Allocation device, your pool is gone. That is why the recommended redundancy of a Special Allocation device / vDev is the same as the pool’s. So, with RAID-Z2 in the main pool, you should have 3-way mirroring of the Special Allocation vDev.

Note that you can NOT mirror or RAID-Zx an L2ARC device. You can stripe them, if so desired, which can mitigate the loss of an L2ARC device. The remaining L2ARC device(s) will continue to serve up what data they can, and start to load more data that might have been on the failed L2ARC device.
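
For example, you can add more than one cache device in a single command to stripe them as described above (device paths are placeholders):

zpool add tank cache /dev/disk/by-id/ssd-one /dev/disk/by-id/ssd-two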

Another key difference between L2ARC and Special Allocation device(s) is that an L2ARC can be removed at any time. A Special Allocation device can only be removed if the data pool consists of Mirrors only. Any RAID-Zx or dRAID and the Special Allocation device(s) are forever stuck attached to your pool.

3 Likes

It’s doing percent of total RAM, not percent of ARC. If you want to try to figure out ARC size, the javascript is here: https://jro.io/l2arc/l2arc.js

1 Like

I am absolutely not an expert on ARC or L2ARC, but I have a couple of comments:

  1. My ARC on my small NAS (media / backup server), sitting in front of c. 10TiB of data, is only c. 3.5GB in size, but achieves a cache hit rate of > 99% (so well above 90%).

  2. With that level of hit rate, I can only assume that whatever falls off the bottom of the MFU cache - and would be moved to L2ARC if I had one - would be very rarely read.

Steady state

NAS systems are supposed to stay up for months on end, so let’s consider the situation of a NAS that has been up and running for long enough for ARC to be dropping stuff off the bottom of the MRU / MFU lists.

My gut reaction, based only on my unique sample of one system, is that if you have a reasonable amount of ARC but are getting < (say) 95% cache hit rate, then one of the following applies:

  1. Your workload is sufficiently random that ARC doesn’t help - in which case neither adding more memory nor adding L2ARC will help.

  2. Your workload is not random, in which case adding more memory to increase ARC is likely to improve cache hits and that might well be a better investment than L2ARC.

  3. But if memory is maxed out, then L2ARC might help serve metadata / data that would have stayed in ARC had it been bigger.

This is still pretty much what the old guidelines used to say. Yes - reducing the L2ARC overhead from 180ish bytes to 90ish bytes is a help, but L2ARC still takes memory away from normal ARC functions.

Persistent L2ARC

Again, naively perhaps, it seems to me that the older non-persistent L2ARC didn’t have that much to offer before reaching steady state (e.g. as an edge case during boot) because it was empty.

But now that it is persistent, I can see that it might be beneficial in populating ARC before it reaches stability - kind of in the same way that Windows has boot optimisation using a pre-fetch cache, except on Windows it is the same boot drive, whereas L2ARC is on faster technology.

L2ARC vDev vs. special allocation (metadata) vDev

I do appreciate that L2ARC is a helper cache which doesn’t lose data if it fails, whilst metadata vDevs are critical for the pool and need to be redundant (so significantly more hardware & $$$). But if I am designing a new system / pool, which should I choose (assuming that it is one or the other, and whichever I choose will be on the same NVMe Optane technology)?

L2ARC / Special vDev allocation block sizes

Can you specify an asize or similar for these specialised vDevs, and if so what relationship should they have to the asize of your pool/data vDevs and how should they relate to your dataset recordsizes (which have no equivalent in either L2ARC or special vDevs)?

I genuinely have no idea what the answers are or whether my guesses or right or wrong. So if anyone can respond to these points that would be great.

1 Like

Overkill in both uses. Optane is most suited as SLOG—which doesn’t mean it would make for a bad L2ARC or sVDEV. L2ARC should preferably be NVMe, for the sake of speed; but sVDEV to HDD storage could well be SATA SSDs—fast enough, and even faster as it will be a mirror.

Some ideas:

  • L2ARC only speeds up reads; to speed up writes, sVDEV is required.
  • L2ARC can always be removed; sVDEV on raidz# storage is an irreversible choice.
  • If one is willing to go off-road, a single Optane drive can be partitioned to serve as both SLOG and L2ARC (rough sketch below).
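
A minimal sketch of that off-road setup, assuming /dev/nvme0n1 is the Optane device, “tank” is the pool, and the partition sizes are arbitrary (not an officially supported layout):

sgdisk -n1:0:+16G -c1:optane-slog /dev/nvme0n1    # small partition for SLOG
sgdisk -n2:0:0 -c2:optane-l2arc /dev/nvme0n1      # remainder for L2ARC
zpool add tank log /dev/disk/by-partlabel/optane-slog
zpool add tank cache /dev/disk/by-partlabel/optane-l2arc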

No. Pool-wide settings.

In his discussion of sVDEV, @Constantin pointed out that a zvol whose record size is small enough to count as a “small block” is entirely stored in the sVDEV. So one could have a single pool with large HDDs for bulk storage and a sVDEV for VMs. But I’m not fully convinced that the simplicity of having a single hybrid pool for all uses outweighs the flexibility of the traditional approach of having separate HDD and SSD pools for different uses.

1 Like

My own hardware (see sig.) is really pretty underpowered. Nowhere near enough memory for VMs, but it does run a couple of meaty apps (Plex and Unifi) and a handful of smaller ones, and yet, as a NAS with a very limited-size ARC, the performance is still brilliant!!!

I definitely don’t have the ports to support a special vDev, and not really to support a dedicated L2ARC SSD either - though I suppose I could go even further off-piste and add an L2ARC vDev to my USB boot SSD, which already has an apps pool added. That’s two support rules broken right there, so heck, adding an L2ARC on top to make it a hat-trick of broken rules doesn’t seem too bad.

But my underpowered NAS performs brilliantly without either of these, so why shouldn’t everyone else’s more decently powered NAS?

1 Like