Special VDEV (sVDEV) Planning, Sizing, and Considerations

There might be potential for putting it there as an option - but the characteristics needed to drive maximum SLOG performance can get expensive if you're trying to build large-capacity sVDEVs out of them.

recordsize is the maximum allowable chunk size for files - so in your case, the 128K default means that your media files are split into 128K records. Since your threshold for small files is 64K, your media files will stay on the main vdevs, and the smaller JPG/NFO files will be stored on the sVDEVs as long as those files are <64K. If you increase the "maximum" recordsize, it won't impact any files that stay under that threshold; larger files will simply be stored as bigger, more efficient 1M records.

However, existing files keep their current record layout; to benefit from a larger recordsize you'd need to rewrite them. It's up to you whether you think that's worthwhile.
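If it helps, here's a minimal sketch of the two properties in question, assuming a hypothetical dataset called tank/media (swap in your own pool/dataset and file names); both are standard zfs properties and only affect newly written blocks:

    # Raise the maximum record size for new writes to this dataset
    zfs set recordsize=1M tank/media

    # Blocks of 64K or smaller get allocated on the special vdev
    zfs set special_small_blocks=64K tank/media

    # Existing files keep their old record size until rewritten;
    # one crude way to force a rewrite of a single file in place:
    cp movie.mkv movie.mkv.tmp && mv movie.mkv.tmp movie.mkv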


Thank you very much, I totally get it!
My biggest png is 13MB, and my biggest jpg is 3MB.

I assume that 13MB is too big to be the cutoff!

edit: apparently it is not! But it means that the recordsize should be 16MB. Any downsides to that?


Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   1023   512K   512K    896   448K   448K      0      0      0
     1K:    495   651K  1.14M    411   553K  1001K      0      0      0
     2K:  3.65K  10.2M  11.3M  3.49K  9.76M  10.7M      0      0      0
     4K:  93.0K   372M   384M  8.19K  44.9M  55.7M  61.0K   244M   244M
     8K:   513K  5.94G  6.32G  2.43K  24.5M  80.2M  48.9K   414M   658M
    16K:  26.7K   591M  6.89G  9.18K   162M   243M   440K  10.2G  10.9G
    32K:  55.2K  2.49G  9.38G   528K  16.5G  16.8G   139K  5.14G  16.0G
    64K:   563K  56.2G  65.5G  3.92K   363M  17.1G   556K  55.7G  71.7G
   128K:   123M  15.3T  15.4T   123M  15.4T  15.4T   123M  15.5T  15.6T
   256K:      0      0  15.4T      0      0  15.4T     14  3.82M  15.6T
   512K:      0      0  15.4T      0      0  15.4T      0      0  15.6T
     1M:      0      0  15.4T      0      0  15.4T      0      0  15.6T
     2M:      0      0  15.4T      0      0  15.4T      0      0  15.6T
     4M:      0      0  15.4T      0      0  15.4T      0      0  15.6T
     8M:      0      0  15.4T      0      0  15.4T      0      0  15.6T
    16M:      0      0  15.4T      0      0  15.4T      0      0  15.6T
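For anyone who wants to reproduce the report above: it comes from zdb's block statistics, with something along these lines (pool name is a placeholder, and a full traversal can take a long time on a pool this size):

    # Traverse the pool and print block statistics, including the
    # block size histogram; -L skips leak detection to speed it up
    zdb -Lbbbs poolname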

edit 2: doesn't seem to be a good idea to set it to 16MB: 16 MiB recordsize?!?!?!?! <--- clickbait punctuation marks for the algorithm - #37 by winnielinnie

If you're running raidz#, you want your files split into several chunks spread across drives.

Let’s see if I can figure out flowcharts here.

graph TD
    A[New Write] --> B{Is Write >\n Recordsize?}
    B -->|Yes| C[Split into\nrecordsize chunks]
    B -->|No| D
    C --> D{Is this split chunk >\nspecial_small_blocks?}
    D -->|Yes| G[Write to data vdevs]
    D -->|No| E{Is special vdev full?}
    E -->|Yes| G
    E -->|No| F[Write to special vdev]

If special_small_blocks is equal to recordsize then everything will land on your special vdev(s) until they’re full. So in this case with your PNG/JPG files being multiple megabytes, I don’t know that you’ve got an easy way to do this.
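If you do end up leaning on special_small_blocks, it's also worth keeping an eye on special vdev fill, since (per the flowchart) once it's full new small blocks silently land on the data vdevs instead. A quick way to check, with a placeholder pool name:

    # Capacity broken out per vdev, including special-class vdevs
    zpool list -v poolname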

The linked thread on larger than 1M recordsizes is a good read as well.


thanks :wink:

You have a textbook use-case for an L2ARC with l2arc_mfuonly=2. Those nfo and jpg sidecars – which I presume are metadata for your media app and are therefore read repeatedly – will find their way into L2ARC and will persist across reboots.

People who have a poor use case for L2ARC will talk down L2ARC.
People who have never tried the l2arc_mfuonly tunable will talk down L2ARC.
People who aren’t aware of L2ARC persistence across reboots will talk down L2ARC.
People who in 2012 had a bad experience with L2ARC will talk down L2ARC.
People with insufficient RAM will set up a multi TB L2ARC then complain about reduced ARC size and performance – then talk down L2ARC.

Try one. It's the least technically demanding path to OpenZFS performance enhancement. It doesn't need high endurance or power-loss protection. It doesn't need a mirror. Worst case, it corrupts or dies; then you replace the drive and let it repopulate.
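For the record, trying one is also a low-commitment change at the command line - a rough sketch with placeholder pool and device names (on TrueNAS you'd normally do the same thing through the UI):

    # Add an SSD as an L2ARC (cache) device - no mirror required
    zpool add poolname cache /dev/disk/by-id/ata-EXAMPLE_SSD

    # Changed your mind? Cache devices can be removed at any time
    zpool remove poolname /dev/disk/by-id/ata-EXAMPLE_SSD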

Glorious l2hit% goes brrr…

Glorious CACHE vdevs go brrr…


Very nice!
I've been reading a lot about L2ARC since last April, because I was looking for something to help with my scenario.
I had never heard of this option until today.
People were advising against L2ARC and pushing for more RAM for the ARC instead, so I maxed mine out at 64GB (my motherboard's limit).
I searched for how to set l2arc_mfuonly, but this option seems quite new. Any topic I should refer to?

Thank you very much for your advice.

I’m unsure of the official/correct/blessed method for setting a tunable in SCALE. I’ve always used System | Advanced | Init/Shutdown Scripts:

Someone will be along shortly to point out this is not the right way, I'm sure. Just sayin' that it worked for me…
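For what it's worth, the post-init command itself is just an echo into the module parameter - something like this, assuming your OpenZFS build understands a value of 2:

    # Restrict L2ARC data fill to MFU; on newer builds a value of 2
    # additionally keeps metadata (MFU and MRU) eligible
    echo 2 > /sys/module/zfs/parameters/l2arc_mfuonly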

Side note: It seems older ZFS which doesn’t understand a value of 2 will treat it as 0 and resume the usual dopey caching behavior.

l2arc_mfuonly limits L2ARC to “most frequently used” (MFU) data. The idea is to prevent large one-shot data transfers (e.g. backup jobs, media files, big data copy jobs) from replacing useful cache content by rendering “most recently used” (MRU) data as L2ARC ineligible.

IMO the tunable represents a paradigm shift for L2ARC. It will absorb and retain your most frequently accessed blocks thanks to L2ARC persistence across reboots. As your data and/or your access patterns change, the blocks in L2ARC eventually follow. This 2020-era commit, plus L2ARC persistence across reboots and an earlier architectural change which cut the header size from 320 bytes to 88 bytes, have made L2ARC far more tolerable and useful across a wide variety of systems.

Which of us with slow HDD pools wouldn’t want their most frequently accessed blocks sitting in a fast SSD cache ready to be served sub-millisecond?

The ability to set 2 on this tunable is very new. It makes both MFU and metadata L2ARC eligible. Sounds great but I’ve no hard data (yet) other than it not breaking anything.

One more tidbit I’ll address since I’ve clearly found another hill to die on… People are instructed to ditch their L2ARC when they post an arc_summary showing a low hit rate. I have three comments on this:

  • We don't know when those cache hits occur. Are they happening with a human at the keyboard? Is the human sitting there waiting for the beeps to boop? I get those wonderful strings of bright green l2hit% in arcstat (see the example after this list) doing things I do often. When the HDDs churn and the cache misses, I feel it. My long-term L2ARC hit rate often sits below 20%, but none of you are prying it out of my cold, dead hands.
  • Getting a hit every one or two out of ten data accesses sounds lousy until you consider those hits are being served sub-millisecond and kept out of a queue where they might get serviced sub-10ms if we’re lucky. It only takes a handful of hits to make a difference. Almost daily I watch my HDD pool bog down with latency blowing out to triple digits. I’ll gladly throw a little hardware at a chance to divert some IOPs away from this dumpster fire of spinning rust.
  • L2ARC hits free up the spinners to better service write requests. It's the part people forget about effective read caching – it helps write performance as well.
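For the curious, watching those hits live is straightforward; arcstat ships with L2ARC columns, something like:

    # ARC and L2ARC hit statistics, refreshed every 5 seconds
    arcstat -f time,read,hit%,l2read,l2hits,l2hit%,l2size,l2asize 5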

I’ll save the wordy SLOG rant for some other day.


I think you should open a thread about this new option for the L2ARC so we could discuss it. Many users would be interested.
Just like Constantin did for the sVDEV.

I'll try to figure out the right size for my L2ARC with l2arc_mfuonly=2 and test it for a while.
Ty Richard :wink:


It should probably be pre-init for a kernel module parameter, but that's how you can get them in for now. Let me open a UI ticket for the sysctl interface though. :slight_smile:

It’s actually 96 bytes now I believe, thanks to said added L2ARC persistence, but who’s counting, right? :wink:

It’s important to differentiate here between what ZFS considers metadata and what a client app considers metadata. NFO and JPG files aren’t “metadata” to ZFS so l2arc_mfuonly=2 will have no impact on them; but the value of 2 is probably still better as it means all the ZFS metadata is eligible for L2ARC fill.

Having written a few before, I look forward to reading it. :slight_smile:


So, l2arc_mfuonly=2 will exclude MFU files from the L2ARC?
=0 means MFU and MRU
=1 means MFU
=2 means ?

I'm googling it, but the articles and threads I find are old. Very few are talking about l2arc_mfuonly=2.

If the pool metadata and the MFU data are both stored, I would need at least a 256 or 512GB drive. My 64GB of RAM will be overtaxed by a 512GB L2ARC. The ratio seems too big.

More advice :smile: ?

The “classical” sizing advice for L2ARC is 5 to 10 times ARC (RAM). So 512 GB L2ARC with 64 GB RAM should be fine; go ahead.
Metadata ARC may have a slightly different calculation, but if anything it should be more lenient.
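To put a rough number on the RAM cost, assuming ~96-byte L2ARC headers (the figure discussed above) and the mostly 128K records shown in your histogram:

    512 GiB / 128 KiB per record ≈ 4.2 million L2ARC entries
    4.2 million × 96 bytes       ≈ 0.4 GB of headers held in ARC

That's well under 1% of your 64 GB of RAM, so header overhead isn't the concern it was back in the 320-byte days.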


l2arc_mfuonly=2 is a very new option.

0 means MFU and MRU - the default setting
1 means MFU only (data and metadata)
2 means MFU for data but MFU and MRU for metadata

But again - this is ZFS metadata - not the NFO/JPG files that your client app considers metadata. l2arc_mfuonly=2 is probably a good change to make. I don't think I'd say "it should be the default", but it fits your use case: the video playback itself is easily served from spinning-disk vdevs, and the intended purpose is to accelerate the thumbnail and text access for quicker browsing.
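If you want to confirm what your running module actually supports and is set to, checking the standard Linux OpenZFS module-parameter paths is one way (same location as the post-init example above):

    # Current value in use by the loaded module
    cat /sys/module/zfs/parameters/l2arc_mfuonly

    # Parameter description as reported by the installed module
    modinfo -p zfs | grep l2arc_mfuonly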


I just ordered a 500GB Crucial SSD on sale, thanks everyone for your help :+1:


Pull request.
Earlier discussion.
Commit.


WELL!!! Thank you to EVERYONE!

Most of all @Constantin for the original post. This subject has been a thorn in my side for a while. I always understood that for my workload, combined with my configuration, the L2ARC was USELESS (to me), and RAM is GOD (So I invested in GOD).

BUT! I have 84 8TB drives of spinning SAS rust that I need to index, and hopefully optimize. A bonus is saving a bunch of index bursts to the spinners, to save power.

Most of the data WILL be large video files.

I am in the planning and procurement stage. I invested in semi-recent e-waste 13th-gen Dell servers and loaded them with RAM. Also invested (thankfully at the right time) in some 1TB Intel 905P U.2s, a chunk of Dell BOSS cards, and some Black Friday Gen3 consumer NVMe and H20 Optane at stupidly great prices.

I know what everyone is doing (cluster) and what everything is doing (BOSS/other cards with H20s for VMs/DBs and maybe a SINGLE SLOG), but I always had a NAG…

I have 84x8TB SAS drives (at a hell of a deal) + 4x24-bay JBODs (stuffed, at an even better deal), and I need to index them. Or manage the metadata.

So I would like to re-thank @Constantin for being clear, when EVERYONE ELSE was clear as MUD. Not just here BTW. I have scoured the REDDIT RATS, Level1 Forums, and the abyss, including the TrueNAS Docs.

Ambiguity is the cruel mistress of confusion.

Secondary thank-you to the others like @HoneyBadger & Others

I know my workload, and your examples matched what I thought should translate. The rest of the NET is confusing and contradictory to reason.

The T3 TrueNAS YouTube channel REALLY needs to get on this topic.

Enough about L2ARC, BUY/CONFIGURE RAM.
Tier Your Storage.
Focus on indexing and storing the crumbs.
A metadata sVDEV is the obvious performance win.

I do have a ZFS/TRUENAS request though.

MOST of the confusion (as I see it, even in this post) comes from combining the pool's indexing (metadata) and the small files on the same device.

MAYBE we should be looking at separating the indexing and the small files, the way the SLOG is currently handled.

What a wonderful and streamlined world it would be if, rather than manually predicting the file sizes and saving them to the pool/vdev with the right speed…

…we had the algorithm(s) responsible for small-block allocation assignable all on their own to their own pool. Yeah, I know I stepped out of the pool (box). Maybe we can figure out a way to accomplish the same thing without stepping out of the pool/box?

Anyways… SERIOUSLY prudent post and I am happy I got to learn from it.

Cheers.

Isn’t that:

     l2arc_exclude_special=0|1 (int)
             Controls whether buffers present on special vdevs are eligible
             for caching into L2ARC.  If set to 1, exclude dbufs on special
             vdevs from being cached to L2ARC.

It’s all or nothing, and I’m not advocating for or against it. :grinning:
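On builds that include it, it's exposed as an ordinary module parameter, so peeking at it is the same drill as with l2arc_mfuonly:

    # 0 (the default) = buffers on special vdevs may also be cached
    # in L2ARC; 1 = exclude them entirely
    cat /sys/module/zfs/parameters/l2arc_exclude_special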


Good find. l2arc_exclude_special is very new to the OpenZFS upstream. Not sure if it’s hit TrueNAS yet…


Nah, it’s a few years old. It went into zfs 2.1.something.

It has, and it defaults to 0 because of that "all-or-nothing" behavior. It bears further investigation into what we could and couldn't do to make this more granular.
