Special VDEV (sVDEV) Planning, Sizing, and Considerations

richardm · January 1, 2025, 11:20am

I’ll go out on a limb here and suggest most people should have a SLOG. Tons of innocuous, everyday, mundane stuff tosses syncs into storage at critical moments (filesystem journaling, directory updates, etc).

Much of my own TrueNAS traffic is block storage over iSCSI to a Windows box which has it mounted/formatted as an NTFS volume. Delete a file? Boom, a sync. Move a file? Sync. Empty the Recycle Bin? Sync. I run iostat and I see it every time. TrueNAS as a target for Windows built-in backup? Sync storm! Windows running in a VM residing on a VMFS datastore hosted by ZFS? Same thing. I witness various activities in Linux VMs running several different filesystems doing similar.

The dirty little secret about ZFS sync writes that must hit storage immediately is doing this often screws up the locality between ZFS metadata and the actual [async-written] data. Having a SLOG alleviates the emergency and allows metadata to be quietly and calmly inserted into the current txg alongside its accompanying data. There’s discussion about this phenomenon and the resulting fragmentation on the OpenZFS github that I can dig up possibly…

The other secret that shouldn’t be a secret but is for some reason… A SLOG can help read performance in a spinning rust pool. Wait what?

With no SLOG those pitifully slow HDDs have to stop whatever they’re doing – which might happen to be reading some data – and commit those sync writes to permanent storage right now. I’d rather mine continue with their reading until it’s time for the next txg commit.

Most people should have a SLOG device.
I am willing to die on this hill.

etorix · January 1, 2025, 12:25pm

I’d go on the other limb and submit that most home users should NOT bother with a SLOG at all.
SMB is asynchronous by default. The major exception is Macs storing TimeMachine backups through SMB—but then we do not care about performance because TimeMachine is a background process, so let’s use the built-in ZIL and be happy with it.

Block storage does benefit from a SLOG. But iSCSI is a costly exercise (high RAM, mirrors, low occupancy), best left to enterprise settings and not used on a home NAS.

richardm · January 1, 2025, 2:30pm

SMB does what it’s told.

It defaults to async if it isn’t told.

BTW, add ACDSee to the list of programs that push syncs. Over SMB.

 SMB2 Header
        ProtocolId: 0xfe534d42
        Header Length: 64
        Credit Charge: 1
        NT Status: STATUS_SUCCESS (0x00000000)
        Command: Write (9)
        Credits granted: 1
        Flags: 0x00000039, Response, Signing, Priority
            .... .... .... .... .... .... .... ...1 = Response: This is a RESPONSE
            .... .... .... .... .... .... .... ..0. = Async command: **This is a SYNC command**
            .... .... .... .... .... .... .... .0.. = Chained: This pdu is NOT a chained command
            .... .... .... .... .... .... .... 1... = Signing: This pdu is SIGNED
            .... .... .... .... .... .... .011 .... = Priority: This pdu contains a PRIORITY
            ...0 .... .... .... .... .... .... .... = DFS operation: This is a normal operation
            ..0. .... .... .... .... .... .... .... = Replay operation: This is NOT a replay operation
        Chain Offset: 0x00000000
        Message ID: 19429
        Process Id: 0x0000feff

Constantin · January 2, 2025, 1:39pm

@Stux is 100% right, I’ve edited my post to add the missing part of the sentence that should have been there but didn’t make it as I had a house full of revelers.

Optane or like SLOGs have significant benefits for sync-heavy workloads. But before you take the plunge, consider your use case and just how much a SLOG might actually do.

For my use case, I have found the addition of a sVDEV to my pool to be a significant upgrade while the SLOG was comparatively unnoticeable. Small files zip in and out, ditto Metadata. I also got rid of my L2ARC, because it was benefitting me so little.

For example, using a sVDEV with the ACDSEE example above, I’d host that database on a datashare that is 100% SSD by making its recordvol size smaller than the sVDEV small file cutoff.

That’s the beauty of a sVDEV, the ability to optimize storage type and use case by application, all in one pool. Overall, that should result in a machine with fewer drives overall even while the sVDEV should be a based on a triple-mirror or better.

If the A2CDSEE database is solely hosted on a ssd part of the fusion pool, I doubt a SLOG will make nearly as much of a difference re: sync writes than if it were hosted in a HDD hosted part of the pool.

That’s not saying SLOGs do not have a performance benefit - they do, and Optane is a lot faster than a regular SSD, but unless the use case is heavily skewed towards sync writes (see examples above) it may not be as noticeable as the benefits of a sVDEV on pool performance.

nowm · January 4, 2025, 2:44am

I think the sVDED especially its capability of storing an entire dataset has great power, effectively we are r/w on SSD. The only problem is the cost efficiency of 3way mirror. There’s the problem for home users, there are too many kinds of supporting devices and, only l2arc do not require mirror.

2 boot
2 slog
3 svdev with multi-TB if extensively used
1 large l2arc, and more if want increase the speed to 50+Gbe

That’s a lot SSD!

etorix · January 4, 2025, 8:56am

Neither boot nor SLOG requires mirorring, and I submit that most home users never need a SLOG at all, nor 50+ Gb/s networking (and if they do they should be on all NVMe!).
L2ARC with a sVDEV is redundant.

So that’s one small SSD for boot, really.

nowm · January 4, 2025, 9:28am

Well the definition of “home user” is indeed blurred, I guess any non-enterprise usage can be called “home user”. It is not uncommon that “home users” running VMs that host git repos, and there is a huge amount of sync writes, plus lost 5s worth of write could be catastrophic in such scenarios. But yeah, buy a 400GB Optane is less than $100.

For small files in svdev, yes, and I think using a l2arc only reduce the performance.
For large files? ARC swaps? that’s a question.

Constantin · January 4, 2025, 5:50pm

That’s also misleading. Usually, folk here have some HDD and some SSD serving different purposes in a traditional TrueNAS setup. i.e.

One or more SSD for boot
multiple HDD for data
Two or more SSD for apps / VMs
SLOG(s) and L2ARC, as the use case or admin vibes require.

So the net difference between a sVDEV-using system and a traditional TrueNAS setup is the number of SSDs devoted to sVDEV vs. SSDs used for Apps, VMs, and the L2ARC. That SSD difference nets out to zero drives if a L2ARC is obviated by a triple-mirror sVDEV replacing the traditional dual-mirror application / VM data-share setup.

As an aside: there usually is no reason for a SOHO user to have redundant SLOGs, if they are needed at all. SLOG benefit is also unlikely to be as great if you simply put all the SLOG-heavy data onto datasets that reside entirely on a 3-way mirror sVDEV. But if you need certainty re: SYNC data, a SLOG is a must and at the same time that requirement has zero to do with sVDEV use.

Whether or not two boot drives are a good idea or not is nuanced but has nothing to do with sVDEV. As others here have pointed out, a HA boot pool system is far superior to a mirrored boot pool. So, yes I have more than a few SSDs in my system and yes, it would be kewl to consolidate some of these drives via the use of partitions. But TrueNAS doesn’t allow that and so we have what we have.

richardm · January 6, 2025, 9:51pm

Sooner or later this commit will percolate into TrueNAS (if it hasn’t already) and I’m curious how well it works.

If a persistent L2ARC with l2arc_mfuonly=2 is effective in absorbing and retaining the pool’s metadata it could save some of us from this sVDEV brain damage.

Edit: It’s in SCALE 24.10.1. Playing with it now…

craigdt · January 8, 2025, 2:35am

Add an extra 3 for the svdev and you can have a striped 3 way mirror for extra capacity and performance, only 6 ssds with the storage of 2, amazing space to parity ratio

nowm · January 8, 2025, 7:22am

Say we have a machine that has svdev, l2arc, slog,

Is there a way to configure the pool so that the l2arc could refrain from cache the small files stored by the svdev? As svdev are often mirrored and l2arc may not, directly read from svdev has better performance than read from l2arc.

Further during async write, can those small files to be stored in the svdev, be directly written to the svdev instead of queued and serialized in the zil?
Otherwise, during sync write, especially, can zil be directly write to the svdev and by-pass the slog entirely?

Constantin · January 8, 2025, 12:41pm

Bigger question is what is the benefit of a L2ARC if you have a sVDEV.

A sVDEV is by definition “hot” re metadata and small files. So L2ARC does zero for either.

With a sVDEV, a L2ARC only has a potential use case for larger-than-small-file-cutoff files that are read repeatedly. Unsure what that would be.

I saw so little L2ARC activity post-sVDEV that I removed it and repurposed the drive.

HoneyBadger · January 8, 2025, 4:16pm

Not at the moment, but that might be a potential improvement to poke upstream to ZFS itself, exposing tunable knobs for “recordsizes to be considered for L2ARC” so you can narrow the range when sVDEV does/does not exist.

async writes don’t engage zil already, the transaction is quiesced in RAM only and the SPA will shove them straight on the sVDEV (assuming they stay under the threshold for special_small_files)

ZIL positioning on the sVDEV might be an interesting possibility but then you’d need to have the same requirements for your sVDEVs in order to have good performance. There is a threshold at which a large write will be written to disk and then only the “updated pointer” put in the ZIL, but that threshold is higher than most people should be considering putting their special_small_files value - it’s intended to avoid swamping the ZIL with large sequential writes that can be handled by underlying pool VDEVs, and leaving the ZIL free to handle the small sync-write demands.

Krill · January 10, 2025, 10:42am

Out of interest, is the ZIL always on a data vdev unless there is a slog?

If so, what would happen if the smallest possible data vdev was used, filled to 100% capacity (say a usb memory card reader with a 1GB SD card, but there was an svdev of a couple of mirrored things size nvme drives, that was using dataset cut off size shenanigans?)

HoneyBadger · January 10, 2025, 6:50pm

I believe the ZIL resides on SLOG first and data VDEVs second, so Very Bad Things would happen at 100% data VDEV capacity even if there is technically free space on the sVDEV. The SPA would throw the error trying to write into the ZIL first before it got a chance to write to sVDEV. Asynchronous writes that fit entirely into sVDEV might succeed in that scenario but the updates to meta/uberblock would probably still make it go “nope” on the final txg sync though.

Stux · January 19, 2025, 10:19pm

Very bad things always happen at 100% capacity on a COW system

Krill · January 20, 2025, 8:55am

Technically that would not be 100% capacity of the storage pool. It could be as low as 1%.

etorix · January 20, 2025, 12:06pm

At 100% capacity on the sVDEV, further (meta)data just goes to the regular data vdevs, so nothing bad happens but the performance benefits are lost.

Krill · January 20, 2025, 2:45pm

The scenario I originally postulated involves the ZiL getting blocked by a 100% full, yet incredibly small data VDEV, whilst the sVDEV still has significant spare capacity: HoneyBadger stated the belief that the ZiL never resides on the sVDEV. Hence the point that Stuxs’ "Very Bad Things"™ can also happen also happen on a not-100% capacity COW system.

Although to be clear, it would have to be an absurd storage pool construction and it should not happen in the real world, but it does raise the question: should a ZiL be able to exist on a sVDEV?

J_B · January 20, 2025, 4:01pm

Hi everyone,

This topic is really interesting, and I’m wondering how to understand it in my user scenario.
I have a pool in Stripe + Mirror for my movies and a huge amount of nfos and jpgs. So, I see the benefit of the sVDEV.

But I don’t see the direct correlation between block size and file size.

If I set the cutoff to 64K, the cumulative is something like 70GB for a 15TB (data in use) / 25TB (total space) pool. I should aim for a 2x256GB sVDEV. 30€.

Constantin said that a 1M block size is better for media files. I am a noob in this domain and chose the 128K default setting in the first place.

Here are my questions:

If I set the cutoff to 64K, will all the thumbnails and pictures be stored on the special sVDEV if I rebalance it? (As I mentioned, I don’t see the link between block size and file size.)
Should I split my mirrors, create a new pool with a new block size and the sVDEV, and then rebuild my mirrors?

What do you think about this use case?

Topic		Replies	Views
Suggestion: Better sVDEV Planning, Oversight Tools in GUI TrueNAS General sVDEV , Feature-Request	55	1800	February 19, 2026
Questions about Fusion Pools (special/metadata vdevs / sVDEVs) TrueNAS General CORE , SCALE , Performance , ZFS , sVDEV	8	800	April 7, 2024
My first ZFS pool / datasets / Zvol: Asking for sanity check TrueNAS General SCALE , iSCSI , ZFS , pool	4	229	January 29, 2025
Metadata vdev, data vdev expansion and Electric Eel query TrueNAS General	11	458	September 23, 2024
Metadata vdev drive replacement + size increase? TrueNAS General CORE , ZFS , sVDEV	12	1236	June 22, 2024

Special VDEV (sVDEV) Planning, Sizing, and Considerations

Related topics