Analyse usage of special vdev

Per the above, small files will fit into small blocks, so the expectation is met.

What I didn’t anticipate is that tails of large files smaller than the small-block cutoff go into the sVDEV as well. I will have to rewrite the sVDEV resource to be more careful about that phrasing.

I don’t think there is anything inherently wrong with tails going into the sVDEV; presumably fragmentation and lots of leftover file tails mixed in with small files have a lower impact on system performance there than they would in the HDD pool.

Still, it’s curious just how little of my sVDEV is in use relative to pool capacity. Anyhow, many thanks to everyone for enlightening me!


It wouldn’t just be the tails. Any middle block that compresses below the cutoff will also qualify. This can theoretically happen with highly compressible files.


You have 24.5G of <=128K blocks. I don’t understand how you ended up utilising 649G of sVDEV, unless:

  1. zdb’s histogram doesn’t count snapshot blocks, or something like that.
  2. You have datasets with a 1M recordsize and special_small_blocks set to 1M, thus making them “SSD-only” (see the sketch below).
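
If it is case 2, one way to check is to list the relevant properties per dataset and flag anything where the cutoff reaches the recordsize. A minimal sketch, assuming a pool named `tank` (hypothetical; substitute your own) and the `zfs` CLI in PATH:

```python
# Sketch: flag datasets whose special_small_blocks cutoff is >= recordsize,
# i.e. datasets whose data blocks all qualify for the special vdev.
# Assumes a pool named "tank"; adjust POOL to your pool name.
import subprocess

POOL = "tank"

out = subprocess.run(
    ["zfs", "get", "-r", "-Hp", "-o", "name,property,value",
     "special_small_blocks,recordsize", POOL],
    capture_output=True, text=True, check=True,
).stdout

props = {}  # dataset name -> {property: value in bytes}
for line in out.splitlines():
    name, prop, value = line.split("\t")
    if value.isdigit():          # skip "-" (e.g. recordsize on zvols)
        props.setdefault(name, {})[prop] = int(value)

for name, p in sorted(props.items()):
    ssb, rs = p.get("special_small_blocks", 0), p.get("recordsize", 0)
    if ssb and rs and ssb >= rs:
        print(f"{name}: special_small_blocks={ssb} >= recordsize={rs} "
              "-> all data blocks go to the special vdev")
```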

I wonder whether it’s useful for wide raidzX as well.

RAIDZ is not great for small blocks, but not as bad as dRAID. For 9-wide dRAID your smallest physical allocation is 9 ashifts, no matter how little you write, while even for RAIDZ3 of any width it is “only” 4 ashifts, and for the more typical RAIDZ1 just 2 ashifts. So in the worst case RAIDZ will be close to a mirror in efficiency, while dRAID has the potential to be much worse.
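
In other words, a minimal sketch of that arithmetic, assuming ashift=12 (4 KiB sectors):

```python
# Smallest physical allocation for a single tiny (one-sector) logical write,
# assuming ashift=12, i.e. 4 KiB sectors.
SECTOR = 4096  # 2^ashift bytes

def raidz_min_alloc(parity):
    # one data sector plus one parity sector per parity level
    return (1 + parity) * SECTOR

def draid_min_alloc(group_width):
    # dRAID always allocates a full redundancy-group stripe
    return group_width * SECTOR

print("RAIDZ1      :", raidz_min_alloc(1) // SECTOR, "sectors")  # 2 sectors =  8 KiB
print("RAIDZ3      :", raidz_min_alloc(3) // SECTOR, "sectors")  # 4 sectors = 16 KiB
print("9-wide dRAID:", draid_min_alloc(9) // SECTOR, "sectors")  # 9 sectors = 36 KiB
```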


Ahh I see. Hmm… I would go even one step further.
I have only roughly 350GB of files that are NOT 1M. So how did I end up with 750GB on the sVDEV? Just tails?

Or maybe old snapshots?

A little bit off topic, but it depends on the block size and pool geometry.

Imagine a RAIDZ2 that is 12 drives wide.
From a traditional RAID, you would expect 83.33% storage efficiency. (12-2) / 12 = 83.33%
But for RAIDZ2, if the blocks are 16k in size, you only get about 66%.
See this RAIDZ efficiency table and explanation of why.

This might be a non-issue for you if, for example, you only store files that are 1M+ in size and have a recordsize of 1M, because the bigger the record, the smaller the “RAIDZ problem” tends to get.
If you use zvols for VMs with a blocksize of 16k, all your writes will be 16k (or below, thanks to compression). This could make the potential problem worse.

With dRAID the problem is that the smallest write has to be the size of the number of data drives multiplied by the sector size (probably 4k). So in a dRAID2:10d:0s:12c, which is two parity and ten data, the smallest possible write is 40k.
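
To make those numbers concrete, here is a rough back-of-the-envelope calculation, assuming ashift=12 (4 KiB sectors) and no compression; take it as a sketch rather than exact space accounting:

```python
# Rough RAIDZ allocation model for a single record, assuming 4 KiB sectors
# (ashift=12) and no compression. RAIDZ writes p parity sectors per stripe row
# and pads each allocation up to a multiple of (p + 1) sectors.
import math

SECTOR = 4096

def raidz_alloc(record_bytes, width, parity):
    data = math.ceil(record_bytes / SECTOR)                  # data sectors
    rows = math.ceil(data / (width - parity))                # stripe rows needed
    total = data + rows * parity                             # add parity sectors
    total = math.ceil(total / (parity + 1)) * (parity + 1)   # pad to multiple of p+1
    return total * SECTOR

for rec in (16 * 1024, 1024 * 1024):
    alloc = raidz_alloc(rec, width=12, parity=2)
    print(f"{rec // 1024:4d}K record on 12-wide RAIDZ2: "
          f"{alloc // 1024} KiB allocated, {rec / alloc:.1%} efficient")
# ->   16K record:   24 KiB allocated, 66.7% efficient
# -> 1024K record: 1236 KiB allocated, 82.8% efficient (close to the ideal 83.3%)

# dRAID2:10d pads every allocation to a full stripe of 10 data sectors,
# so the smallest possible data write is 10 * 4 KiB = 40 KiB.
print(f"dRAID2:10d minimum data write: {10 * SECTOR // 1024} KiB")
```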

Tails are blocks themselves, so I don’t know. Could it be that you have auto-trim disabled?

Also, zfs send without the -L flag “generates” blocks with a max size of 128K. But again, those would have shown up in the histogram…
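
If it helps, here is a quick way to check the trim settings from a script; a sketch assuming a pool named `tank` (hypothetical), with `zpool get autotrim` and `zpool status -t` as the underlying commands:

```python
# Sketch: report whether autotrim is enabled and show per-device trim status.
# Assumes a pool named "tank" and the zpool CLI in PATH.
import subprocess

POOL = "tank"  # hypothetical pool name; replace with yours

# "zpool get -H -o value autotrim" prints just "on" or "off"
autotrim = subprocess.run(
    ["zpool", "get", "-H", "-o", "value", "autotrim", POOL],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"autotrim on {POOL}: {autotrim}")

# "zpool status -t" appends per-vdev trim status (e.g. untrimmed / trimmed)
print(subprocess.run(["zpool", "status", "-t", POOL],
                     capture_output=True, text=True, check=True).stdout)
```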

Yeah, I’ve already figured it out. Perhaps I’ll read the provided link later, just in case.

Yes, I do. Since that is the default, I am a little bit skeptical about turning it on.
Edit: I ran it manually. Usage is still at 72% according to zpool. But maybe that is just a zpool update thing?

Had a look: my sVDEV is only 6% full, while the 50x larger Z3 pool is 28% full. Given that I rebalanced the entire pool, I am somewhat flummoxed to have so little in terms of tails. I don’t expect many “guts” (compressible middle blocks), since most of my data is incompressible: archives, images, and videos.

Granted, I went on a metadata-destroying crusade by putting a lot of things into sparsebundle archives that present themselves to the file system only as larger files, but I still would have expected a larger “tail” impact than I see here, especially considering that most datasets have a small-block cutoff of 512K, just one step below the 1M recordsize used for everything else. Curious!
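
As a sanity check, here is a very rough back-of-the-envelope model of how much sVDEV capacity file tails could plausibly consume. It assumes a tail block effectively shrinks to (file size mod recordsize) once compression trims the zero padding, and that those remainders are uniformly distributed; neither is exactly how ZFS accounts space:

```python
# Rough model: expected sVDEV bytes contributed by the tail of each large file,
# assuming tails shrink to (filesize mod recordsize) and remainders are uniform.
RECORDSIZE = 1 << 20      # 1M recordsize
CUTOFF     = 512 * 1024   # special_small_blocks = 512K

p_small   = CUTOFF / RECORDSIZE   # probability a tail fits under the cutoff (0.5)
avg_small = CUTOFF / 2            # average size of such a tail (256 KiB)
expected_svdev_per_file = p_small * avg_small   # ~128 KiB per large file

for avg_file_mib in (10, 100, 1000):
    share = expected_svdev_per_file / (avg_file_mib * 1024 * 1024)
    print(f"avg file {avg_file_mib:5d} MiB -> tails ~ {share:.3%} of the data on the sVDEV")
```

With files averaging tens of megabytes or more, tails would account for well under one percent of the data, which would be consistent with a mostly empty sVDEV sitting next to a 28%-full pool.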

I’m just suggesting it, so I can’t say for sure. Btw, @winnielinnie proposed a trim solution as a weekly cron job. Perhaps you should consider adding it.

This makes sense, but it means that the sVDEV was specifically designed for dRAID and is less suitable for raidz.
Home users hardly ever have a use case for dRAID; this then implies that they hardly ever have a use case for an sVDEV. Hard drive raidz[2,3], optionally with a (metadata) L2ARC and/or a separate SSD mirror pool for apps/VMs, and that’s it.

True. But I’ve already made several changes for the upcoming ZFS 2.4 to start shifting its focus. It will be able to act as a SLOG. It will be allowed to be RAIDZ (actually any topology, without the previous restrictions, so you could use a 3-wide RAIDZ of SSDs as a special vdev for a pool of multiple 10-wide RAIDZ2 vdevs of HDDs). special_small_blocks will accept arbitrary values. zfs rewrite will make it easier to migrate data there and back. But yeah, it is a good point that the semantics of small blocks need some more thought.


It can be the other way around:
There is hardly a case for an SSD mirror pool for apps/VMs. Hard drive raidz[2,3], optionally with an sVDEV and/or a separate “SSD-only” dataset for apps/VMs, and that’s it.

On one hand, the restriction looks strange.
On the other hand, having a raidz2 (4+ drive) special vdev is not attractive if a 3-way mirror does the job with appropriate redundancy. And a raidz sVDEV would raise the same issue of low efficiency for small files (blocks) as the main raidz storage.

Until now, I only thought of the special vdev as a matter of performance, hence SSD, with low capacity requirements.
But with huge dRAID vdevs, a “residual” chunk of a big file that would be too small to store efficiently across the whole dRAID could still be sizeable (up to megabyte size, maybe?), and there could be lots of these; so the sVDEV could be a matter of capacity. Then a raidz2 sVDEV with HDDs may make sense next to a dRAID array. :open_mouth:

Mind-opening stuff, but not for my use case.