Analyse usage of special vdev

There is: Flexibility.

With bulk storage on HDD raidz and apps/VMs on an SSD mirror, each class of storage is monitored on its own and evolves on its own terms.

With a single raidz+svdev pool, everything is tied together. There’s an extra, slightly hidden, parameter to monitor: the filling of the special vdev—you don’t want it to fill up and spill metadata (and soon zvols) over to the HDDs. You need to carefully plan the size of the special vdev, except that the “tail” blocks make it hard to predict. If you need to enlarge the svdev afterwards, tough luck: SATA SSDs are a dying breed, limited in size; NVMe SSDs are very much alive and growing, but ports are in short supply on motherboards outside the EPYC 5k+/Xeon Scalable class; and if you created the svdev as a 3-way mirror, you’re locked into it because of the HDD raidz—you can’t swap it for a more space-efficient raidz, you can only add another mirror vdev, and find the ports for it. And there is potentially a double constraint on ports (SATA/SAS + NVMe) if you ever have to move the pool to another server.
Nothing that can’t be solved, but a lot of potential headaches. Caveat emptor.
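
For what it’s worth, that hidden parameter is at least easy to watch (a sketch—“tank” is a placeholder pool name):

    # Per-vdev capacities, including the special vdev's own ALLOC/FREE/CAP.
    zpool list -v tank

    # Module parameter that reserves part of the special class for metadata:
    # once free space drops to this level, small data blocks go back to the
    # normal vdevs instead.
    cat /sys/module/zfs/parameters/zfs_special_class_metadata_reserve_pct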


Well, performance gains come with a price.

Why do you need to carefully plan the size any more than you would in the separate-SSD-pool scenario? Just don’t set 0 < special_small_blocks < recordsize, the same way you don’t use it with two pools.
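
In other words, if all you want from the single pool is metadata acceleration, leave the threshold off (a sketch—“tank/data” is a placeholder dataset):

    # 0 (the default) disables small-block redirection: only metadata goes
    # to the special vdev, data blocks stay on the regular vdevs.
    zfs set special_small_blocks=0 tank/data
    zfs get special_small_blocks,recordsize tank/data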

How does that differ from a separate ssd pool?

Just as you can’t change it with a separate pool. Although in the separate-pool case, migration would be easier.


I’m just trying to compare apples to apples.


Caveat: This post is from someone with extremely limited technical ability beyond that of reading already created guides.


On this note, I’ve been tentatively exploring moving away from Truenas CE given the Incus debacle. I don’t know if I will at this point, but I want the option if I have to execute.

Given my main storage pool is ten 16TB drives in RAIDZ2 with a three-wide mirror of NVMe drives, I’ve come to the conclusion that the only viable path out is a full backup on a separate machine, likely a true file server just to hold the data, and then destroying the sVDEV.

Given this is a plan to move up to 100TB of data, that backup server is likely to be somewhat costly in terms of disks (even if everything else is insanely cheap, e.g. a refurbished Supermicro CSE 846 costs less than a single 20TB disk in the UK, and I can find them with a 3-year warranty). Notwithstanding that data should be backed up anyway, this moves the needle from “Should have” to “Must have”.

As it is, I’m stuck exploring the viability (for someone of my technical ability) of using Proxmox with virtualised TrueNAS, plus a TrueNAS backup server: stick with simple NFS, and SMB for phone access.


However, to provide something useful to the thread, I can provide a block histogram for a pool that stores mostly large single files (video files in the GB range, music files in the tens of MB), with special_small_blocks=1M and a recordsize of 2M. I imagine this is close to the most aggressive setting possible; the pool only stores media files, no apps etc., and even then a three-wide mirror of 2TB drives is likely only enough to serve a single ~100TB vdev. Special vdevs really do need to be sized appropriately, and likely oversized.

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   142K  71.2M  71.2M  28.5K  14.3M  14.3M      0      0      0
     1K:  66.4K  73.5M   145M  94.0K  95.1M   109M      0      0      0
     2K:  13.3K  36.4M   181M  2.85K  7.29M   117M      0      0      0
     4K:   230K   941M  1.10G  1.82K  10.0M   127M  5.88K  23.5M  23.5M
     8K:   108K  1.13G  2.23G  3.67K  43.9M   171M   429K  3.40G  3.42G
    16K:  10.4K   238M  2.46G  53.4K   889M  1.03G   133K  2.90G  6.33G
    32K:  18.2K   846M  3.29G   383K  12.2G  13.2G  18.3K   827M  7.14G
    64K:  30.6K  2.75G  6.04G  15.7K  1.42G  14.6G  28.1K  2.52G  9.66G
   128K:  40.4K  7.41G  13.4G  33.4K  5.29G  19.9G  42.3K  7.71G  17.4G
   256K:  60.9K  22.4G  35.9G  19.1K  6.76G  26.7G  64.2K  23.6G  41.0G
   512K:   126K  94.5G   130G  19.6K  14.4G  41.1G   126K  94.5G   135G
     1M:   486K   566G   697G   389K   398G   439G   421K   493G   628G
     2M:  25.5M  51.0T  51.7T  25.8M  51.6T  52.0T  25.6M  64.1T  64.7T
     4M:      0      0  51.7T      0      0  52.0T      0      0  64.7T
     8M:      0      0  51.7T      0      0  52.0T      0      0  64.7T
    16M:      0      0  51.7T      0      0  52.0T      0      0  64.7T
Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
     1    32K     12K     36K     36K    2.67     0.00      L1 object array
   149  74.5K     74K   3.75M   25.8K    1.01     0.00      L0 object array
   150   106K     86K   3.79M   25.8K    1.24     0.00  object array
     2    32K      4K     12K      6K    8.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    32K      4K     12K     12K    8.00     0.00      L1 bpobj
    48     6M    268K    804K   16.8K   22.93     0.00      L0 bpobj
    49  6.03M    272K    816K   16.7K   22.71     0.00  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
 1.83K  29.2M   2.69M   22.9M   12.6K   10.87     0.00      L1 SPA space map
 12.0K  1.50G    423M   1.63G    139K    3.65     0.00      L0 SPA space map
 13.9K  1.53G    425M   1.66G    122K    3.69     0.00  SPA space map
    13   468K    468K    468K     36K    1.00     0.00  ZIL intent log
    49  6.12M    193K    440K   8.98K   32.50     0.00      L5 DMU dnode
    49  6.12M    193K    440K   8.98K   32.50     0.00      L4 DMU dnode
    49  6.12M    193K    440K   8.98K   32.50     0.00      L3 DMU dnode
    50  6.25M    197K    452K   9.04K   32.49     0.00      L2 DMU dnode
    85  10.6M   1.63M   3.49M   42.0K    6.52     0.00      L1 DMU dnode
 20.3K   325M    102M    241M   11.8K    3.18     0.00      L0 DMU dnode
 20.6K   360M    104M    246M   11.9K    3.45     0.00  DMU dnode
    57   228K    224K    492K   8.63K    1.02     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
    38    20K   3.50K     84K   2.21K    5.71     0.00  DSL directory child map
    36    18K     512     12K     341   36.00     0.00  DSL dataset snap map
    73  1.10M    276K   1.80M   25.3K    4.06     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
 19.0K   608M   44.2M    152M   8.01K   13.76     0.00      L2 ZFS plain file
  339K  10.6G   1.77G   4.08G   12.3K    5.99     0.01      L1 ZFS plain file
 26.3M  52.0T   51.7T   64.7T   2.46M    1.01    99.99      L0 ZFS plain file
 26.6M  52.0T   51.7T   64.7T   2.43M    1.01    99.99  ZFS plain file
 12.7K   407M   32.6M    104M   8.15K   12.50     0.00      L1 ZFS directory
 40.1K   429M    105M    322M   8.03K    4.07     0.00      L0 ZFS directory
 52.8K   836M    138M    426M   8.06K    6.07     0.00  ZFS directory
    35  17.5K   17.5K    632K   18.1K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1    32K      4K     12K     12K    8.00     0.00      L1 SPA history
    28  3.50M    349K   1.43M   52.3K   10.27     0.00      L0 SPA history
    29  3.53M    353K   1.44M   50.9K   10.24     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     1  4.50K   4.50K     24K     24K    1.00     0.00  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
   147  90.5K     44K    432K   2.94K    2.06     0.00  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
  111K  99.3M   61.3M   1.04G   9.62K    1.62     0.00  System attributes
     -      -       -       -       -       -        -  SA master node
    35  52.5K   52.5K    632K   18.1K    1.00     0.00  SA attr registration
    70  1.09M    280K   1.20M   17.6K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
    90    45K     512     12K     136   90.00     0.00  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     1  4.50K   4.50K     24K     24K    1.00     0.00  DSL dir clones
     -      -       -       -       -       -        -  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
    27   864K    101K    324K     12K    8.55     0.00      L1 other
   172  1.43M    310K   2.09M   12.4K    4.73     0.00      L0 other
   199  2.27M    410K   2.40M   12.4K    5.67     0.00  other
    49  6.12M    193K    440K   8.98K   32.50     0.00      L5 Total
    49  6.12M    193K    440K   8.98K   32.50     0.00      L4 Total
    49  6.12M    193K    440K   8.98K   32.50     0.00      L3 Total
 19.1K   615M   44.4M    153M   8.01K   13.84     0.00      L2 Total
  354K  11.0G   1.81G   4.21G   12.2K    6.11     0.01      L1 Total
 26.4M  52.0T   51.7T   64.7T   2.45M    1.01    99.99      L0 Total
 26.8M  52.0T   51.7T   64.7T   2.41M    1.01   100.00  Total
  558K  14.0G   2.53G   7.59G   13.9K    5.54     0.01  Metadata Total
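
For anyone who wants to pull the same numbers from their own pool, a histogram like the one above can be produced with something like the following (a sketch—“tank” is a placeholder pool name, and walking all block pointers can take a long time on a large pool):

    # -L skips leak detection (faster), the repeated -b adds per-type block
    # statistics and the block size histogram, -s prints zdb I/O stats.
    zdb -Lbbbs tank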

I’m not following the need to remove the sVDEV if you move your pool to a different appliance / OS but do keep ZFS. You can easily export the pool and reimport a fusion pool into any modern implementation of ZFS. Especially if your ratio of file data to metadata remains close to 100:1 (!!!).

There is ample room left on your sVDEV, and it’s another illustration of why I was skeptical that large-file “tail” blocks are stored on the sVDEV. Per that logic, either your files are incredibly large, their tails are commonly larger than the 1M small-block cutoff, or some other magic is happening.

I’ll just note how little small block file storage is going on in your dataset after accounting for metadata. I thought my dataset was pretty tight, lol. :grinning:

That should be: pass through the drive controllers, and blacklist the controllers.
An HBA is a controller. An NVMe drive is its own PCIe controller.
So if the Proxmox host has a -16i HBA (or a -8i and an expander) and slots for the three NVMe drives in the special vdev, you can just move your pool in, passing through and blacklisting the HBA and the three NVMe drives.
(Replace the three NVMe drives with three SATA SSDs, and the whole pool can hang off just the HBA.)
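
A minimal sketch of what that preparation can look like on the Proxmox host (the PCI IDs and the mpt3sas driver below are examples only—check your own with lspci -nn, and don’t blacklist the nvme driver if the host itself boots from NVMe):

    # Identify the HBA and the NVMe controllers (vendor:device IDs).
    lspci -nn | grep -Ei 'sas|nvme'

    # Bind those IDs to vfio-pci at boot so the host never claims them.
    echo "options vfio-pci ids=1000:0097,144d:a808" > /etc/modprobe.d/vfio.conf

    # Keep the host's HBA driver away from the card entirely.
    echo "blacklist mpt3sas" >> /etc/modprobe.d/pve-blacklist.conf

    update-initramfs -u -k all    # then reboot and pass the devices to the VM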

:+1:
Useful data point, showing an svdev at around 1% of total storage in this case of “mostly large files”.


For SOHO applications, Proxmox seems like the best solution for those who want to run proper VMs and consolidate multiple devices into one appliance?

Especially if it has a bunch of network ports that allow even physical separation of VLANs / network segments, whatever. That seems to be the intent of some of those motherboards out there that have 8 NIC jacks, no?

Proxmox seems to have come up with a stable, time-tested solution re: virtual machines just as iXsystems has come up with a time-tested NAS implementation.

The silent corruption issues associated with TrueNAS under Proxmox have me concerned, however. Is that problem avoided when the HBA is blacklisted as described? I thought there was more to that than isolation but I’m happy to be wrong.

With this setting and all files being >2M, (approximately) every other tail would end up on the sVDEV. I wonder what the end goal of this setting was… Looks unreasonable to me.

As a rule of thumb, metadata is about 0.3% of the data size with 128K records. So that’s 300GB of metadata for 100TB of data (with 128K records), or 18.75GB with 2M records. Your metadata is 14G for 65T of data, so the rule kinda works. I, for one, have 6GB of metadata for 16TB of data. My pool has different files (not only media), but for the media I used 4M records.
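
A quick back-of-the-envelope version of that rule of thumb (just a sketch—the 0.3%-at-128K figure is the estimate above, scaled inversely with recordsize, and the input values are examples):

    # Estimate metadata size from data size and recordsize.
    data_tb=100          # amount of data, in TB
    recordsize_k=2048    # recordsize in KiB (2M)
    awk -v d="$data_tb" -v r="$recordsize_k" \
        'BEGIN { printf "~%.2f GB of metadata\n", d * 1000 * 0.003 * 128 / r }'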

The takeaway: a 128GB sVDEV would probably be enough in the general case (with only a metadata performance boost in mind) for sub-100TB pools. As would all these “2x256G SSD for my apps and VMs” setups. Although no VMs can be saved on the sVDEV as of now…


I’d counsel some caution re: general statements. A lot re: sVDEV use has to do with what is being stored, to what extent the pool can rebalance its contents into the sVDEV, and so on. We shouldn’t jump to conclusions just because our use case indicates something or other.

For example, I significantly shrunk my metadata requirements and small-file instances by consolidating a huge number of files into archives. Not everyone will want to, or can, do that.

In relation to destroying the svdev, I suppose I skipped a point, which is related to etorix’s post:

I understand the process, but do I trust myself not to screw it up? Hell no. I can probably export the pool successfully unless fat fingers strike, but I’d have to have the NVMe drives attached to Proxmox before they could be blacklisted, so there’s definitely room for error here (although adding one drive at a time means it would not be a single point of failure).

So yes, I don’t need to “destroy” the svdev, I’d try to import it, but I wouldn’t start the transfer process without having the data backed up in case I make mistakes. But even if it’s a perceived skill issue, it doesn’t negate the point IMO that an svdev does add a level of risk, and hence increases the need for viable backup systems, and hence increases the cost overhead of a home lab. It’s just a point for people looking at an svdev to consider IMO.

Also, I’d have to get an expander, given I’m using the SFF-8654 for connecting the HDDs (well, I’m actually using something like 17 SATA drives including the dual boot), not that this is a significant cost. Although I do need to get an LSI 9300 as well (and drop the 9207 into the backup system).

tl;dr The setting is designed to keep large media files on the HDDs but allow the smaller media files and “meta data” (such as associated picture files and text files), and potentially apps datasets, to live on the faster NVMe drives.

Firstly, I’d echo Constantin’s point: generalizations about svdevs seem risky. The svdev is essentially an exception to an exception, with at least two levels of control (recordsize, special_small_blocks) that are easily adjustable, and other influences that are less controllable (tail sizes etc.) and are a function of the files kept on the pool.

Not all files are large files. For example, “meta data” (I suppose the definition is not strictly media files—poor communication on my part previously, sorry) is also stored in datasets on this pool. The advantage of setting special_small_blocks so high is that all of this data is kept on the NVMe drives for faster access. This inflates the amount of storage past 450GB at current levels, and also circumvents the problem if (or rather, when) I start saving media data such as epub files into the same storage pool. Plus I can use the svdev as a potential location for an apps dataset (I use a separate mirrored NVMe pool for apps and VMs).
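
For reference, a per-dataset setup along those lines would look something like this (a sketch—“tank/media” is a placeholder dataset, the values only affect newly written blocks, and recordsize above 1M needs a reasonably recent OpenZFS with large blocks enabled):

    # Big records for the media dataset; anything stored in blocks of 1M or
    # smaller (small files, sidecar "meta data", tails) lands on the svdev.
    zfs set recordsize=2M tank/media
    zfs set special_small_blocks=1M tank/media

    # Verify what is actually in effect.
    zfs get recordsize,special_small_blocks tank/media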

It’s worth pointing out that the “meta data” files are going to be kept somewhere, and it would be better that they are not kept on the HDDs, clogging them up. If anything, this is the direction that mav et al. seem to be looking at. Is it more expensive than not using an svdev? Yes. But is it drastically more expensive than using smaller NVMe drives when I’ve already invested in the EPYC infrastructure? Not really, as I reused a 2TB drive (980 Pro) and bought two DRAM-less WD SN770 drives on special offer. I think it cost me about the same as if I’d bought three 500GB NVMe drives. If I’d bought all three 2TB drives new I’d have been looking at another £80.

So I feel comfortable with this approach of over-provisioning the svdev, even if I’m now less comfortable with maintaining a Truenas system compared to last year.


Dumb question: did you rebalance your pool post sVDEV install or set your pool up with a sVDEV prior to filling it?

This is different than the special small blocks class. No matter what you specify as the small blocks threshold, all metadata will be written to the special vdev. You can disable the special small blocks value, so that only metadata is written to the special vdev, while all data blocks are written to the storage vdevs.

This will grant you a notable performance boost, since even if metadata gets evicted from the ARC, it can be pulled very quickly from your NVMe’s.

Performance-wise, you’ll notice the benefits of metadata on SSD/NVMe way more than any potential performance gains from “small blocks” on a special vdev, especially if they happen to be “tail” or “intermediary” compressed blocks.

If the HDD is already spinning, seeking, and loading the data blocks of a file, the little bit of “offloading” for small blocks from the NVMe (of the same file) is nothing compared to the effects of fast-loading metadata.

That’s why I think there’s a confusion between home users and enterprise, dRAID, and very wide RAIDZ systems. The home user has in mind performance increases, rather than a bit more “efficiency” of storing small data blocks.


EDIT: Alternatively, there’s a safer method to achieve performance gains in regards to metadata.

First, increase RAM. See if that helps. Later, you can try to adjust the arc.meta_balance tunable. If reboots happen too often or your metadata is somehow really massive, then your last option is to add an NVMe as a persistent L2ARC (secondary cache)[1] to house only metadata.


  1. Unlike a special vdev, an L2ARC only requires a single device. It doesn’t need any sort of redundancy. Losing it will not destroy your entire pool. It can be removed in the future. ↩︎
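
If it helps, the persistent metadata-only L2ARC option looks roughly like this (a sketch—“tank”, the device path, and the dataset are placeholders):

    # Add a single NVMe as a cache (L2ARC) device; no redundancy required.
    zpool add tank cache /dev/disk/by-id/nvme-EXAMPLE

    # Restrict the L2ARC to caching metadata only for a given dataset.
    zfs set secondarycache=metadata tank/dataset

    # On OpenZFS 2.0+ the L2ARC persists across reboots as long as the
    # l2arc_rebuild_enabled module parameter is 1 (the default).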


Constantin: raidz expansion from 6 to 10 drives, added the svdev, rebalanced everything last August, then added about 18TB of new data.

Winnie: Yes, but I’m taking it as read that files are accessed more quickly off NVMe than HDD. I used the term “meta data” in quotation marks to denote actual files, so it’s worth pointing out that the strategy behind the setup I posted is different from the usual suggested strategy. In my defence (of the poor communication strategy of using the same term for two different concepts), this is how Honeybadger described svdevs on a previous T3 podcast. If there is a better term I’m all ears.
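
For anyone wondering what the rebalancing step amounts to in practice: existing blocks never move on their own, so files written before the svdev was added have to be rewritten for their metadata and small blocks to land there. A very rough sketch of the usual copy-and-rename approach (hypothetical path; it breaks hardlinks, inflates snapshot usage, and should be tested on unimportant data first):

    # Rewrite one file so its blocks are reallocated under the new layout.
    f="/mnt/tank/media/example.mkv"
    cp -a "$f" "$f.rebalance.tmp" && mv "$f.rebalance.tmp" "$f"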

:+1:

:expressionless:

I want to add that the efficiency should be calculated as well. In the aforementioned 9-wide dRAID example, we are trading 9 HDD ashifts for (a minimum of) 2 SSD ashifts. Depending on prices, that may not be a good trade. Perhaps dRAIDs are usually (much) wider, but it can still be an inefficient trade if the SSDs are Optane, for one.


Hmm, I literally just bought 3 SAS SSDs to act as a special vdev for metadata. I am now wondering if I should do this instead, and return 2 of them… I am a simple home user, but I know I am thrashing my Z2 10-wide array fairly hard these days. Well, not THAT hard compared to some, I am sure, but still, homelab is a fun hobby and, per usual, I figured playing with some new tech is fun… But maybe this is a more intelligent way of doing this. Hmm.

This whole small-block debate has really got me thinking about how best to do this. I planned on setting special_small_blocks to 32K and recordsize to 1M (maybe 2M depending on feedback; that seems to be yet another hot debate).