Analyse usage of special vdev

I was wondering if there are any scripts I am unaware of that help with analyzing the usage of a special vdev.

I played around with my settings and probably underestimated the number of files falling under special_small_blocks when migrating to my new TrueNAS :slight_smile: Either way, I was surprised to see that my sVDEV is 73% full.

I set the metadata (special) small block size for all my datasets back to 0 after noticing it, but of course that will not do anything retroactively. So now I am wondering whether the usage is mostly from my dataset1 or dataset2.
Is there any way to find that out?

I was wondering if by using

sudo zdb -bbb pool/Nextcloud
Dataset pool/Nextcloud [ZPL], ID 554, cr_txg 464, 1.29T, 916802 objects

and given that in the web GUI the usage of that dataset is 1.38 TiB, does that mean that 0.09 TiB of it sits on the special vdev?
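
For reference, this is roughly the comparison I am making (the dataset name is just my Nextcloud dataset, and whether zdb’s per-dataset figure really excludes what sits on the special vdev is exactly the assumption I’d like to confirm):

# zdb reports one usage figure in its "Dataset ..." line ...
sudo zdb -bbb pool/Nextcloud
# ... while zfs (and the web GUI) report another
sudo zfs list -o name,used,logicalused pool/Nextcloud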

I don’t think these outputs are relevant, but here they are just in case.

sudo zpool list -v
NAME                                       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool                                  222G  5.64G   216G        -         -     0%     2%  1.00x    ONLINE  -
  sdk3                                     223G  5.64G   216G        -         -     0%  2.53%      -    ONLINE
pool                                       146T  34.8T   112T        -         -     7%    23%  1.00x    ONLINE  /mnt
  raidz2-0                                 146T  34.2T   111T        -         -     7%  23.5%      -    ONLINE
    sda2                                  18.2T      -      -        -         -      -      -      -    ONLINE
    0fb00072-1728-11ee-9a37-3cecefeb2d66  18.2T      -      -        -         -      -      -      -    ONLINE
    0f700003-1728-11ee-9a37-3cecefeb2d66  18.2T      -      -        -         -      -      -      -    ONLINE
    sdd2                                  18.2T      -      -        -         -      -      -      -    ONLINE
    sde2                                  18.2T      -      -        -         -      -      -      -    ONLINE
    sdf2                                  18.2T      -      -        -         -      -      -      -    ONLINE
    sdg2                                  18.2T      -      -        -         -      -      -      -    ONLINE
    sdh2                                  18.2T      -      -        -         -      -      -      -    ONLINE
special                                       -      -      -        -         -      -      -      -         -
  mirror-1                                 888G   649G   239G        -         -    72%  73.1%      -    ONLINE
    sdi1                                   894G      -      -        -         -      -      -      -    ONLINE
    sdj1                                   894G      -      -        -         -      -      -      -    ONLINE

and

sudo zdb -bbb pool

Traversing all blocks to verify nothing leaked ...

loading concrete vdev 1, metaslab 110 of 111 .....
34.7T completed (216387MB/s) estimated time remaining: 0hr 00min 00sec        
        No leaks (block sum matches space maps exactly)

        bp count:              29199457
        ganged count:                 0
        bp logical:      29095585112064      avg: 996442
        bp physical:     28819924155904      avg: 987002     compression:   1.01
        bp allocated:    38279242518528      avg: 1310957     compression:   0.76
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        bp cloned:           1022070784    count:   1013
        Normal class:    37581554810880     used: 23.49%
        Special class      696664948736     used: 73.07%
        Embedded log class              0     used:  0.00%

        additional, non-pointer bps of type 0:     611989
         number of (compressed) bytes:  number of bps
                         17:      3 *
                         18:      0 
                         19:      1 *
                         20:      0 
                         21:      8 *
                         22:   1723 *
                         23:    426 *
                         24:     21 *
                         25:     12 *
                         26:     13 *
                         27:    159 *
                         28:  12105 ***
                         29:  84766 ****************
                         30:     11 *
                         31:     99 *
                         32:     17 *
                         33:     13 *
                         34:     10 *
                         35:     13 *
                         36:      7 *
                         37:     41 *
                         38:      9 *
                         39:      8 *
                         40:    143 *
                         41:   1690 *
                         42:     25 *
                         43:     16 *
                         44:    188 *
                         45:    303 *
                         46:  15116 ***
                         47:    562 *
                         48:  32330 *******
                         49: 213359 ****************************************
                         50:   2738 *
                         51:   7319 **
                         52:   5803 **
                         53:  41045 ********
                         54:  65818 *************
                         55:   7460 **
                         56:    373 *
                         57:   1044 *
                         58:   4235 *
                         59:   1196 *
                         60:    390 *
                         61:    283 *
                         62:    893 *
                         63:   3848 *
                         64:   1110 *
                         65:   3710 *
                         66:  21403 *****
                         67:   1407 *
                         68:    324 *
                         69:    443 *
                         70:    495 *
                         71:    625 *
                         72:    523 *
                         73:    373 *
                         74:    476 *
                         75:    585 *
                         76:    536 *
                         77:    421 *
                         78:    966 *
                         79:   2507 *
                         80:   1233 *
                         81:   1911 *
                         82:   4776 *
                         83:  10700 ***
                         84:   6057 **
                         85:   9807 **
                         86:   8215 **
                         87:   1079 *
                         88:   1370 *
                         89:   2363 *
                         90:   2258 *
                         91:    485 *
                         92:    495 *
                         93:    556 *
                         94:    789 *
                         95:   2216 *
                         96:   1413 *
                         97:    411 *
                         98:   1108 *
                         99:   2356 *
                        100:    593 *
                        101:    598 *
                        102:    581 *
                        103:    692 *
                        104:    897 *
                        105:    843 *
                        106:    727 *
                        107:    876 *
                        108:   1207 *
                        109:   1045 *
                        110:    731 *
                        111:    961 *
                        112:   1094 *
        Dittoed blocks on same vdev: 557697

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
     1    32K     12K     36K     36K    2.67     0.00      L1 object array
   148    74K     74K   1.73M     12K    1.00     0.00      L0 object array
   149   106K     86K   1.77M   12.2K    1.23     0.00  object array
     1    16K      4K     12K     12K    4.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
    43  1.34M    180K    540K   12.6K    7.64     0.00      L1 bpobj
 10.9K  1.37G   48.8M    146M   13.4K   28.66     0.00      L0 bpobj
 11.0K  1.37G   49.0M    147M   13.4K   28.59     0.00  bpobj
     -      -       -       -       -       -        -  bpobj header
     -      -       -       -       -       -        -  SPA space map header
 1.35K  21.6M   5.43M   16.3M   12.1K    3.98     0.00      L1 SPA space map
 12.7K  1.58G    594M   1.74G    141K    2.73     0.00      L0 SPA space map
 14.0K  1.60G    599M   1.76G    128K    2.74     0.00  SPA space map
     4   144K    144K    216K     54K    1.00     0.00  ZIL intent log
 1.11K   142M   4.44M   8.88M      8K   32.00     0.00      L5 DMU dnode
 1.11K   142M   4.44M   8.88M      8K   32.00     0.00      L4 DMU dnode
 1.11K   142M   4.44M   8.88M      8K   32.00     0.00      L3 DMU dnode
 1.11K   142M   4.45M   8.89M   8.00K   32.00     0.00      L2 DMU dnode
 4.88K   624M    209M    417M   85.6K    2.99     0.00      L1 DMU dnode
 97.4K  1.52G    414M    843M   8.66K    3.76     0.00      L0 DMU dnode
  107K  2.69G    641M   1.27G   12.1K    4.29     0.00  DMU dnode
 1.11K  4.45M   4.45M   8.90M   8.00K    1.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
    15     8K   1.50K     24K   1.60K    5.33     0.00  DSL directory child map
     1    32K      4K     12K     12K    8.00     0.00      L1 DSL dataset snap map
    45   538K    264K    804K   17.9K    2.03     0.00      L0 DSL dataset snap map
    46   570K    268K    816K   17.7K    2.12     0.00  DSL dataset snap map
    27   386K     96K    288K   10.7K    4.02     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
     7   224K     28K     56K      8K    8.00     0.00      L3 ZFS plain file
 8.71K   279M   35.1M   70.1M   8.05K    7.95     0.00      L2 ZFS plain file
  270K  8.44G   1.80G   3.60G   13.6K    4.69     0.01      L1 ZFS plain file
 26.7M  26.4T   26.2T   34.8T   1.30M    1.01    99.98      L0 ZFS plain file
 27.0M  26.5T   26.2T   34.8T   1.29M    1.01    99.99  ZFS plain file
 13.7K   438M   54.8M    110M      8K    8.00     0.00      L1 ZFS directory
  643K   778M    151M    484M     771    5.14     0.00      L0 ZFS directory
  656K  1.19G    206M    594M     926    5.90     0.00  ZFS directory
    12     9K      9K     96K      8K    1.00     0.00  ZFS master node
     -      -       -       -       -       -        -  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     3    96K     44K    132K     44K    2.18     0.00      L1 SPA history
   654  81.8M   7.49M   22.5M   35.2K   10.91     0.00      L0 SPA history
   657  81.8M   7.54M   22.6M   35.2K   10.86     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     2     2K   1.50K     12K      6K    1.33     0.00  DSL dataset next clones
     -      -       -       -       -       -        -  scan work queue
 2.69K  1.35M      2K     16K       5   689.75    0.00  ZFS user/group/project used
     -      -       -       -       -       -        -  ZFS user/group/project quota
     -      -       -       -       -       -        -  snapshot refcount tags
     -      -       -       -       -       -        -  DDT ZAP algorithm
     -      -       -       -       -       -        -  DDT statistics
    87  69.5K   69.5K    696K      8K    1.00     0.00  System attributes
     -      -       -       -       -       -        -  SA master node
    12    18K     18K     96K      8K    1.00     0.00  SA attr registration
    30   480K    120K    240K      8K    4.00     0.00  SA attr layouts
     -      -       -       -       -       -        -  scan translations
     -      -       -       -       -       -        -  deduplicated block
 3.17K   101M   12.7M   38.0M     12K    8.00     0.00      L1 DSL deadlist map
 54.2K   867M    421M   1.23G   23.3K    2.06     0.00      L0 DSL deadlist map
 57.4K   968M    433M   1.27G   22.7K    2.23     0.00  DSL deadlist map
     -      -       -       -       -       -        -  DSL deadlist map hdr
     2     2K   1.50K     12K      6K    1.33     0.00  DSL dir clones
     4   512K     16K     48K     12K   32.00     0.00  bpobj subobj
     -      -       -       -       -       -        -  deferred free
     -      -       -       -       -       -        -  dedup ditto
     9   232K     36K    108K     12K    6.44     0.00      L1 other
 3.35K  3.55M   1.98M   40.0M   12.0K    1.79     0.00      L0 other
 3.35K  3.78M   2.01M   40.1M   12.0K    1.88     0.00  other
 1.11K   142M   4.44M   8.88M      8K   32.00     0.00      L5 Total
 1.11K   142M   4.44M   8.88M      8K   32.00     0.00      L4 Total
 1.12K   142M   4.47M   8.94M      8K   31.85     0.00      L3 Total
 9.82K   421M   39.5M   79.0M   8.04K   10.66     0.00      L2 Total
  293K  9.60G   2.08G   4.17G   14.6K    4.63     0.01      L1 Total
 27.5M  26.5T   26.2T   34.8T   1.26M    1.01    99.99      L0 Total
 27.8M  26.5T   26.2T   34.8T   1.25M    1.01   100.00  Total
 1.10M  16.6G   3.73G   8.72G   7.91K    4.45     0.02  Metadata Total

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  38.1K  19.0M  19.0M  38.1K  19.0M  19.0M      0      0      0
     1K:  47.0K  54.7M  73.8M  47.0K  54.7M  73.8M      0      0      0
     2K:  35.9K  91.8M   166M  35.9K  91.8M   166M      0      0      0
     4K:   388K  1.54G  1.70G  38.9K   217M   382M   117K   466M   466M
     8K:   213K  2.16G  3.87G  46.7K   517M   899M   424K  3.55G  4.00G
    16K:  34.1K   729M  4.58G   208K  3.41G  4.29G   198K  4.44G  8.44G
    32K:  39.1K  1.77G  6.35G   320K  10.3G  14.6G  42.4K  1.91G  10.4G
    64K:  51.6K  4.67G  11.0G  30.3K  2.72G  17.3G  49.8K  4.53G  14.9G
   128K:  68.4K  12.7G  23.7G  63.1K  9.45G  26.8G  49.3K  9.58G  24.5G
   256K:   159K  60.6G  84.2G  22.4K  7.97G  34.8G   107K  40.9G  65.4G
   512K:   312K   226G   310G  26.6K  19.5G  54.3G   250K   187G   252G
     1M:  25.9M  25.9T  26.2T  26.4M  26.4T  26.5T  26.1M  34.6T  34.8T
     2M:      0      0  26.2T      0      0  26.5T      0      0  34.8T
     4M:      0      0  26.2T      0      0  26.5T      0      0  34.8T
     8M:      0      0  26.2T      0      0  26.5T      0      0  34.8T
    16M:      0      0  26.2T      0      0  26.5T      0      0  34.8T

Regarding the amount of data in special small blocks: AIUI, it’s about blocks, not files.

Let’s assume you have a dataset with 128K recordsize and 32K special_small_blocks. Now, a 160KB file would consist of 2 blocks – one of 128KB and one of 32KB. And that second one would go into sVDEV.
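
To make that concrete (the dataset name is only an example, and this is my rough understanding rather than an authoritative description – note that only the current property values are shown; data written under an older cutoff stays where it was written):

# check the cutoff and recordsize that currently apply to a dataset
zfs get recordsize,special_small_blocks pool/dataset
#
# With recordsize=128K and special_small_blocks=32K, a 160KB file becomes:
#   block 1: 128KB  -> normal (HDD) vdevs
#   block 2:  32KB  -> at or below the cutoff, so it lands on the sVDEV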

Don’t know. But there is a GitHub issue (with no activity).


I believe your interpretation is incorrect but I’ll page @winnielinnie to be sure.

As I understand it, a file that is larger than the small-file cutoff limit stays 100% on the regular data VDEVs. The leftover “tail” of the file is stored in the next available record on those VDEVs, and the empty space is compressed away.

Record sizes matter a lot for performance, because a recordsize aligned with the use case significantly reduces metadata and overhead. For example, a 1M recordsize is usually a great idea for a dataset that will hold large files like images and videos. Small recordsizes make a lot of sense for databases or VMs, where a lot of small content can change quickly.
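
For example (dataset names are made up, and these are the usual rules of thumb rather than hard requirements):

# large sequential media -> big records, less metadata per TB stored
zfs set recordsize=1M pool/media
# databases or VM images -> small records, to avoid rewriting a huge
# block every time a few KB change
zfs set recordsize=16K pool/databases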

The mismatch between the sVDEV capacity used and the general VDEV usage suggests either a small-file cutoff that was set too aggressively high, or the need to at least triple the installed sVDEV capacity.

At this level of sVDEV fill, I believe ZFS is already stashing small files into the general VDEVs or is about to. Depends a bit on how you allocated metadata vs. small files in the sVDEV.

The cutoff was set too aggressively, I guess. It is not that many files, and if it had only been set to metadata (0) I would probably be way below 100 GB of usage.

Hehe, that is the thing: I changed things and played around, so I don’t know which files are on the sVDEV and which ones are not.

It is also not that big of a deal, since I am still only at 73.1% and it has stayed there over the last months. But if it gets closer to 75%, I might need to think about the rebalancing tool.

@swc-phil’s interpretation is correct. The 160KB file will have its first 128K block stored on HDD, while its second block, compressed down to 32KB, will be stored on the NVMe.

This is yet another thing about developer decisions I disagree with when it comes to OpenZFS.

I think the whole concept of a “special small blocks” vdev is silly. The real reason people use such a setup is for the sake of quickly retrieving small files. To add to the confusion, many users possibly confuse it with metadata, which is a different type of block altogether.


Retrieving metadata from a faster device makes sense, as the results are felt by the user: Browsing directories, apps scanning through many files, rsync crawling entire roots, and other typical usages.

Retrieving small blocks from a faster device doesn’t really make much sense. What is it about your 32KB text file, comprised of a single “small” block, that makes it more important than a 96KB text file, comprised of a single “not small” block? The same can be said for binary files.

Why would you want 128K incompressible blocks to be stored on and retrieved from spinning HDDs, but compressible blocks or “tail” blocks from the same file to be stored on and retrieved from NVMe?

I can understand the case for a dedup or metadata special vdev. I just don’t see the practicality of a “small blocks” vdev, especially when the blocks of a file can vary in size, depending on compressibility and if they happen to be at the “end” of a file.

You also risk trapping yourself in a situation where you no longer wish to use a special vdev: if any vdev in your pool is RAIDZ, you won’t be able to remove the special vdev.


Great input, thanks guys! I hadn’t even thought about that.

Well, it makes sense to tackle the RAIDZ or dRAID efficiency problem, doesn’t it? Say I have a dRAID with, let’s say, 20 data drives. This means every file needs to be at least 80K (for 4K drives), so I could set 80K as the cutoff to not waste storage?
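
Rough arithmetic behind that number (assuming 4K-sector drives, i.e. ashift=12, and ignoring parity and padding details):

# 20 data drives x 4K per drive = 80K minimum full stripe
echo $((20 * 4))K
# 80K itself is not a power of two, though, so as far as I understand
# the closest settable special_small_blocks values would be 64K or 128K.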

The special vdev needs to have the same redundancy as the data vdevs to keep your entire pool safe. This means that for RAIDZ2 data vdevs, your special vdev should be a 3-way mirror. That’s three extra NVMe drives. The additional hardware doesn’t seem very “efficient”.
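
To illustrate what that looks like when adding the vdev (the device names are placeholders, and this is just the generic zpool syntax, not a recommendation):

# RAIDZ2 data vdevs -> pair the pool with a 3-way mirrored special vdev
zpool add pool special mirror nvme0n1 nvme1n1 nvme2n1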

That is assuming that all drives are equal and that all drives have the same resilver times.

Again, I believe that an 8-wide RAIDZ with 2 HDD vendors has a higher chance of failing (by 3 disks dying from a bad batch) than an SSD mirror from two vendors (by two disks dying).

Especially considering the bathtub curve in a home lab scenario.
Github argument about reliability calculators

That is why I have no problem running a RAIDZ2 with a special vdev mirror, even though I know that if either vdev goes offline, all data is lost.

I would not use it in production though.


That makes sense.

The user must still keep in mind that if any vdev in their pool is RAIDZ, they cannot remove the special vdev later.


Other than metadata, I don’t get the point of arbitrarily putting “small blocks” on a special vdev.

Let’s think about it for a minute.

You have a dataset with a recordsize of 1M. You configure a special vdev to store “small blocks” of 32K or less.

If you’re saving images and videos, then it means the last block of every file that happens to be 32K or smaller (after compression) will be saved on the NVMes instead of the HDDs.

Why would you want this? Why do you want your HDDs to read all of the data blocks of your image and video files, yet have the very last blocks of those files be “quickly” retrieved from the NVMe, just because they happen to compress to less than 32K due to the padding of null bytes at the end of the file?
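
If anyone wants to gauge how much data such a cutoff would actually catch on their own pool, my reading of zdb’s output is that the psize “Cum.” column of the Block Size Histogram, up to the row matching the cutoff, gives a rough upper bound (treat this as an estimate, not an exact answer):

# pool-wide block size histogram, skipping the slow leak check
sudo zdb -Lbbb pool | sed -n '/Block Size Histogram/,$p'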


:100:

Because more 32KB files would fit in the faster and smaller storage tier, thus saving IOPS on the slower storage (at least in a full-readout workload?). At least that’s how I see it.

I suspect that it was a design limitation: the write-allocation logic(?) would need to be aware of the entire file instead of just one block. That is pure speculation on my part.

I guess what I’m trying to get at is the functional difference, based on the types of reads and files.

Why would a bunch of 32KB binary, config, or other small files be accessed more frequently and randomly than the same types of files that happen to be slightly larger than 32K?

A file’s type is a better indicator of access habits than its size. ZFS can’t really know this; the best it can offer is an intelligent ARC.

Ironically, that would also apply to the last blocks of large files that happen to compress below the threshold. :laughing: In that situation, the only IOPS that would be spared are the ones for reading the very end of a file… which is already being pulled from the HDDs anyway.

“The first 2 MB of your 2.025-MB photo I will read from the spinning HDDs. But that last block at the very end of the file? I’ll read that from the NVMe!”

No one who thinks about adding and configuring a special vdev actually wants that to happen. They have other ideas about special vdevs.


We are on the same page about tail blocks. Actually, I assume that tail blocks caused @Sara’s high sVDEV utilisation.

If they are accessed equally frequently, then my take about IOPS is totally legit.


I wish the ZFS devs were too. :smile:

Not sure how much they would need to rewrite the code to only apply “special small blocks” to single-block files. After all, it’s only the first block of a file that can be stored as a special “variable size” block before compression, if it is the only block that comprises the file.

If it was done this way, to only target single-block files, then it means an entire 2.025-MB photo will be saved on the spinning HDDs. A small file, however, that is only 32-KB or smaller, will be saved on the special vdev. This is what I believe users are expecting when they add and configure a “special small blocks” vdev. I don’t think they assume that the tail blocks will also be saved on the special vdev.
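
A rough way to see how many files would match that expectation (the path is an example, and this counts files by apparent size before compression, so it only approximates what would be a single-block candidate at a 32K cutoff):

# files small enough to plausibly be single-block, sub-32K candidates
find /mnt/pool/Nextcloud -type f -size -32k | wc -l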

Sadly, this is not how a special vdev handles “small blocks” in OpenZFS. :confused: I have a feeling that when it was implemented back then, they didn’t really consider typical use-cases.


Hey, would you mind pointing me to the right place to follow up on this issue? I’d love to confirm this behavior, since it runs counter to intuition and to my very limited sVDEV capacity use, despite massive photo collections, Time Machine bands, and similar opportunities for file tails to clog up my sVDEV.

Not at home, but I’ll try to pull up where I read about it. I think it was on HackerNews, Github, or Reddit. I saw it explained by Rincebrain and @mav.

I don’t think it counts as an “issue” because it’s not considered a bug. It was designed like this.

@Sara, what is your recordsize, and what was the special_small_blocks value? Let’s make some rough estimations.

Recordsize is apparently 1M… and there is “only” 252GB of blocks with size <=512K. Something doesn’t add up, unless you have “SSD-only” datasets.

You can only make really rough estimations, since I changed the special_small_blocks value multiple times :slight_smile: I had it at 128K, at 64K, and also at 16K for some time, but for almost a year it has been back at 0. But since a lot of my data is probably unchanged, that does not mean much, because changing the value does not do anything retroactively.

Not sure what you mean. Yes, recordsize is 1M, and almost all blocks are 1M in size according to the block size histogram.

At that point the use case was to allow more efficient storage of small blocks, which would require too much overhead if stored on some wide dRAID normal vdev. From that perspective it does not really matter whether it is the first block, the only block, or any other block.

But there is indeed another possible (and, I agree, quite likely) use case: storing small files on the special vdev to reduce HDD head seeks, while large files can be accessed from the HDDs more efficiently thanks to read-ahead and write-back. For this use case the code is indeed not very optimized right now. The spa_preferred_class() function making the decision does have a zio_t argument, so from io_bookmark it could in theory distinguish the first block of a file from any others. But it cannot tell whether it is the only block of a file.


This is the assumption that I believe many (non-enterprise) end-users have when they add a special vdev to house “small blocks”.

See above, as @Constantin can attest to this expectation.
