Special VDEV (sVDEV) Planning, Sizing, and Considerations

Constantin · May 24, 2024, 7:46pm

This resource was originally sourced from a Suggestion thread here:
https://forums.truenas.com/t/suggestion-better-svdev-planning-oversight-tools-in-gui/394/

Why consider a sVDEV?

A sVDEV has the potential to significantly speed up the operations of a HDD pool by putting small files and pool metadata into a faster special VDEV (sVDEV), which usually consists of SSDs. You can also designate specific datasets to reside on a sVDEV by setting record size in the dataset configuration page to equal the sVDEV small file cutoff. Thus, the common practice of segregating pools by hard drive type, i.e. Solid State Disk (SSD) vs. Hard Disk Drive (HDD) is no longer strictly necessary.

What about L2ARC?

The main difference between Second Level Adaptive Replacement Cache (L2ARC) and sVDEV is that while L2ARC can cache frequently-accessed files, or metadata, or files and metadata, it only accelerates reads, it is redundant (everything in it also lives in the pool), and it needs to get “hot”. Unlike sVDEV, the L2ARC has to “miss” having a file or metadata ready for use in cache before the miss is noted and said file / metadata might be read into the L2ARC for later reuse. L2ARC also requires RAM to function.

The common recommendation used to be that a NAS needed a minimum 64GB of RAM available for TrueNAS for L2ARC, but thanks to ongoing development that L2ARC minimum RAM limit is now considerably lower (see here, thank you @Arwen and @SmallBarky!). As usual, it depends on the use case, the specifics of the pool, etc. @jro helpfully developed a L2ARC calculator, which should help figure out how much RAM the L2ARC you may be planning on using will actually consume.

Based on my limited testing, it took my NAS about three passes with rsync for a metadata=only L2ARC to cache all the metadata it needed. As you can see from my test results, using L2ARC solely for metadata sped up rsync tasks significantly. For an in-depth discussion of L2ARC, see this TrueNAS documentation hub page as well as @Arwen’s L2ARC resource. Happily, L2ARC can be made persistent, allowing a “hot” cache to survive reboots. Because L2ARC is redundant, any SSD will do, as long as its read performance is decent. You have little to lose, so try it out; an L2ARC may be all you need for your use case…

So why go down the path of a sVDEV?

Unlike L2ARC, the sVDEV doesn’t need more RAM and it’s 100% hot from the start since all metadata will reside on it. Additionally, the sVDEV will also host small files, which are the worst-performing category for HDDs to deal with. For a L2ARC-sVDEV comparison, see my results from a few years ago comparing the impact of L2ARC and sVDEVs in my rsync use case.

Important: the pool depends on the sVDEV to function, so if the sVDEV dies, your pool dies also. Take the same care with sVDEVs regarding redundancy and resilience as you did with the other VDEVs! Do not use crummy SSDs for the sVDEV unless you don’t care about pool life expectancy.

About record sizes and small file cutoffs…

What constitutes a small file depends in part on the recordsize used in your pool / dataset. By default, TrueNAS uses 128k recordsizes but you can adjust them on a per-dataset basis, if you wish (See Storage/Pool/Dataset triple-dot on the right of the GUI). Larger record sizes (up to 1M) used in conjunction with compression are great for media files. Smaller record sizes are helpful for databases and like use cases.

For simple large-file data transfers, adjusting the record size up to 1M with zstd compression and adding the sVDEV made transfer speeds to my pool jump from about 250MB/s to 400MB/s. Not bad for a Z3 pool consisting of eight He10 hard drives supplemented by the sVDEV.

Planning

Ideally, a special VDEV (sVDEV) is planned carefully in advance and is added to the pool before the pool is populated with data. In order to get moved to the sVDEV, a small file has to be copied into the pool or rebalanced in-situ, so figuring out how much storage the sVDEV of a new pool needs in advance is better than adding one later and rebalancing everything just to migrate small files into the sVDEV.

Similarly, you will want to adjust the record size of each data set ideally in advance to figure out what the best balance between record size and sVDEV use is. The good news is that sVDEV cutoffs can be adjusted by dataset, so there is plenty of flexibility.

This feature is also interesting in that you can make datasets that use the same small file cutoff as the recordsize, ensuring that the entire dataset is stored solely on the sVDEV. A sVDEV can therefore selectively host all data that needs speed (databases, metadata, etc.) as part of a common pool vs. setting up two pools (one using SSDs, the other HDDs). In that way, sVDEVs can really boost pool performance: the parts of the pool that benefit greatly from SSD use get SSD performance without adding a lot of drives.
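To make the "whole dataset on the sVDEV" rule concrete, here is a minimal Python sketch. It assumes the OpenZFS behavior that blocks smaller than or equal to special_small_blocks go to the special class, and that a cutoff of 0 means only metadata goes there. The helper names parse_size and lands_on_svdev are made up for illustration and are not part of any ZFS tooling.

```python
def parse_size(text: str) -> int:
    """Convert a ZFS-style size such as '128K' or '1M' to bytes."""
    units = {"K": 1024, "M": 1024 ** 2}
    text = text.strip().upper()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def lands_on_svdev(recordsize: str, special_small_blocks: str) -> bool:
    """True if every data block of the dataset qualifies for the sVDEV.

    Assumption: a block is sent to the special class when its size is
    less than or equal to special_small_blocks (0 disables small blocks).
    """
    cutoff = parse_size(special_small_blocks)
    return cutoff > 0 and parse_size(recordsize) <= cutoff


# A database dataset tuned to live entirely on the sVDEV
print(lands_on_svdev("64K", "64K"))    # True
# A default dataset with a 32K cutoff: only its small blocks qualify
print(lands_on_svdev("128K", "32K"))   # False
```

This only checks the property values; it says nothing about whether the sVDEV has enough free room to actually hold the dataset.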

Required sVDEV Size

The needed sVDEV size is determined by how much metadata and small file data needs to be held. By default, the sVDEV features a 25/75% split between metadata and small files, respectively. The split can be changed via the zfs_special_class_metadata_reserve_pct parameter (thank you, @HoneyBadger!). The sVDEV does not like to be filled more than 75% by default and you also have to allow for pool expansion in the future. Some extrapolation is likely needed.

Determining Metadata Space Needs

There is a rule of thumb that metadata consumes about 0.3% of total pool capacity, but this will vary by use case. A NAS hosting only large video files will have comparatively little metadata compared to a pool that hosts an uncompressed macOS system folder. If you have the time, there are CLI commands to estimate metadata needs better. In TrueNAS Core as of 13.0U6, the command to determine your metadata space needs is

zdb -LbbbA -U /data/zfs/zpool.cache poolname

That will spit out a series of tables. One will be long… so long in fact that I’ve truncated mine to only show the top and the bottom while cutting out the middle.


Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
                            .... SNIP! .....
 89.0K  2.80G    359M    718M   8.06K    8.00     0.00      L2 Total
 1.61M  51.6G   11.6G   23.2G   14.4K    4.44     0.08      L1 Total
  140M  17.3T   17.2T   30.1T    220K    1.01    99.92      L0 Total
  142M  17.4T   17.2T   30.1T    218K    1.01   100.00  Total

Let me draw your attention to the L1 Total row, i.e. the third row from the bottom. There is our metadata information: 23.2GB (ASIZE), or 0.08% of my pool, was metadata when my record size was 128k (default) with a 32kB small file cutoff. Extrapolating across 50TB of pool capacity in my use case, that comes out to a minimum of about 100GB (23.2G / 0.23 pool fill) potentially needed for metadata.
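The extrapolation is a simple division by the current pool fill. A minimal Python sketch, assuming metadata grows linearly with the amount of data stored (the function name extrapolate_to_full_pool is made up for illustration):

```python
def extrapolate_to_full_pool(current_gb: float, pool_fill: float) -> float:
    """Scale a usage figure observed at `pool_fill` (0-1) up to a 100% full pool.

    Assumption: the figure grows linearly with the amount of data stored.
    """
    if not 0 < pool_fill <= 1:
        raise ValueError("pool_fill must be a fraction between 0 and 1")
    return current_gb / pool_fill


# L1 Total ASIZE from the zdb output above, pool 23% full
metadata_now_gb = 23.2
print(f"{extrapolate_to_full_pool(metadata_now_gb, 0.23):.0f} GB")  # prints 101 GB
```

If your future data will look very different from what is in the pool today (for example many more small files), the linear assumption will underestimate the need.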

Once TrueNAS transitions to OpenZFS 2.2.0 or higher, the current metadata space needs will be even easier to determine via zdb -bbbs with a nice summary at the bottom, see here. Now you know one minimum.

Small File Space Needs

The zdb -LbbbA -U /data/zfs/zpool.cache poolname command will also spit out the answer for your small file needs by publishing something called the “Block Size Histogram”. It summarizes the block sizes used, i.e. how many of each type can be found in your pool today. If your metadata is stored entirely on HDDs, this may take a long time to execute.

Block Size Histogram

  block   psize                  lsize                asize
   size   Count   Size      Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   422K   211M   211M   422K   211M   211M      0      0      0
     1K:   113K   123M   334M   113K   123M   334M      0      0      0
     2K:  65.7K   175M   510M  65.7K   175M   510M      0      0      0
     4K:  1.92M  7.71G  8.21G   119K   554M  1.04G  1.01M  4.04G  4.04G
     8K:   383K  3.56G  11.8G  39.6K   436M  1.46G  1.77M  14.6G  18.6G
    16K:   765K  13.2G  25.0G   316K  5.18G  6.64G   297K  5.87G  24.5G
    32K:   368K  16.5G  41.5G  1.80M  59.3G  65.9G   585K  18.8G  43.4G
    64K:   622K  56.3G  97.8G  34.0K  2.73G  68.6G   386K  32.4G  75.8G
   128K:   137M  17.1T  17.2T   139M  17.3T  17.4T   138M  30.0T  30.1T
   256K:      0      0  17.2T      0      0  17.4T    605   164M  30.1T
   512K:      0      0  17.2T      0      0  17.4T      0      0  30.1T

Pay attention to the column “Cum.” on the right side of the table. That shows how much room the small files are taking up in your pool on a cumulative basis. At this stage of the analysis, practically none of my files show a record size above 128k, since that is the block size that this pool was created with. It’s also why the sVDEV cutoff was set well below the pool record size.
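If you would rather not read the histogram by eye, the right-most “Cum.” column can be pulled out with a few lines of Python. This is only a sketch: it assumes the ten-column layout shown above, treats the K/M/G/T suffixes as binary multiples, and the names to_bytes and asize_cumulative are made up for illustration.

```python
UNITS = {"K": 2 ** 10, "M": 2 ** 20, "G": 2 ** 30, "T": 2 ** 40}


def to_bytes(text: str) -> float:
    """Convert a zdb size such as '4.04G' or '0' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def asize_cumulative(histogram_text: str) -> dict:
    """Map each block-size bucket to its cumulative asize in bytes."""
    result = {}
    for line in histogram_text.splitlines():
        fields = line.split()
        # Data rows have a 'size:' label followed by nine values;
        # header rows either have fewer fields or no trailing colon.
        if len(fields) == 10 and fields[0].endswith(":"):
            result[fields[0].rstrip(":")] = to_bytes(fields[-1])
    return result


# Three rows copied from the histogram above
SAMPLE = """
   size   Count   Size      Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   422K   211M   211M   422K   211M   211M      0      0      0
     4K:  1.92M  7.71G  8.21G   119K   554M  1.04G  1.01M  4.04G  4.04G
    32K:   368K  16.5G  41.5G  1.80M  59.3G  65.9G   585K  18.8G  43.4G
"""
for bucket, size in asize_cumulative(SAMPLE).items():
    print(f"{bucket:>4}: {size / 2 ** 30:.2f} GiB cumulative")
```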

Once you have your small file distribution, you need to determine where to set your cutoff re: what small files to send to the sVDEV vs. the general pool. I chose 32kB. The 32K bucket alone consumes 18.8GB, but every block at or below the cutoff lands on the sVDEV, so the cumulative figure (43.4GB) is the one that matters. For example, if my pool is 23% full and small files make up 43.4GB as shown above, then a full pool would potentially need 43.4GB / 0.23, i.e. about 189GB for small files.

The minimum sVDEV size is thus dictated by the metadata needs (100GB) and by the small file cutoff; at higher cutoffs, small files will usually dictate the sVDEV capacity needed. Remember the default 75% small file / 25% metadata ratio? It means that 100GB of metadata needs suggests a 100 / 0.25 = 400GB minimum capacity sVDEV.

The small file cutoff further dictates the minimum sVDEV capacity, i.e. at 23% pool fill and a 75% small file share:

block
 size   Cum.     Small File Space Needed   Min sVDEV size (@ 75/25%)
   4K:  4.04G    17.6GB                    23.4GB
   8K:  18.6G    80.9GB                    108GB
  16K:  24.5G    107GB                     142GB
  32K:  43.4G    189GB                     252GB
  64K:  75.8G    330GB                     439GB
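The same arithmetic can be scripted so you can replay it with your own histogram. A minimal Python sketch, assuming linear growth with pool fill and the default 75% small file share; min_svdev_gb is a made-up helper name:

```python
def min_svdev_gb(cum_small_gb: float, pool_fill: float,
                 small_file_share: float = 0.75) -> tuple:
    """Return (small file space at 100% pool fill, minimum sVDEV size).

    cum_small_gb     cumulative asize at the chosen cutoff, from the histogram
    pool_fill        current pool fill as a fraction (0.23 = 23%)
    small_file_share fraction of the sVDEV available to small files
    """
    full_pool_need = cum_small_gb / pool_fill
    return full_pool_need, full_pool_need / small_file_share


# Cumulative asize values from my histogram above
cumulative_asize_gb = {"4K": 4.04, "8K": 18.6, "16K": 24.5, "32K": 43.4, "64K": 75.8}

for cutoff, cum in cumulative_asize_gb.items():
    need, svdev = min_svdev_gb(cum, pool_fill=0.23)
    print(f"{cutoff:>4}: {cum:5.2f}G -> {need:6.1f} GB -> {svdev:6.1f} GB")
```

Changing small_file_share lets you see what a different zfs_special_class_metadata_reserve_pct setting would do to the numbers.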

Based on the discussion above and rounding up, the minimum sVDEV capacity should be around 500GB of enterprise-quality SSD, which in turn suggests that I could set the small file cutoff at 64K and be fine. That said, you can play with the ratios of small file to metadata allocation and tune the ratio to your use case.

However, you might find that once you adjust record sizes, the mix of block sizes in your pool changes and more data could be stored in the sVDEV. If you suspect that you will host more small files in the future, you would do well to install a commensurately larger sVDEV to account for that expansion.

sVDEV drive selection

At minimum, choose SSDs that can handle the workload and have a proven track record re: endurance. Power Loss Protection (PLP) is also a nice-to-have, though that can get expensive. I went for Intel S3610 SATA SSDs that (though used) have incredible write endurance. I also bought two spares. I qualified all my SSDs before using them and set aside the spares in caddies, ready to mount if one of the four SSDs I use for a sVDEV develops a failure and needs to be replaced.

Because a loss of the sVDEV will result in a pool loss, you will want to mount sVDEV SSDs in mirrors, at minimum in a 2-way. I use a four-way mirror for my Z3 pool.

Implementation

The sVDEV can be enabled / attached / assigned at the GUI level, see here. Make sure all the SSDs you need are in the sVDEV, set up as mirrors, etc.

The small file record size cutoff can be set via the GUI - select “Storage” then “Pools”, followed by going over the pool/dataset table. The pool or individual dataset small file cutoff can be adjusted by opening “edit option” (see three dots to the right of each pool or dataset name in the menu) and adjusting the “Metadata small block size”. If you do it for the pool, the datasets will inherit the pool value by default. You can also do this via the CLI, per this Level 1 post, the command is

zfs set special_small_blocks=64K poolname

or if you want to vary small file cutoffs on a dataset basis:

zfs set special_small_blocks=128K poolname/dataset

If you are about to enjoy a new pool of content, congratulations, you are pretty much done. If you have an extant pool, the small-files benefit won’t really start to accrue until you move the small files onto the sVDEV.

This is also a great time to brush up on recordsize and to determine if you should adjust the record sizes of particular datasets to better suit the data in them - you can do this in the GUI by selecting Storage → Pools and then individual datasets. Lastly, there may be some small datasets (think iocage jails) where likely all data should reside on the sVDEV. Go ahead and tune the small file cutoff by dataset to your liking.

There is no GUI rebalancing command, though we may get one if ZFS VDEV expansion ever becomes a reality in TrueNAS. In the meantime, there is an excellent script written in bash that can do the rebalancing for you. Before you rebalance, back up your data, turn off snapshot and replication tasks.

I also deleted all snapshots since it becomes very difficult to see the impact of record size changes and rebalancing efforts if past snapshots pollute the data. By design extant snapshots remain static even as the rest of the data in the pool is manipulated by the rebalancing script I referenced above. Snapshot deletion can be tedious and difficult (a reboot may be required) and some snapshots related to iocage and like folders may be off-limits (it’s OK to ignore that one).

In TrueNAS Core, you have to “su” to become the owner of the data before you run the script. I’d also advise running the session inside tmux so the command can keep executing across the whole dataset even if your connection drops. After installing the script in my root folder, my steps are as follows:

cd /mnt/pool/dataset

ls -al (to see who the owner is)

tmux

su owner

bash /root/zfs-inplace-rebalancing.sh --checksum true --passes 1 /pool/path/to/rebalance

then CTRL-B and then “d” to detach. Ideally, do this using SSH to connect to the NAS as the Shell in the TrueNAS 13.0 Core environment is buggy. As of TrueNAS 13.3+, the GUI shell will be removed altogether.

The rebalancing script will run for a while, and as it does, you’ll watch the pool start to redistribute data, especially if you have also tuned the recordsize to be larger for datasets filled with images, archives, or video content. Hopefully, you’ll see the big sets of data migrate to larger recordsizes, allowing you to fine tune the cutoff limits for the sVDEV.

You may have to run the rebalancing script a few times. Only when your pool is “perfect” should you turn both Snapshot and Replication tasks back on. I would also hit the “Add” button in the Storage / Snapshot GUI at this point to manually execute a snapshot. Replication tasks can only start if there is an extant snapshot. Then go to Tasks / Replication and “Run Now” your replication tasks to verify they’re still OK.

Maintenance

Sadly, the current Core dashboard gives ZERO clues re: the sVDEV other than whether the disks are still available. For example, a basic check to see how full the sVDEV is requires CLI commands (i.e. use zpool list -vvv poolname to see how full each VDEV is; thank you, @NickF1227!).

I hope that iXsystems can add a small pane for the sVDEV to supplement the main pool dashboard. It would be helpful for the sysadmin to be warned as the sVDEV reaches its recommended maximum fill (75%) just as it would be ideal if the administrator is warned when a regular pool reaches 80% fill.

Some other considerations

If you made changes to record sizes, etc. for an extant pool, I would ensure that any remote replication target has the exact same settings re: recordsize, compression, sVDEV small file cutoff (if fitted), etc. If your remote NAS already has content, I’d either wipe and start over (safest option) or, as a less safe option, nuke all remote snapshots, change the recordsizes, etc. to match the primary NAS, rebalance, and only then re-enable replication.

So what though?

Well, I have found the implementation of sVDEV on my pool to be an improvement over L2ARC in terms of metadata, even if a L2ARC is “hot”, persistent, and metadata-only (the L2ARC only really helps for read operations re: metadata, not writes, which have to go to the slow HDD pool). Between sVDEV, recordsize=1M, compression=zstd, and a rebalance, my sustained large-file pool write speed has risen to 400MB/s, well above the 250MB/s limit I used to experience. The

zdb -LbbbA -U /data/zfs/zpool.cache poolname

command now executes in less than two minutes and the amount of small files stored in the sVDEV has shot up considerably. Meanwhile, my NAS’ metadata needs dropped to just 0.03% of the pool.

17.0T completed (331016MB/s) estimated time remaining: 0hr 00min 00sec
        bp count:              13671449        ganged count:                 0
        bp logical:      12053360033792      avg: 881644
        bp physical:     11620838932480      avg: 850007     compression:   1.04
        bp allocated:    18659308601344      avg: 1364837     compression:   0.65
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        Normal class:    18553586663424     used: 23.21%
        Special class      106690252800     used:  6.68%
        Embedded log class              0     used:  0.00%

        additional, non-pointer bps of type 0:        131
         number of (compressed) bytes:  number of bps

and

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     24K     12K    4.00     0.00  object directory
     1    32K      8K     24K     24K    4.00     0.00      L1 object array
    77  38.5K   38.5K    924K     12K    1.00     0.00      L0 object array
    78  70.5K   46.5K    948K   12.2K    1.52     0.00  object array
                          ------ SNIP! ------
  586K  18.3G   2.61G   5.23G   9.15K    7.00     0.03      L1 Total
 12.5M  10.9T   10.6T   17.0T   1.36M    1.04    99.97      L0 Total
 13.0M  11.0T   10.6T   17.0T   1.30M    1.04   100.00  Total

Note how between record-size changes and rebalancing, the ASIZE L1 has dropped by 4x from 23GB to 5.23GB. In other words, thanks to tuning record sizes to better reflect the content being hosted, my system now has to keep track of ~4x less metadata than it did before I adjusted record sizes for my data sets and rebalanced the pool.

That step also has implications for the sVDEV, i.e. my pool now needs 4x less room on the sVDEV for metadata! Less I/O for the metadata during writes = a faster NAS, all things being equal.

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   238K   119M   119M   238K   119M   119M      0      0      0
     1K:  80.2K  99.0M   218M  80.2K  99.0M   218M      0      0      0
     2K:   126K   334M   552M   126K   334M   552M      0      0      0
     4K:   849K  3.36G  3.90G  97.3K   526M  1.05G   458K  1.79G  1.79G
     8K:  96.8K   914M  4.80G  60.3K   662M  1.70G   916K  7.29G  9.08G
    16K:   135K  2.42G  7.21G   229K  3.86G  5.56G  63.5K  1.28G  10.4G
    32K:   197K  9.10G  16.3G   701K  23.6G  29.1G   267K  11.1G  21.5G
    64K:   144K  11.8G  28.2G  31.6K  2.49G  31.6G   135K  11.0G  32.5G
   128K:   561K  75.9G   104G   620K  79.5G   111G   566K   119G   152G
   256K:   185K  67.5G   172G  57.1K  20.3G   131G   173K  63.0G   215G
   512K:   303K   214G   386G  58.3K  40.9G   172G   105K  85.4G   300G
     1M:  10.2M  10.2T  10.6T  10.8M  10.8T  11.0T  10.4M  16.7T  17.0T
     2M:      0      0  10.6T      0      0  11.0T      0      0  17.0T

So here I am showing the dramatic shift in my pool from files in the 128k block bucket to 1M. My sVDEV could accommodate small files right up to 512k, but I wouldn’t do that since I want to install some Apps, etc. and I want to tune their datasets to use the sVDEV for stuff that needs fast storage. At present the sVDEV is just under 7% full and the main pool is at 23%, suggesting ample room for growth.
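Since the dashboard does not report sVDEV fill, the class lines at the bottom of the zdb summary can be checked with a short script. A minimal Python sketch, assuming the line layout shown in my output above (other OpenZFS versions may format these lines differently) and the 75% / 80% fill guidelines mentioned under Maintenance; class_usage and fill_warnings are made-up names:

```python
import re

# Matches lines such as "Special class      106690252800     used:  6.68%"
CLASS_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z ]+?) class:?\s+(?P<bytes>\d+)\s+used:\s+(?P<pct>[\d.]+)%"
)


def class_usage(zdb_summary: str) -> dict:
    """Map allocation class name -> (allocated bytes, percent used)."""
    usage = {}
    for line in zdb_summary.splitlines():
        match = CLASS_LINE.match(line)
        if match:
            usage[match.group("name").strip()] = (
                int(match.group("bytes")),
                float(match.group("pct")),
            )
    return usage


def fill_warnings(usage: dict, special_limit: float = 75.0,
                  normal_limit: float = 80.0) -> list:
    """Return the classes that have reached their recommended maximum fill."""
    limits = {"Special": special_limit, "Normal": normal_limit}
    return [name for name, limit in limits.items()
            if name in usage and usage[name][1] >= limit]


# Class lines copied from the zdb summary above
SUMMARY = """\
        Normal class:    18553586663424     used: 23.21%
        Special class      106690252800     used:  6.68%
        Embedded log class              0     used:  0.00%
"""
usage = class_usage(SUMMARY)
print(usage["Special"])
print(fill_warnings(usage))
```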

Another Example from @Fastline

So that was my example. Let’s review minimums and maximums re: sVDEV capacity for @Fastline. He posted the diagnostic data we need to come to an informed decision based on his current pool, which is about 53% full. I have truncated some of the previous content for brevity.

As you may recall, the default sVDEV allocation is 25% for metadata and 75% for small files. So, unless we change the ratio, we have to figure out minimums based on the metadata and small files separately and choose the higher of the two. The small file cutoff should then be set commensurate with sVDEV capacity.

Metadata
Pay attention to the L1 Total block below. It shows 12.9G in the PSIZE column and 35.8GB in the ASIZE for total metadata used, or 0.09% of total pool capacity (second-to-last column). The conservative approach is to use ASIZE for metadata calculations.

Suggestion: Better sVDEV Planning, Oversight Tools in GUI

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -        -  unallocated
     2    32K      8K     48K     24K    4.00     0.00  object directory
... (snip!)...
 31.1K  1005M    137M    530M   17.0K    7.32     0.00      L2 Total
 1.23M  39.4G   12.9G   35.8G   29.1K    3.05     0.09      L1 Total
  260M  32.2T   31.3T   39.1T    154K    1.03    99.91      L0 Total
  261M  32.3T   31.3T   39.2T    153K    1.03   100.00  Total

Given that the pool could grow (though it is not recommended to fill a pool to 100%, ever!) that puts the potential metadata needs at around 35.8 / 0.53 = 67.5GB. At the default 75%/25% ratio of small file vs. metadata sVDEV allocation, that suggests a sVDEV with at least 270GB of capacity, i.e. 35.8GB / 0.53 (pool fill) / 0.25 (metadata share of the sVDEV).

Small Files

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   130K  64.8M  64.8M   130K  64.8M  64.8M      0      0      0
     1K:   117K   148M   213M   117K   148M   213M      0      0      0
     2K:  84.6K   216M   429M  84.6K   216M   429M      0      0      0
     4K:  5.83M  23.4G  23.8G  54.7K   302M   732M      0      0      0
     8K:  2.34M  24.0G  47.8G  59.4K   654M  1.35G  5.72M  45.7G  45.7G
    16K:   431K  8.90G  56.7G   198K  3.74G  5.09G  1.91M  31.7G  77.4G
    32K:   726K  32.7G  89.4G  1.33M  43.6G  48.6G  1.65M  60.0G   137G
    64K:  2.38M   222G   311G  57.6K  5.32G  54.0G  1.65M   161G   298G
   128K:   248M  31.0T  31.3T   258M  32.2T  32.3T   249M  38.9T  39.2T
   256K:      0      0  31.3T      0      0  32.3T    131  40.4M  39.2T
   512K:      0      0  31.3T      0      0  32.3T      0      0  39.2T

@Fastline is using mostly default settings for record size, so there are almost no files being stored in blocks above 128k. The next question is where to set the small file recordsize cutoff, which will likely dictate total sVDEV capacity need.

For example, the 64k bucket currently consumes a cumulative 298G, suggesting a sVDEV capacity of 298GB / 0.53 (to account for remaining pool capacity) / 0.75 = 749 GB sVDEV capacity - which I would round up to 768GB or 1TB enterprise SSDs.

If @Fastline sets the small file cutoff at 32k, then the necessary sVDEV SSD capacity is basically cut in half → 137GB / 0.53 / 0.75 ≈ 345GB, which rounds up to ~500GB in common SSD capacities.

Bottom line, @Fastline’s SSD sVDEV needs to have at least 270GB of capacity to hold all the metadata, and at that size the small file cutoff would have to be no more than 16k. To store more small files in the sVDEV or to allow for App and VM growth, use a higher capacity sVDEV.
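The reasoning can also be run in reverse: given a sVDEV capacity, which cutoff still fits? A minimal Python sketch using @Fastline’s cumulative asize values from the histogram above, assuming linear growth with pool fill and the default 75% small file share; largest_cutoff_that_fits is a made-up helper name:

```python
def largest_cutoff_that_fits(cum_asize_gb: dict, pool_fill: float,
                             svdev_gb: float, small_file_share: float = 0.75):
    """Return the largest bucket whose extrapolated small files fit the sVDEV.

    cum_asize_gb must be ordered from the smallest to the largest bucket,
    as in the zdb histogram. Returns None if not even the smallest fits.
    """
    budget_gb = svdev_gb * small_file_share
    best = None
    for cutoff, cum in cum_asize_gb.items():
        if cum / pool_fill <= budget_gb:
            best = cutoff
    return best


# Cumulative asize per bucket at 53% pool fill
fastline = {"8K": 45.7, "16K": 77.4, "32K": 137.0, "64K": 298.0}

for capacity in (270, 500, 1000):
    print(capacity, "GB ->", largest_cutoff_that_fits(fastline, 0.53, capacity))
```

With these inputs the sketch lands on 16K for 270GB, 32K for 500GB, and 64K for 1000GB, which lines up with the estimates above.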

I hope that helps. Good luck!

NickF1227 · May 24, 2024, 11:50pm

Well done sir!

Constantin · May 25, 2024, 12:16am

nani gigantum humeris insidentes - Bernard of Chartres

Thank you for your help.

dxun · June 9, 2024, 5:18pm

A wonderful, thoughtful and exceptionally useful post. If this isn’t pinned already, it ought to be.

Thank you for this extraordinarily useful aggregation of sVDEV info.

JohnDoe · December 23, 2024, 11:31am

Would it be possible to reduce the size of a special vdev? vdevs can’t be reduced though, that I know of.

Constantin · December 23, 2024, 11:57am

You can change the small file vs. metadata allocation but I don’t think you can adjust the size of the sVDEV itself. I have never tried to remove a sVDEV from my pool, it’s unlikely that good things would happen.

Stux · December 23, 2024, 1:01pm

I think you can remove svdevs if you have no raidz/draid vdevs.

JohnDoe · December 23, 2024, 5:12pm

You can’t remove it, since the db and small files could also be on there.

I added drives that are too large to my sVDEV and they are now mostly unused.

Constantin · December 23, 2024, 5:27pm

Consider changing the small file cutoff in your data shares so that more small stuff is moved to the sVDEV. You may need to rebalance.

Krill · December 24, 2024, 12:34pm

Stupid question: If the sVDEV is a three-way mirror of 2TB drives with minimal storage used, can you force replace with a smaller drive, i.e. a 1TB? The intent would be to replace each drive to drop the size of the sVDEV, rather than removing the sVDEV.

I mean, I could test this myself, but it avoids having to offline a server and saves an hour if someone else knows the answer so thought best to ask.

HoneyBadger · December 24, 2024, 2:06pm

No, you can’t replace drives with smaller ones - svdev still has to follow this rule.

If you have the ability to remove the special vdev entirely (i.e. you aren’t using RAIDZ) then you could build a new svdev of the 1T drives first and then remove the 2T one. That would allow the metadata to flow directly there and not back to the main pool.

But if you’re using RAIDZ you’re restricted to your current setup.

winnielinnie · December 24, 2024, 2:11pm

You actually can, as mentioned by @Stux and @HoneyBadger.

Ditto on this. If your pool only consists of mirrors (or stripes), then even a special vdev can be removed, in which case its existing data will be evacuated to the other vdevs (mirrors, stripes) before removal.

Once again, “Mirrors for the win!”

Krill · December 24, 2024, 2:49pm

Thanks for the confirmation. But isn’t it restricted to the current setup, unless increasing the mirror width or increasing drive capacity?

Once again, “Mirrors for the win!”

Sure, if you have an extra $3,000 lying around (compared to what I have in place). For that amount of money one could buy enough drives to create an entire new storage pool including a new svdev.

HoneyBadger · December 24, 2024, 8:29pm

Iff (If and Only If) you have mirrors in place, then you have the flexibility described - removing a vdev is possible, so you can create a second special vdev consisting of the 1T drives, then zpool remove through the webUI the first one with the 2T drives.

The presence of raidz anywhere in your pool makes this impossible.

nowm · December 31, 2024, 7:33pm

Is there any performance drawback, except the cost, if we use say 16TB svdev on a pool of 96TB data vdev? Will there be any memory constraints applied to svdev?

Constantin · December 31, 2024, 8:07pm

Zero issues re memory. Your sVDEV will simply deal with metadata and small file read/write requests at SSD speeds (assuming you use a SSD for the sVDEV).

How large to size the sVDEV depends on the intended workload. For example, by default TrueNAS will set aside 25% of the sVDEV for metadata, and 75% for small files.

Based on your research and use case, it might make sense to change that, especially when a sVDEV is as large as the one you are proposing relative to the total HDD pool size.

You can certainly make excellent use of the sVDEV by creating datasets that benefit from fast operations like VMs, databases, etc. by intentionally making their dataset recordsize smaller than the small file cutoff for sVDEV.

nowm · December 31, 2024, 9:03pm

Thanks.
Further, I’m wondering: if the SLOG is a SATA SSD, the data vdevs are 8x 7200rpm HDDs, and the svdev is NVMe, will the SATA SSD SLOG

1. slow down the small file writes to the NVMe svdev?
2. still be beneficial for large files that are written to the 8x 7200rpm HDDs?

I believe it will reduce the incoming write performance if the network bandwidth is large enough, say 40/56GbE?
Will the SLOG serialize a large number of random writes into sequential writes to the spinning drives?
As such, if the svdev is a large NVMe, should we drop the SLOG for better performance?

Constantin · January 1, 2025, 1:05am

Apologies. Have a house full of friends here. So I will keep it short and try to answer in the AM.

Slog and sVDEV are two very different creatures. One is important for sync writes, the other is for metadata and small files. Use different drives / systems - ie Optane for SLOG and SSD for sVDEV.

[Edit: With the exception of sync writes, ] SLOG does zero unless your system has an unintended shutdown. Never used unless that happens. Otherwise, the data is kept in ram and written in 5s increments, IIRC.

sVDEV is used every time you read / write to and from pool.

Stux · January 1, 2025, 2:35am

Sync data is always written asap, to a ZIL, whether in the main pool, or a Separate Log device (SLOG) then iirc, actually written as part of the TXG.

Point is SLOGs speed up sync writes. They don’t speed up normal writes.

A SLOG drive should have very fast sync write performance, ie Optane used to be the best choice.

dxun · January 1, 2025, 4:34am

… I’m wondering that if the slog is SATA SSD…

To reiterate as others have pointed out - SLOG has a chance of accelerating synchronous writes only.

Moreover, in accelerating such writes, a very important detail is the presence of PLP (Power-loss Protection) feature on the SLOG drive - without it, the SLOG SSD drive will behave basically like a platter drive. This means that you should exclusively be looking at enterprise (non-consumer) SSD drives for SLOG (with a few notable exceptions, e.g. Intel SSD 320)

It is ok to be surprised with this, I certainly was when I first realised this. This post explains this fact well - basically, a stout consumer MLC SSD (Samsung 860 PRO) was yielding around 900 IOPS of sync writes which is really not that much more than, say a 15 k Velociraptor HDD.

Thus, pick your SLOG drives carefully - Optane is by far the safest choice these days.