Recommended recordsize for my setup?

Ok, this is going to be long-winded, and more than likely I’m just overthinking things, but I would like some input and recommendations on this.

Because of the number of files housed on my TrueNAS box (7M+), as well as the amount of data (30TB+), I’m planning on redoing some settings, and possibly adding an sVDev focused primarily on housing metadata rather than small files (increasing the percentage of metadata on the sVDev), to help reduce backup times for my pool (I’m building a second TrueNAS strictly to mirror [back up] the primary NAS).

After reading through special-vdev-svdev-planning-sizing-and-considerations, I’m still unsure what recordsize would work best on my system. The data is almost 100% fixed; no VMs or databases run off the NAS, so I shouldn’t need a small recordsize, but because of the number of “small” files in the mix, I also shouldn’t use too large a recordsize? :man_shrugging:

This is what ZFS outputs for my data with the default 128K recordsize:

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:  13.3K  6.65M  6.65M  13.3K  6.65M  6.65M      0      0      0
     1K:   445K   460M   466M   445K   460M   466M      0      0      0
     2K:  41.0K   102M   568M  41.0K   102M   568M      0      0      0
     4K:  5.55M  22.5G  23.1G   136K   889M  1.42G      0      0      0
     8K:  3.09M  35.3G  58.4G  1.64M  19.6G  21.0G   537K  4.19G  4.19G
    16K:  1.19M  25.1G  83.5G  1.16M  20.0G  41.1G  7.84M   134G   138G
    32K:  2.33M   108G   192G  5.49M   180G   221G  2.99M   117G   255G
    64K:  5.48M   513G   705G   295K  27.7G   249G  4.19M   396G   650G
   128K:   236M  29.6T  30.3T   245M  30.7T  30.9T   239M  37.3T  37.9T
   256K:      0      0  30.3T      0      0  30.9T    289  96.4M  37.9T
   512K:      0      0  30.3T      0      0  30.9T      0      0  37.9T
     1M:      0      0  30.3T      0      0  30.9T      0      0  37.9T
     2M:      0      0  30.3T      0      0  30.9T      0      0  37.9T
     4M:      0      0  30.3T      0      0  30.9T      0      0  37.9T
     8M:      0      0  30.3T      0      0  30.9T      0      0  37.9T
    16M:      0      0  30.3T      0      0  30.9T      0      0  37.9T

This is also the breakdown of file sizes in my pool. Logical = count of files by actual file size; Physical = count of files by the block size they occupy in the pool:

Size   Logical Count	Physical Count
0-512	       11234	4282
1K	            4258	0
2K	            8363	0
4K	           14104	0
8K	          187497	40934
16K	         1736694	1034971
32K	          359377	1145562
64K	          308586	355686
128K	      302034	313125
256K	      634766	600854
512K	     1108517	1133355
1M	         1142061	1171683
2M	          749231	764100
4M	          441514	444363
8M	          279276	279415
16M	          151689	151501
>16M	      141638	141008


If this last chart is confusing, using the 8K line as an example:
There are 187497 files whose actual (Logical) file size is between 4KB+1 and 8KB.
There are 40934 files that occupy an 8KB block of Physical space in the pool.
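For anyone wanting to reproduce this kind of breakdown, here is a minimal sketch of how such a chart could be generated (the power-of-2 bin boundaries are my assumption of how the table above was built, not the exact tool used):

```python
import os

def size_bin(size: int) -> int:
    """Return the power-of-2 bin a file size falls into (512 B minimum)."""
    b = 512
    while b < size:
        b *= 2
    return b

def histogram(root: str) -> dict[int, int]:
    """Count files under `root` by logical-size bin, like the chart above."""
    hist: dict[int, int] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # skip unreadable entries
            b = size_bin(size)
            hist[b] = hist.get(b, 0) + 1
    return hist
```

For example, `size_bin(5000)` lands in the 8K bin, matching the “4KB+1 to 8KB” rule described above.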

If I am reading everything correctly, going by the Block Size Histogram, I have 37TB+ of data mainly in the 128K recordsize range, so it seems a larger recordsize would be better. But going by the file-size breakdown, I have 2M+ files out of fewer than 8M total that use less than the current 128K recordsize. So if I go too large, would I get poor performance and too much wasted space when dealing with those small files? :confused:

You need to think more about this. You might have the wrong preconceptions about what a special vdev requires and involves.

If you’re using ZFS replication, it’s unlikely to make any significant difference in replication times. (You might be thinking of file-based tools, such as rsync.)


Files that are smaller than the dataset’s recordsize property will be saved as a single block that can fit the entire file’s data. The block’s size will round up to the next power of 2.

If you have a dataset that is set to a recordsize of 16M, a 115-KiB file will be saved as a single 128-KiB block, not a 16-MiB block. (Compression will make this 128-KiB block physically take up only about 116-KiB on the disk anyway.)

Any files bigger than the recordsize will be saved as multiple blocks, with each block being the size of the recordsize. Compression will still shrink them, especially the last block that has padding on the end.
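The two rules above can be sketched numerically. This is my reading of the explanation, not the actual ZFS allocation code: a sub-recordsize file gets one block rounded up to the next power of 2, and a larger file gets ceil(size / recordsize) full-recordsize blocks (with compression later squeezing the padded tail):

```python
def logical_blocks(filesize: int, recordsize: int) -> tuple[int, int]:
    """Return (block_count, block_size) for a file's logical layout.

    Sub-recordsize files: one block, rounded up to the next power of 2.
    Larger files: ceil(filesize / recordsize) blocks of recordsize each.
    """
    if filesize <= recordsize:
        block = 512
        while block < filesize:
            block *= 2
        return 1, block
    count = -(-filesize // recordsize)  # ceiling division
    return count, recordsize
```

Plugging in the 115-KiB-file example from above with a 16M recordsize gives a single 128-KiB block, as described.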


Anything from 1M - 4M is an ideal recordsize for saving files that don’t have specific needs, in contrast to databases and VMs. You’ll minimize ZFS overhead, boost large file reads and transfers, and get the best compression efficiency for compressible files.

Currently, my backup server is running OpenMediaVault because a) I use a mix of drive sizes, and b) I don’t need or require block-based backups; file-based is perfectly fine and more than enough. The problem is the backup speed: the TrueNAS system takes a few minutes to go through its filetree/metadata (I do have a metadata-only L2ARC SSD), while the OMV backup server takes around 4 hours (and growing) to do the same on its end, which I can see is mainly due to the sheer number of files.

Since swapping the 16x4TB drives for 5x16TB drives in my TrueNAS box, I now have enough 4TB drives to build a second TrueNAS box as a dedicated backup for the main one. I figure, to keep backup speeds fast, I should use sVDevs on both systems to make reading the metadata/filetree as quick as possible, both for backup purposes and for my own use. The number of files is only going to keep growing on a regular basis, so an sVDev is something I would like to employ in my setup.

So does this mean that if I set the recordsize to 16M, ZFS would still make use of block sizes of 4K when needed? What if the file size (after compression) was 16.1M, or 16.2M? Is that where I would see wasted space, or is ZFS smart enough to use a single 16M block plus however many extra 4K blocks are needed to fit that file’s data?

Correct. A file that is 4-KiB in size will be comprised of a single 4-KiB block, even before compression.


ZFS would split the file into 2 blocks. The first block would be 16-MiB. The second block would also be 16-MiB, mostly null-byte padding.

With any inline compression enabled (LZ4 is the default, but you can use your preference), the second block will consume only about 200-KiB on disk and in the ARC, since the padding of null bytes at the end compresses from 15.8-MiB down to nothing.

The first block will likely consume the full 16-MiB, unless it contains compressible data, in which case it will be smaller than 16-MiB.

I advise against using 16-MiB, since it can backfire on performance. You’ll be less likely to take advantage of parallelization and multithreading. Somewhere from 1M to 4M is the best recordsize for performance, overhead, and storage efficiency.


RAM is king. Increase RAM and adjust the zfs_arc_meta_balance tunable accordingly before committing to a special vdev. You might find that it speeds up directory traversal enough that you don’t need a special vdev at all. If it doesn’t help, then you can try a special vdev.

:warning: Remember that you cannot remove special vdevs if any form of RAIDZ exists on the pool.

If it’s ZFS replication, then the above advice doesn’t really matter, since you’ll be using block-based transfers.

2 Likes

Kinda confusing me there. When ZFS goes to actually write the data to the pool, I understand it will compress it first before writing. What I asked was about the step after that: if the already-compressed data ends up taking more space than the largest block size (recordsize), what happens to the data beyond the 16MB block that is part of that file? Does it use up an entire second 16MB block, even if the remaining data that needs to be written is only a couple of kB? Or does it only use whatever block size it needs to house the remaining data? I.e., would a 16,128,000-byte file take up a 16MB block + a 128kB block, or two 16MB blocks?


RAM is also limited on the system I’m going to use as the “backup” server, to just 16GiB. It’s also not going to be on 24/7; it will turn on to run the backup, then turn off when complete until the next backup time, so ARC will be empty every time before the “backup” occurs. I will use whatever is fastest between two TrueNAS systems (rsync? Replication? The simple file copy using FreeFileSync that I currently use?), but if this requires the filetree metadata to already be cached in ARC to do a fast comparison, then the only way that will happen is with an sVDev on that old box. As I mentioned, I have a metadata-only L2ARC SSD on the main TrueNAS box (64GiB RAM in this one), and even with that the system can be slow displaying and accessing directories and files.

And I looked up zfs_arc_meta_balance; according to an open issue on GitHub, even with that setting in use in OpenZFS, metadata still gets evicted in favor of data. So an sVDev would probably be a better choice for keeping metadata on faster storage, as even a metadata-only L2ARC seems “slow” at times to me, which (if I’m reading ZFS’s output for my current setup correctly) is probably because the metadata is about as large as the L2ARC SSD and ARC space combined.

Blocks  LSIZE  PSIZE  ASIZE     avg    comp   %Total  Type
 5.13M   164G  26.7G  95.6G   18.6K    6.15     0.25      L1 Total
  249M  30.8T   30.2T   37.8T    155K    1.02    99.75      L0 Total
  255M  30.9T   30.3T   37.9T    153K    1.02   100.00      Total
 6.45M   180G   31.8G    120G   18.5K    5.65     0.31  Metadata Total

You have a file to be written that is 16.2M in size.

The recordsize is set to 16M.

ZFS constructs a 16M block, which will be the first block.

ZFS constructs a 16M block, which will be the second block.

Let’s assume the file has no compressible portions at all.

ZFS attempts to compress the first block, which has no effect. The first block is written and committed to disk. It consumes 16M on disk.

ZFS attempts to compress the second block, which it shrinks down to 200K. (This is because the 15.8M of null bytes at the end are compressed into nothing in an instant! Very fast.) The second block is written and committed to disk. It consumes 200K on disk.

The file is comprised of 2 blocks and consumes a total of 16.2M on disk. The same also applies to the ARC.
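The arithmetic of that worked example can be checked directly. A quick sketch, assuming the incompressible 16.2M file and the ~200K compressed tail described above:

```python
MiB = 1024 * 1024

recordsize = 16 * MiB
filesize = int(16.2 * MiB)

# Two logical 16M blocks are constructed for a 16.2M file.
block_count = -(-filesize // recordsize)  # ceiling division

# First block: incompressible, consumes the full 16M on disk.
first_on_disk = recordsize

# Second block: ~0.2M of real data plus ~15.8M of null padding,
# which compression collapses to roughly the real data's size.
second_on_disk = filesize - recordsize

total_on_disk = first_on_disk + second_on_disk
print(block_count, total_on_disk / MiB)  # 2 blocks, ~16.2 MiB on disk
```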

If you’re using any form of compression, you don’t need to worry about wasted space at the end of a file.


I still advise against using a recordsize of 16M, since you will lose out on parallelization.

If you have files that will be modified in-place, then you will also get “write amplification”. This means that simply changing a few bytes in-place will need to write a new 16M block. This isn’t really as big of a concern as people make it out to be, since most non-database software doesn’t do in-place modifications. They usually create a temporary copy of the file and then rename it to replace the original.
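As a rough illustration of that amplification (hypothetical numbers, assuming exactly one full block must be rewritten per in-place edit):

```python
def write_amplification(bytes_changed: int, recordsize: int) -> float:
    """Ratio of data rewritten to data actually changed when one
    full block must be rewritten for an in-place modification."""
    return recordsize / bytes_changed

# Changing 100 bytes in-place:
#   128K recordsize -> ~1311x amplification
#   16M recordsize  -> ~167772x amplification
```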


Nothing beats incremental replication. Nothing needs to be read or prepared before being transferred. ZFS will just send all blocks in the range of a sequential “birth TXG”.


If you really want to commit to a special vdev, you need to understand that it’s a one-way commitment. If your pool has any RAIDZ vdevs, you cannot remove a special vdev if you later change your mind. You also need to construct a special vdev with the same redundancy as your other vdevs, since it is considered integral to the pool’s health.

A persistent L2ARC with the dataset set to secondarycache=metadata is a safer option.

Remember that all of this is moot if you’re going to use ZFS replication instead of file-based tools.

1 Like

Is the issue still open?

There is a thread I started with, which leads to other threads, and more threads, etc., but they all appear to be unresolved/stale. This is the one I started with, and I kept following the various links within those posts: metadata caching does not work as expected - repeatedly getting lost in arc and l2arc with primary-secondarycache=metadata, though a lot of it is above my current knowledge of ZFS. It all seems to point to metadata still being evicted from ARC, contrary to whatever settings are used.

And I already understand the “dangers” of using an sVDev. I’ll be using a 3-way mirror, but I need to sort out the recordsize first to ensure all the metadata fits without filling the mirror beyond the advised 70%, just like a normal pool.

I believe it’s since been fixed in OpenZFS 2.2+. Much of the complaints were from 2.1 and earlier.

I’ve been quite happy with zfs_arc_meta_balance=2000, which has done a decent job of protecting metadata in the ARC from aggressive eviction. It’s not perfect, and I could increase the value higher, but it’s behaving for the most part.

I recently experienced this. Setting a 16M recordsize on a dataset used by an application that stores data in 16MB chunks seemed like a sane idea. What I failed to consider was that the application stored the data in a hash-based, content-addressable manner, which meant that data that appeared to be stored in proximity might actually be stored anywhere. Reading X amount of data could require reading any number of files at unpredictable offsets.

Because ZFS reads the whole block when it reads any part of it (in order to checksum it), performance was abysmal even on a mirror. When I realized this, I changed the recordsize to 128K and rewrote all the data. The difference was not trivial: I went from 1 hour to read random samples of data down to 1 minute.

3 Likes