Pool replace/expand and record size change: lower capacity than expected

I recently replaced an 8x 1TB vdev with 8x 8TB. The dataset assigned to this vdev was created with a 128K record size, but I changed it to 1M before replacing the drives. According to the online TrueNAS RAID calculator, my 8x 8TB with a 1M record size should give me about 2 TiB more than I currently show. Is there anything I can do about this without starting completely from scratch? Even if I do start from scratch, is there a way to define the record size when creating a vdev from the GUI? I cannot seem to find one; it appears to be tied to the dataset.
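
For reference, the record size change itself was just the dataset property. From the shell it would be something like the following (the pool/dataset names here are only placeholders for my actual layout):

    zfs get recordsize tank/media        # check the current value
    zfs set recordsize=1M tank/media     # only caps records for newly written data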

Based on this post, is this just a GUI capacity estimate based on the default 128 KiB record size?

[Screenshot: my current expanded pool]

[Screenshot: expected capacity with 8x 8TB (7.27 TiB) and the default 128 KiB record size]

[Screenshot: expected capacity with 8x 8TB (7.27 TiB) and a 1 MiB record size]

OS Version: TrueNAS-SCALE-24.04.1.1
Product: Standard PC (i440FX + PIIX, 1996)
Model: QEMU Virtual CPU version 2.5+
Memory: 6 GiB

I wonder if you need to turn on zstd compression and then run a rebalancing script. Is the data small files, databases, virtual machines, etc., or images, movies, and the like? Large record sizes are only beneficial for large files.

So could you verify in the Datasets GUI that the recordsize is what you expect and that compression is turned on? After that, I’d run the rebalancing script; see here.
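
If you prefer the shell, something like this should confirm both (the dataset path is just an example):

    zfs get recordsize,compression tank/media    # verify the current settings
    zfs set compression=zstd tank/media          # only affects newly written data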

Most of the storage is Linux ISOs and similar media, but it is littered with small metadata files as well. It is also used for Proxmox Backup Server, so there are larger files there too.

It looks like the new record size is effective for new files based on the histogram shown in the linked article, though it would probably be best to run that rebalancing script too.

Thanks for the pointer!

Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   123K  61.6M  61.6M  3.82K  1.91M  1.91M      0      0      0
     1K:  23.4K  24.3M  85.9M  23.8K  24.5M  26.4M      0      0      0
     2K:  45.6K   155M   240M  4.34K  11.7M  38.2M      0      0      0
     4K:   185K   744M   985M  5.76K  28.8M  67.0M      0      0      0
     8K:   171K  1.94G  2.90G  3.22K  34.8M   102M  30.1K   361M   361M
    16K:  42.1K   926M  3.80G   173K  2.72G  2.82G   371K  8.69G  9.04G
    32K:   102K  4.73G  8.53G   296K  9.31G  12.1G   228K  10.7G  19.8G
    64K:   349K  31.4G  39.9G  5.38K   507M  12.6G   209K  20.6G  40.4G
   128K:  37.3M  4.66T  4.70T  37.8M  4.72T  4.73T  37.4M  6.58T  6.61T
   256K:  15.0K  5.54G  4.70T  5.91K  2.17G  4.73T  13.2K  4.85G  6.62T
   512K:  22.7K  16.3G  4.72T  11.1K  8.36G  4.74T  20.5K  15.3G  6.63T
     1M:   190K   190G  4.90T   224K   224G  4.96T   200K   265G  6.89T
     2M:      0      0  4.90T      0      0  4.96T      0      0  6.89T
     4M:      0      0  4.90T      0      0  4.96T      0      0  6.89T
     8M:      0      0  4.90T      0      0  4.96T      0      0  6.89T
    16M:      0      0  4.90T      0      0  4.96T      0      0  6.89T
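
For anyone who wants to reproduce a histogram like the one above, something along these lines should work (the pool name is just a placeholder):

    zdb -Lbbbs tank    # -bbb prints block statistics, including the block size histogram;
                       # -L skips leak detection so it finishes faster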

You could also consider creating separate datasets for the different data sizes for further optimization, instead of placing everything in a single-size data ‘bucket’.
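
A rough sketch of what that could look like (dataset names and recordsize values are only examples):

    zfs create -o recordsize=1M tank/media       # large sequential files (ISOs, video)
    zfs create -o recordsize=128K tank/backups   # more mixed or smaller files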


Most of the metadata files are intermingled with the larger files, but creating additional unique datasets for other uses in the pool is probably smart regardless.

Doesn’t record size only dictate the maximum record size, though? My research concluded that as long as you’re not running a database with a specific record size requirement, bigger is better, especially if you have a lot of large files. I don’t know if I could find it again, but I also read that smaller files use smaller records, so ZFS doesn’t waste space storing small files under a large record size.
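
A quick, throwaway way to sanity-check that claim (the dataset name is made up, and on RAIDZ the reported size will include some parity/metadata overhead):

    zfs create -o recordsize=1M tank/rs-test
    dd if=/dev/urandom of=/mnt/tank/rs-test/small.bin bs=16K count=1
    zpool sync tank                            # flush the pending transaction group
    du -h /mnt/tank/rs-test/small.bin          # expect roughly 16K plus overhead, not 1M
    zfs destroy tank/rs-test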


Larger blocks can improve the efficiency of read and write operations for large files, but a large block size can also lead to increased fragmentation and wasted space if there are many small files.
Linux Journal recently published a decent article on Linux filesystems, block sizes, and data structures.
https://www.linuxjournal.com/content/understanding-linux-filesystems-inodes-block-sizes-and-data-structures


Correct.

I try to demystify it here.


Reply to @PhilD13

I could be wrong, but I believe block size and record size are different things. Block size can be defined at pool creation (with ashift?), while record size is set per dataset and can be updated to affect future writes. Some sources use the two terms interchangeably, but I don’t know whether that’s done correctly most of the time.
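
To illustrate the distinction as I understand it (the disk names and pool layout below are purely hypothetical):

    # ashift (the on-disk sector size, as a power of two) is fixed per vdev at creation time:
    zpool create -o ashift=12 tank raidz2 sda sdb sdc sdd sde sdf sdg sdh
    # recordsize is a per-dataset property and only caps the size of newly written records:
    zfs set recordsize=1M tank/media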

https://www.usenix.org/system/files/login/articles/login_winter16_09_jude.pdf describes that “Writing a 16 KB file should take up only 16 KB of space (plus metadata and redundancy space), not waste an entire 128 KB record.”

Regardless, this is outside the scope of the question: I am wondering why the GUI is showing a lower total capacity than expected. Unless this is just an estimate using the default 128 KiB record size, I want to make sure I set everything up correctly.

Reply to @winnielinnie
Am I correct in my understanding of record sizes, or in thinking that the GUI only shows an estimate of total capacity based on the default 128 KiB record size?

Also, thanks for the write-up; it’s really insightful and solidifies my understanding of record sizes and how they affect datasets. You should consider putting that info into a post under the ‘Resources’ section of the forum for other people to reference!

I’m not sure how the GUI and/or ZFS estimates come up with this number for a pool comprised of RAIDZ vdevs.

You have to remember: A pool can be comprised of a mixture of different vdevs (mirror, RAIDZ1, RAIDZ2, RAIDZ3), and furthermore, datasets in the pool can have different recordsize policies.

So how could the GUI / middleware / ZFS accurately land on the same numbers as the RAIDZ calculator? There are too many variables that can dynamically change throughout the life of a pool.

You may be correct that it just assumes 128-KiB recordsize for everything.
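
As a back-of-the-envelope illustration of why the recordsize assumption matters on RAIDZ, here’s a rough padding calculation. It assumes an 8-wide RAIDZ2 with ashift=12 (4 KiB sectors), which is only my guess at your layout:

    # parity sectors = ceil(data_sectors / data_width) * nparity,
    # then the allocation is padded up to a multiple of (nparity + 1)
    data_width=6; nparity=2; pad=$((nparity + 1))
    for rs_kib in 128 1024; do
        data=$((rs_kib / 4))                                          # data sectors per record
        parity=$(( (data + data_width - 1) / data_width * nparity ))
        alloc=$(( (data + parity + pad - 1) / pad * pad ))
        echo "recordsize=${rs_kib}K: ${data} data sectors out of ${alloc} allocated"
    done

With those assumptions, 128 KiB records come out to 32 of 45 sectors (about 71% efficiency), while 1 MiB records come out to 256 of 342 (about 75%), which would be in the same ballpark as the roughly 2 TiB gap you’re describing.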

That’s an excellent point, thanks. I’ll just assume that for now.