Does anyone have a script that will analyse the size of files in a dataset, calculate the best record size, (optionally) set the dataset record size to that value and (optionally) rebalance any files in the dataset that need their record size changed?
So far no replies, so apparently no one has one or has heard of one.
Anyone with better bash skills than mine willing to create an analysis script, with a view to hacking it together with the rebalancing script to reach the end goal?
I’ve seen the first but not the other three parts of the puzzle.
The first was basically just something that counted the files and presented statistics of the file size distribution. You as the user were then left to determine what would be best for you, change the record size and move the data around.
I think it was posted in the old forum. I’m unable to be more specific than that.
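Not the script from the old forum, but a minimal sketch of that first piece (bucketing files by size so you can see the distribution) could look something like this. It assumes GNU find, as on TrueNAS SCALE, and the mountpoint is a placeholder:

```
#!/bin/bash
# Minimal sketch: histogram of file sizes under a dataset mountpoint.
# /mnt/tank/dataset is a placeholder; point it at your own dataset.
# Requires GNU find for -printf (TrueNAS SCALE); on CORE use stat -f %z instead.
find /mnt/tank/dataset -type f -printf '%s\n' | awk '
{
    # Bucket each file into the smallest power-of-two size that holds it,
    # starting at 4K, roughly mirroring candidate recordsize values.
    bucket = 4096
    while (bucket < $1) bucket *= 2
    count[bucket]++
    total++
}
END {
    for (b in count)
        printf "%12d bytes or less: %d files (%.1f%%)\n", b, count[b], 100 * count[b] / total
}' | sort -n
```

That only covers the analysis part; deciding on a recordsize and rewriting the files is still left to you.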
After a bit more research I am now VERY confused about record sizes. Here is what man zfsprops says about record sizes:
recordsize: Specifies a suggested block size for files in the file system. This property is designed solely for use with database workloads that access files in fixed-size records. ZFS automatically tunes block sizes according to internal algorithms optimized for typical access patterns. For databases that create very large files but access them in small random chunks, these algorithms may be suboptimal. Specifying a recordsize greater than or equal to the record size of the database can result in significant performance gains. Use of this property for general purpose file systems is strongly discouraged, and may adversely affect performance.
YET… TrueNAS UI requires you to specify a recordsize for every dataset (even if it is inherited from the parent dataset).
So man says don’t specify a recordsize except for transactional databases, and TrueNAS forces you to set one for everything.
Can anyone explain who is right here?
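One thing that may help untangle the UI side of it: zfs get shows where the value a dataset is using actually comes from, so you can tell an explicitly set recordsize apart from the default or an inherited one. Pool and dataset names below are placeholders and the output is only illustrative:

```
# Show recordsize for a pool and all child datasets, including its origin.
# "tank" is a placeholder pool name.
zfs get -r recordsize tank

# Illustrative output - SOURCE distinguishes the built-in default,
# an inherited value and an explicitly set ("local") one:
# NAME           PROPERTY    VALUE  SOURCE
# tank           recordsize  128K   default
# tank/media     recordsize  1M     local
# tank/media/tv  recordsize  1M     inherited from tank/media
```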
Existing scripts
I found a better page with various scripts and more explanations here: linux - Generate distribution of file sizes from the command prompt - Super User
What does recordsize impact
- Recordsize does NOT impact RAIDZ space usage. Space usage (ignoring compression) is the size of the file divided by the ashift block size, rounded up (the number of data blocks), plus that number divided by the number of non-redundant disks in the vDev, rounded up, multiplied by the redundancy factor. For example, with a 4K ashift block size a 73KB file is 18.25 x 4KB, so it needs 19 data blocks; on a 6-wide RAIDZ2 with 4 non-redundant disks and 2 redundant disks, those 19 blocks need 5 stripes across the 4 non-redundant disks, i.e. 10 redundancy blocks, for a total of 29 blocks. Recordsize does not seem to be a factor here (a worked version of this arithmetic is sketched just after the list).
- I have a feeling that the checksum is per record, so with larger record sizes you will need to store fewer checksums, but each checksum is typically pretty small compared to the record size (unless you are talking record sizes measured in bytes rather than tens of KB) and so has little impact on overall space usage.
- Instead, I think that record size is more about the size of I/Os, which affects performance rather than space usage. The performance impact is much more important for random I/O, especially random writes to a small part of an existing file, and much less important for sequential reads and writes of entire files, which for network shares will likely be the norm (or sequential reads of large chunks, e.g. for streaming).
- Writes - I have a feeling that ZFS groups writes together anyway in order to maximise write bandwidth, so I am not sure that recordsize will have any appreciable impact on asynchronous write performance (or the deferred writes of synchronous writes), but it might well impact ZIL writes for synchronous writes.
- Reads - one stream - Given that the network block size is a fraction of the 128KB default recordsize, I would assume that the network response for the first block of a full-file read will be better with a small record size and worse with a big one. After that, sequential prefetch will kick in and the data will be read into memory in advance of it being needed to send over the network, so provided prefetch can keep pace with the network speed, once the file has started transmitting the record size is probably less important.
- Reads - multiple streams - If larger record sizes mean longer individual I/Os, then in a multi-user environment excessively large I/Os may result in long I/O times and non-prefetch I/Os being queued, meaning slower responses for small-file requests.
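As a sanity check on the arithmetic in the first bullet, here is the same calculation as a small script (it ignores compression and any allocation padding, and uses the 73KB / 4K ashift / 6-wide RAIDZ2 numbers from the example):

```
#!/bin/bash
# Worked version of the space calculation in the first bullet.
file_size=$((73 * 1024))   # 73KB file from the example
block=4096                 # ashift block size of 4K
data_disks=4               # 6-wide RAIDZ2: 4 non-redundant disks
parity_disks=2             # plus 2 redundant disks

data_blocks=$(( (file_size + block - 1) / block ))          # ceil(73KB / 4KB) = 19
stripes=$(( (data_blocks + data_disks - 1) / data_disks ))  # ceil(19 / 4)     = 5
parity_blocks=$(( stripes * parity_disks ))                 # 5 * 2            = 10
total_blocks=$(( data_blocks + parity_blocks ))             # 19 + 10          = 29

echo "data=$data_blocks parity=$parity_blocks total=$total_blocks blocks ($(( total_blocks * block / 1024 ))KB on disk)"
```

Recordsize never appears anywhere in that calculation, which is the point of the first bullet.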
Summary - now that I think more deeply about this, for small home servers, which typically don’t have competing I/Os from multiple users, I am really unsure that recordsize makes that much of a difference to performance.
Have I got this right, or alternatively have I got completely the wrong idea?
I think you’re reading between the lines here a bit too much. A record size with a value of some kind must exist. The default in ZFS (and TrueNAS) is 128K. So no one is right and no one is wrong. Perhaps the manpage should be worded a bit differently like this?
“Use of non-default values for this property for general purpose file systems is strongly discouraged, and may adversely affect performance.”
This is what OpenZFS has to say on the matter:
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html
Jim Salter has a nice article on Klara’s website here:
Anecdotally, for general purpose file sharing, 1M is a good alternative to 128K. You’ll likely see higher sequential read and write values when copying over large media files, as an example. You also may see better compression ratios for files which are compressible when using larger record sizes.
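If you do try 1M, a minimal example of the change (dataset and file names are placeholders) would be something like this; note that the new recordsize only applies to data written after the change, which is exactly why the rebalancing step in the original question exists:

```
# Set a 1M recordsize on a share that mostly holds large media files.
# "tank/media" is a placeholder dataset name.
zfs set recordsize=1M tank/media

# Existing files keep their old block size until they are rewritten,
# e.g. by copying to a temporary file and renaming it over the original
# (essentially what the rebalancing scripts mentioned earlier do).
cp -p bigfile.mkv bigfile.mkv.tmp && mv bigfile.mkv.tmp bigfile.mkv
```

Be aware that rewriting files in bulk costs extra space while any snapshots still reference the old blocks.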