16 MiB recordsize?!?!?!?! <--- clickbait punctuation marks for the algorithm

While that’s a good general guide of what recordsize to use, based on the types of files stored on the dataset, it doesn’t mention the performance impact of multi-core CPUs and hyperthreading.

I think this gets overlooked when considering the recordsize in ZFS.

You want to find that “sweet spot”[1] between…

:heavy_check_mark: fewer metadata operations (and cruft), where larger blocks yield better results

and…

:heavy_check_mark: more compression/decompression and encryption/decryption processes that can leverage your CPU cores and threads, where smaller blocks promote more parallel processing

Go too far in either direction and you start to lose the benefit of the other.

If your blocks are too small, you store an excessive amount of metadata, even for large files, and that can become a bottleneck for large, sequential I/O operations, despite compression and encryption being processed in parallel.

If your blocks are too big, you might not get the most out of your CPU, since compression and encryption operations are not parallelized to the same degree as they would be with smaller blocks.
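
To put rough numbers on that tradeoff, here’s a minimal sketch (the 40 GiB file size is just an illustrative assumption): every record is one unit of checksumming, compression, and encryption, and roughly one block pointer’s worth of metadata.

```python
# Back-of-the-envelope: how many records (and therefore checksum /
# compression / encryption units, plus block pointers) one large file
# needs at different recordsizes. The 40 GiB file size is just an example.
FILE_SIZE = 40 * 1024**3  # 40 GiB

for label, rs in [("128K", 128 * 1024), ("1M", 1024**2),
                  ("2M", 2 * 1024**2), ("16M", 16 * 1024**2)]:
    blocks = -(-FILE_SIZE // rs)  # ceiling division
    print(f"recordsize={label:>4}: {blocks:>9,} blocks")
```

Fewer blocks means fewer checksums and block pointers to store, but also fewer independent units the CPU can work on in parallel.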

It’s all theoretical. Not sure how much of it translates into real-world experiences. I can vouch for 1-MiB recordsize. It’s given me the best results, and I’ve seen boosts in my sequential speeds when compared to the “default” 128-KiB recordsize.


  1. Ironically, prior to exposing (and supporting) 16M recordsize, 1M used to be the “high end”. Now it’s back in the “middle”. :smile: Maybe we can consider 1M to be the new “128K”? Perhaps 1M should be the new default for TrueNAS datasets, and the user can decide if they want to adjust it higher or lower, based on their use-case, types of data, databases, etc.? ↩︎

3 Likes

Depending on what you want, it might not be a disadvantage. Point is, as with all powerful tools (and ZFS is one)… power is nothing without control[1].


  1. aka knowledge. ↩︎

1 Like

That was my biggest reason for trying out 16 MB record size - to see if seeking in a video might be more responsive, as I find it a bit slow on Plex. My thinking was that I might incur fewer disk reads if I have ZFS read 16 MB at once, so nearby seeks would be served directly from ARC instead of having to go the scenic route.

It did exactly diddly squat - on the same 40 G movie, 10/20/30 second fast forwards were equally slow on 1 and 16 MB record sizes. I don’t know if that is inherent slowness in the Plex client, whether my TV is too slow, or whether the genuine bottleneck is on the storage side, but that’s what I observed.
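
As a back-of-the-envelope check on why the bigger record might not have helped here (assuming, hypothetically, roughly a two-hour runtime for that 40 G movie):

```python
# Rough check: how many seconds of video fit in one record?
# The ~2-hour runtime for the 40 GB movie is a hypothetical assumption.
movie_bytes = 40 * 1000**3             # 40 GB
runtime_s = 2 * 3600                   # assumed runtime in seconds
bytes_per_s = movie_bytes / runtime_s  # ~5.6 MB/s average bitrate

for rs_mib in (1, 2, 16):
    seconds = rs_mib * 1024**2 / bytes_per_s
    print(f"{rs_mib:>2} MiB record ≈ {seconds:.1f} s of video")
```

Even a full 16 MiB record only covers a few seconds of playback at that bitrate, so a 10-30 second skip still lands outside whatever was just read from disk, which could explain why the record size made no visible difference.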

I’ve since landed on 2 M record size as the acceptable tradeoff for my media dataset. It halves the checksum ops and metadata counts, whilst adding little memory pressure and latency. Anything more, and we’re truly running into diminishing returns.

EDIT: One thing I forgot to mention - my disk scrub times seem to have noticeably improved. With 16 MB record size, full 10 TB disk scrubs had taken 17+ hours. With 2 MB record size, the same disk scrub takes 3+ hours less. I am not sure how to explain this.


3 Likes

If you try it using MPV from a client computer over SMB or NFS, I’m sure you’ll find it more responsive than the Plex player / TV app.


Maybe related to this? (More parallel processing for the CPU yields better performance?)

EDIT: Wait. Changing the dataset’s recordsize does not change the blocks (and their sizes) of existing files.

@dxun did not reuse existing files, but instead recreated the setup each time, so the files were written with the new recordsize.

That is certainly the case - no problems seeking with something like MPC-HC from a desktop. But the Plex app on the TV seems to be rather slow. I get faster seeks on Amazon Prime or Disney than locally, which is a bit annoying.

That may be a good explanation; I missed that. Spot-checking, I did not see any difference in CPU consumption in either scrub run, though. Maybe the CPUs are so underutilised by the operation that we’re actually hitting a disk bottleneck, as the disk isn’t capable of retrieving 16 MB blocks quickly enough to saturate the cores?

Also, I recreated the pool after each record size change and testing so that I am sure the files were created with the new record size.

1 Like

Quite possibly. Maybe a combination of reasons discussed in this thread.

It looks like something from 1M to 4M recordsize is the ideal range for performance and efficiency, even for large media files.

I read somewhere (from a user or developer?) that there’s a dropoff after 4M, where you hit a plateau or diminishing returns.

I’m happy with 1M. Have been ever since I set it a while ago. Not even sure if I’ll set any datasets to 4M.

Whoever started this thread made a bunch of hoopla over nothing, as if 16-MiB recordsize is some sort of miracle. :roll_eyes:

1 Like

I think it’s a very worthwhile discussion - at least we’ve hopefully prevented other people from going down the same path some of us have gone.

That being said - what would be the use case for such extreme recordsizes?

Absolutely not his style… :hand_with_index_finger_and_thumb_crossed:

Very large files.

Compressible files would be a good use, as it gives the compression algorithm more to work with at a time, and it doesn’t cost you much CPU, unlike switching to higher ZSTD levels. So for something like database backups I can see it being extremely lucrative.
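
As a rough illustration of the “more to work with at a time” point, here’s a minimal sketch using zlib from the Python standard library as a stand-in for ZSTD (the synthetic data and chunk sizes are assumptions; zlib’s small window keeps the effect modest, while ZSTD with its larger windows gets more out of big records):

```python
# Minimal sketch: compress the same data in small vs. large chunks.
# zlib (standard library) stands in for ZSTD; resetting the compressor
# for every chunk roughly mimics per-record compression.
import zlib

# Synthetic, repetitive "database dump"-style data (an assumption).
data = b"INSERT INTO t VALUES (42, 'some repetitive payload');\n" * 500_000

def compressed_size(chunk_size: int) -> int:
    total = 0
    for off in range(0, len(data), chunk_size):
        total += len(zlib.compress(data[off:off + chunk_size], 6))
    return total

for label, size in [("128 KiB", 128 * 1024), ("16 MiB", 16 * 1024**2)]:
    print(f"{label} chunks -> {compressed_size(size):,} bytes compressed")
```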

1 Like

I already reported him to the mods.


I’m not so sure about this either, because of the reasons mentioned above in this thread: (1) it doesn’t leverage as much parallel processing for hashes, encryption, and compression; (2) memory pressure against the ARC from holding big objects in kernel memory; and (3) inefficient in-place modifications, which can be a big deal for something like rsync backups.

That’s why I believe something from 1 to 4 MiB is the ideal recordsize for general purpose, regardless of the sizes of individual files. Media, archives, documents, etc.

Beyond 4-MiB, you might enter diminishing returns, where “less (ZFS) metadata per file and larger sequential chunks of data” no longer outweigh the drawbacks listed above, and in fact could end up harming performance without much benefit from the larger data blocks.
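
On point (3), a quick back-of-the-envelope sketch (the 4 KiB modification size is an arbitrary assumption): because ZFS is copy-on-write at record granularity, touching a few bytes rewrites the whole record.

```python
# Back-of-the-envelope for point (3): ZFS is copy-on-write at record
# granularity, so changing a few bytes rewrites the whole record.
# The 4 KiB modification size is an arbitrary example.
modified = 4 * 1024

for label, rs in [("128K", 128 * 1024), ("1M", 1024**2), ("16M", 16 * 1024**2)]:
    print(f"recordsize={label:>4}: a 4 KiB change rewrites {rs // 1024:>6} KiB "
          f"(~{rs // modified}x write amplification)")
```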

Having a ludicrous setting can sometimes help in finding the optimal setting :wink:

1 Like

It looks like the use of huge record sizes (let’s call anything > 2 MB a huge record size) is then limited to a subset of archival purposes where the stored files benefit from a huge record size, giving the user:

  • a higher compression ratio without using a more taxing algorithm (say, anything above ZSTD+5?)
  • less metadata occupancy

but at the expense of:

  • less parallelism during (de)compression, which produces greater latencies during file ops
  • higher kernel/ARC memory pressure

This limits the use of such huge record sizes for:

  • well-compressible backups (e.g. database)
  • large, uncompressed text files (e.g. LLM model data?)

For the majority of purposes, it is not beneficial to go north of 1-2 MB record size - certainly not for most home user or media storage purposes.
Does that sound reasonable?

2 Likes

I would love to see some testing on that.
My gut feeling says that 16MiB is great for movies because:

  • less metadata is great
  • less parallelism does not matter
  • higher kernel/ARC pressure is not a thing when only reading 1-5 files at a time
1 Like

Nemo auditur propriam turpitudinem allegans. (“No one is heard who invokes his own wrongdoing.”)

So would I. But what would be an appropriate benchmark?

Hmmmm… 5 concurrent streams that read a file?
Best thing I can offer is running arc_summary, copying 50 10-GB files from TrueNAS concurrently, and comparing that with arc_summary again?

Unfortunately I am not skilled enough to know where to look for the kernel/ARC pressure theory.
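
For what it’s worth, a minimal sketch of that concurrent-streams idea (the file paths and stream count are placeholders): read several large files at once and report aggregate throughput, with arc_summary run separately before and after.

```python
# Rough sketch of the "N concurrent streams" idea: read several large
# files at once and report aggregate throughput. Paths and stream count
# are placeholders; run `arc_summary` separately before and after.
import time
from concurrent.futures import ThreadPoolExecutor

FILES = [f"/mnt/tank/media/testfile{i}.bin" for i in range(5)]  # hypothetical paths
CHUNK = 1024 * 1024  # read in 1 MiB chunks

def read_file(path: str) -> int:
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(FILES)) as pool:
    total_bytes = sum(pool.map(read_file, FILES))
elapsed = time.monotonic() - start
print(f"{total_bytes / 1024**3:.1f} GiB in {elapsed:.1f} s "
      f"({total_bytes / 1024**2 / elapsed:.0f} MiB/s aggregate)")
```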

I feel like writes would benefit the most from such a big recordsize, but I might be wrong.

Not sure, myself.

We already have an example from @dxun showing a notable decrease in scrub performance with 16M recordsize (assuming full 16M blocks) vs 2M recordsize.

This hints that for sequential reads, 16M recordsize can indeed backfire, doing more harm than good.

Consider that the following are processed in a single thread for every block that is read:

  • Generating a hash (to check its integrity)
  • Decryption (if the block is encrypted)
  • Decompression (if the block contains any form of compression)

Doing this in parallel with 4, 8, or 16 threads (across 16 MiB of data split into smaller blocks) is theoretically superior to doing it in a single thread against one 16 MiB block.
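
A minimal sketch of that argument (purely illustrative; ZFS does its checksumming in the kernel, not like this): hashlib releases the GIL for large buffers, so hashing sixteen 1 MiB blocks on a thread pool can spread across cores, whereas one 16 MiB block stays on a single thread. At only 16 MiB the timings are noisy and pool overhead matters; the scaling pattern is the point, not the absolute numbers.

```python
# Purely illustrative: hash one 16 MiB block single-threaded vs. sixteen
# 1 MiB blocks on a thread pool. hashlib releases the GIL for large
# buffers, so the pooled version can use multiple cores.
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

data = os.urandom(16 * 1024**2)  # 16 MiB of data
blocks = [data[i:i + 1024**2] for i in range(0, len(data), 1024**2)]

t0 = time.perf_counter()
hashlib.sha256(data).digest()    # one big block, single thread
t1 = time.perf_counter()

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    list(pool.map(lambda b: hashlib.sha256(b).digest(), blocks))
t2 = time.perf_counter()

print(f"1 x 16 MiB block : {t1 - t0:.4f} s")
print(f"16 x 1 MiB blocks: {t2 - t1:.4f} s (thread pool)")
```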


I think this is the best summary of why one might consider 16M recordsize.

You would also have to pair the above with a higher-level compression setting, such as ZSTD-9+.