16 MiB recordsize?!?!?!?! <--- clickbait punctuation marks for the algorithm

While that’s a good general guide of what recordsize to use, based on the types of files stored on the dataset, it doesn’t mention the performance impact of multi-core CPUs and hyperthreading.

I think this gets overlooked when considering the recordsize in ZFS.

You want to find that “sweet spot”[1] between…

:heavy_check_mark: fewer metadata operations (and cruft), where larger blocks yield better results

and…

:heavy_check_mark: more compression/decompression and encryption/decryption processes that can leverage your CPU cores and threads, where smaller blocks promote more parallel processing

Go too far in either direction and you start to lose the benefit of the other.

If your blocks are too small, you store an excessive amount of metadata, even for large files, and that can become a bottleneck for large, sequential I/O operations, despite compression and encryption being processed in parallel.

If your blocks are too big, you might not get the most out of your CPU, since compression and encryption operations are not parallelized to the same degree as they would be with smaller blocks.
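
To put rough numbers on that tradeoff, here’s a minimal sketch (the 40 GiB file size is just an illustrative assumption): every record is one unit of checksumming, compression, and encryption, and roughly one block pointer’s worth of metadata.

```python
# Back-of-the-envelope: how many records (and therefore checksum /
# compression / encryption units, plus block pointers) one large file
# needs at different recordsizes. The 40 GiB file size is just an example.
FILE_SIZE = 40 * 1024**3  # 40 GiB

for label, rs in [("128K", 128 * 1024), ("1M", 1024**2),
                  ("2M", 2 * 1024**2), ("16M", 16 * 1024**2)]:
    blocks = -(-FILE_SIZE // rs)  # ceiling division
    print(f"recordsize={label:>4}: {blocks:>9,} blocks")
```

Fewer blocks means fewer checksums and block pointers to store, but also fewer independent units the CPU can work on in parallel.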

It’s all theoretical. Not sure how much of it translates into real-world experiences. I can vouch for 1-MiB recordsize. It’s given me the best results, and I’ve seen boosts in my sequential speeds when compared to the “default” 128-KiB recordsize.


  1. Ironically, prior to exposing (and supporting) 16M recordsize, 1M used to be the “high end”. Now it’s back in the “middle”. :smile: Maybe we can consider 1M to be the new “128K”? Perhaps 1M should be the new default for TrueNAS datasets, and the user can decide if they want to adjust it higher or lower, based on their use-case, types of data, databases, etc.? ↩︎

3 Likes

Depending on what you want, it might not be a disadvantage. Point is, as with all powerful tools (and ZFS is one)… power is nothing without control[1].


  1. aka knowledge. ↩︎

1 Like

That was my biggest reason for trying out 16 MB record size - to see if seeking in a video might be more responsive, as I find it a bit slow on Plex. My thinking was that I might incur fewer disk reads if I have ZFS read 16 MB at once, so nearby seeks would be served directly from ARC instead of having to go the scenic route.

It did exactly diddly squat - on the same 40 G movie, 10/20/30 second fast forwards were equally slow on 1 and 16 MB record sizes. I don’t know if that is inherent slowness in the Plex client, whether my TV is too slow, or whether the genuine bottleneck is on the storage side, but that’s what I observed.
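
As a back-of-the-envelope check on why the bigger record might not have helped here (assuming, hypothetically, roughly a two-hour runtime for that 40 G movie):

```python
# Rough check: how many seconds of video fit in one record?
# The ~2-hour runtime for the 40 GB movie is a hypothetical assumption.
movie_bytes = 40 * 1000**3             # 40 GB
runtime_s = 2 * 3600                   # assumed runtime in seconds
bytes_per_s = movie_bytes / runtime_s  # ~5.6 MB/s average bitrate

for rs_mib in (1, 2, 16):
    seconds = rs_mib * 1024**2 / bytes_per_s
    print(f"{rs_mib:>2} MiB record ≈ {seconds:.1f} s of video")
```

Even a full 16 MiB record only covers a few seconds of playback at that bitrate, so a 10-30 second skip still lands outside whatever was just read from disk, which could explain why the record size made no visible difference.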

I’ve since landed on 2 M record size as the acceptable tradeoff for my media dataset. It halves the checksum ops and metadata counts, whilst adding little memory pressure and latency. Anything more, and we’re truly running into diminishing returns.

EDIT: One thing I forgot to mention - my disk scrub times seem to have noticeably improved. With 16 MB record size, full 10 TB disk scrubs had taken 17+ hours. With 2 MB record size, the same disk scrub takes 3+ hours less. I am not sure how to explain this.


3 Likes

If you try it using MPV from a client computer over SMB or NFS, I’m sure you’ll find it more responsive than the Plex player / TV app.


Maybe related to this? (More parallel processing for the CPU yields better performance?)

EDIT: Wait. Changing the dataset’s recordsize does not change the blocks (and their sizes) of existing files.

@dxun did not reuse existing files, but instead recreated the setup each time, so the files were written with the new recordsize.

That is certainly the case - no problems seeking with something like MPC-HC from a desktop. But the Plex app on the TV seems to be rather slow. I get faster seeks on Amazon Prime or Disney than locally, which is a bit annoying.

That may be a good explanation; I missed that. Spot-checking, I did not see any difference in CPU consumption in either scrub run, though. Maybe the CPUs are so underutilised by the operation that we’re actually hitting a disk bottleneck, as the disk isn’t capable of retrieving 16 MB blocks quickly enough to saturate the cores?

Also, I recreated the pool after each record size change and testing so that I am sure the files were created with the new record size.

1 Like

Quite possibly. Maybe a combination of reasons discussed in this thread.

It looks like something from 1M to 4M recordsize is the ideal range for performance and efficiency, even for large media files.

I read somewhere (from a user or developer?) that there’s a dropoff after 4M, where you hit a plateau or diminishing returns.

I’m happy with 1M. Have been ever since I set it a while ago. Not even sure if I’ll set any datasets to 4M.

Whoever started this thread made a bunch of hoopla over nothing, as if 16-MiB recordsize is some sort of miracle. :roll_eyes:

1 Like

I think it’s a very worthwhile discussion - at least we’ve hopefully prevented other people from going down the same path some of us have gone.

That being said - what would be the use case for such extreme recordsizes?

Absolutely not his style… :hand_with_index_finger_and_thumb_crossed:

Very large files.

Compressible files would be a good use, as it gives the compression algorithm more to work with at a time, and it doesn’t cost you much CPU, unlike switching to higher ZSTD levels. So for something like database backups I can see it being extremely lucrative.
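
As a rough illustration of the “more to work with at a time” point, here’s a minimal sketch using zlib from the Python standard library as a stand-in for ZSTD (the synthetic data and chunk sizes are assumptions; zlib’s small window keeps the effect modest, while ZSTD with its larger windows gets more out of big records):

```python
# Minimal sketch: compress the same data in small vs. large chunks.
# zlib (standard library) stands in for ZSTD; resetting the compressor
# for every chunk roughly mimics per-record compression.
import zlib

# Synthetic, repetitive "database dump"-style data (an assumption).
data = b"INSERT INTO t VALUES (42, 'some repetitive payload');\n" * 500_000

def compressed_size(chunk_size: int) -> int:
    total = 0
    for off in range(0, len(data), chunk_size):
        total += len(zlib.compress(data[off:off + chunk_size], 6))
    return total

for label, size in [("128 KiB", 128 * 1024), ("16 MiB", 16 * 1024**2)]:
    print(f"{label} chunks -> {compressed_size(size):,} bytes compressed")
```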

1 Like

I already reported him to the mods.


I’m not so sure about this either, because of the reasons mentioned above in this thread: (1) it doesn’t leverage as much parallel processing for hashes, encryption, and compression; (2) memory pressure against the ARC from holding big objects in kernel memory; and (3) inefficient in-place modifications, which can be a big deal for something like rsync backups.

That’s why I believe something from 1 to 4 MiB is the ideal recordsize for general purpose, regardless of the sizes of individual files. Media, archives, documents, etc.

Beyond 4-MiB, you might enter diminishing returns, where “less (ZFS) metadata per file and larger sequential chunks of data” no longer outweigh the drawbacks listed above, and in fact could end up harming performance without much benefit from the larger data blocks.
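
On point (3), a quick back-of-the-envelope sketch (the 4 KiB modification size is an arbitrary assumption): because ZFS is copy-on-write at record granularity, touching a few bytes rewrites the whole record.

```python
# Back-of-the-envelope for point (3): ZFS is copy-on-write at record
# granularity, so changing a few bytes rewrites the whole record.
# The 4 KiB modification size is an arbitrary example.
modified = 4 * 1024

for label, rs in [("128K", 128 * 1024), ("1M", 1024**2), ("16M", 16 * 1024**2)]:
    print(f"recordsize={label:>4}: a 4 KiB change rewrites {rs // 1024:>6} KiB "
          f"(~{rs // modified}x write amplification)")
```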

Having a ludicrous setting can sometimes help in finding the optimal setting :wink:

1 Like

It looks like the use of huge record sizes (let’s call anything > 2 MB a huge record size) is then limited to a subset of archival purposes where the stored files benefit from a huge record size, giving the user:

  • a higher compression ratio without using a more taxing algorithm (say, anything above ZSTD+5?)
  • less metadata occupancy

but at the expense of:

  • less parallelism during (de)compression, which produces greater latencies during file ops
  • higher kernel/ARC memory pressure

This limits the use of such huge record sizes for:

  • well-compressible backups (e.g. database)
  • large, uncompressed text files (e.g. LLM model data?)

For the majority of purposes, it is not beneficial to go north of 1-2 MB record size - certainly not for most home user or media storage purposes.
Does that sound reasonable?

2 Likes

I would love to see some testing on that.
My gut feeling says that 16MiB is great for movies because:

  • less metadata is great
  • less parallelism does not matter
  • higher kernel/ARC pressure is not a thing when only reading 1-5 files at a time
1 Like

Nemo auditur propriam turpitudinem allegans. (“No one is heard who invokes his own wrongdoing.”)

So would I. But what would be an appropriate benchmark?

Hmmmm… 5 concurrent streams that read a file?
Best thing I can offer is running arc_summary, copying 50 10-GB files from TrueNAS concurrently, and comparing that with arc_summary again?

Unfortunately I am not skilled enough to know where to look for the kernel/ARC pressure theory.
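
For what it’s worth, a minimal sketch of that concurrent-streams idea (the file paths and stream count are placeholders): read several large files at once and report aggregate throughput, with arc_summary run separately before and after.

```python
# Rough sketch of the "N concurrent streams" idea: read several large
# files at once and report aggregate throughput. Paths and stream count
# are placeholders; run `arc_summary` separately before and after.
import time
from concurrent.futures import ThreadPoolExecutor

FILES = [f"/mnt/tank/media/testfile{i}.bin" for i in range(5)]  # hypothetical paths
CHUNK = 1024 * 1024  # read in 1 MiB chunks

def read_file(path: str) -> int:
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            total += len(chunk)
    return total

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(FILES)) as pool:
    total_bytes = sum(pool.map(read_file, FILES))
elapsed = time.monotonic() - start
print(f"{total_bytes / 1024**3:.1f} GiB in {elapsed:.1f} s "
      f"({total_bytes / 1024**2 / elapsed:.0f} MiB/s aggregate)")
```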

I feel like writes would benefit the most from such a big recordsize, but I might be wrong.

Not sure, myself.

We already have an example from @dxun showing a notable decrease in scrub performance with 16M recordsize (assuming full 16M blocks) vs 2M recordsize.

This hints that for sequential reads, 16M recordsize can indeed backfire, doing more harm than good.

Consider that the following are processed in a single thread for every block that is read:

  • Generating a hash (to check its integrity)
  • Decryption (if the block is encrypted)
  • Decompression (if the block contains any form of compression)

Doing this in parallel with 4, 8, or 16 threads (across 16 MiB of data split into smaller blocks) is theoretically superior to doing it in a single thread against one 16 MiB block.
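
A minimal sketch of that argument (purely illustrative; ZFS does its checksumming in the kernel, not like this): hashlib releases the GIL for large buffers, so hashing sixteen 1 MiB blocks on a thread pool can spread across cores, whereas one 16 MiB block stays on a single thread. At only 16 MiB the timings are noisy and pool overhead matters; the scaling pattern is the point, not the absolute numbers.

```python
# Purely illustrative: hash one 16 MiB block single-threaded vs. sixteen
# 1 MiB blocks on a thread pool. hashlib releases the GIL for large
# buffers, so the pooled version can use multiple cores.
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

data = os.urandom(16 * 1024**2)  # 16 MiB of data
blocks = [data[i:i + 1024**2] for i in range(0, len(data), 1024**2)]

t0 = time.perf_counter()
hashlib.sha256(data).digest()    # one big block, single thread
t1 = time.perf_counter()

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    list(pool.map(lambda b: hashlib.sha256(b).digest(), blocks))
t2 = time.perf_counter()

print(f"1 x 16 MiB block : {t1 - t0:.4f} s")
print(f"16 x 1 MiB blocks: {t2 - t1:.4f} s (thread pool)")
```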


I think this is the best summary of why one might consider 16M recordsize.

You would also have to pair the above with a higher-level compression setting, such as ZSTD-9+.