16 MiB recordsize?!?!?!?! <--- clickbait punctuation marks for the algorithm

I never noticed this on TrueNAS Core 13.3, until I decided to create a new dataset.

[Screenshot: “16M” now appears in the recordsize drop-down when creating a new dataset]


I know that this has already been available in ZFS, but now it’s exposed in the GUI?

Is this an oversight, or does iXsystems consider a 16-MiB recordsize tried-and-true enough to allow for datasets on a TrueNAS server?

For those using SCALE, does the drop-down also make 16M available?


I’ve had phenomenal results with pretty much all of my datasets using 1-MiB recordsize, which was the highest available level (in the GUI) for TrueNAS Core 13.0 and prior.

Is there an obvious caveat I’m missing that would cause performance and efficiency issues by setting it to 16M? Because of inline compression, no file will consume an extra ~16 MiB due to “padding”. So a 500 KiB file will consume approximately 500 KiB. The same is true for a 2 MiB file, a 10 MiB file, and so on. Nor will a 33 MiB file consume 48 MiB on the storage medium. It will consume 33 MiB, as expected. (First block, 16 MiB. Second block, 16 MiB. Third block, 1 MiB, as the padding of zeroes is compressed to nothing.)
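
If you want to sanity-check this on a real dataset, a quick sketch like the one below compares a file’s apparent size to what it actually consumes on disk (the path is just a placeholder for a file sitting on a 16M-recordsize dataset):

```python
# Rough sketch: compare a file's apparent size with the space it
# actually consumes on disk. The path is a placeholder.
import os

path = "/mnt/tank/test/33mib-file"   # hypothetical file on a 16M-recordsize dataset

st = os.stat(path)
apparent = st.st_size               # logical file size in bytes
on_disk = st.st_blocks * 512        # st_blocks is always counted in 512-byte units

print(f"apparent: {apparent / 2**20:.1f} MiB")
print(f"on disk:  {on_disk / 2**20:.1f} MiB")
# With inline compression enabled, the "padded" tail block should not
# inflate the on-disk figure much beyond the apparent size.
```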

So talk me out of this! Tell me why I shouldn’t set the recordsize to 16M? Trample on my dreams, why don’t you? :rage:


Yes, Dragonfish-24.04.2 also has a 16M recordsize option.
I have not tried it (yet).
This old PDF https://www.usenix.org/system/files/login/articles/login_winter16_09_jude.pdf
suggests that it only matters if the application writes to the dataset using exactly the recordsize the dataset was created with.
Best Regards,
Antonio


:smiling_imp:


Why does a 33MiB file use 48MiB, when the recordsize is 1MiB?
You think because of padding?

That comment was in regards to theoretically setting the recordsize to 16M.

If your dataset’s recordsize is set to 16M…

A 33 MiB file, with no inline compression, three blocks on the storage medium:
16 + 16 + 16 = 48 MiB

A 33 MiB file, with any inline compression, three blocks on the storage medium:
16 + 16 + 1 = 33 MiB
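
Purely as a throwaway illustration of that arithmetic (not ZFS code, just the two cases above in Python):

```python
# Toy model of the two cases above: how a 33 MiB file splits into
# 16 MiB blocks, with and without the zero "padding" compressing away.
MIB = 2**20
recordsize = 16 * MIB
file_size = 33 * MIB

full_blocks, tail = divmod(file_size, recordsize)

no_compression = (full_blocks + (1 if tail else 0)) * recordsize
with_compression = full_blocks * recordsize + tail  # padding compresses to ~nothing

print(f"no compression:   {no_compression / MIB:.0f} MiB")    # 48 MiB
print(f"with compression: {with_compression / MIB:.0f} MiB")  # 33 MiB
```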

I thought that recordsize is only a max value.
Would it not be
16 + 16 + 1 = 48 MiB
even without compression?
Or do all blocks have to be the same size for any given file?

If the blocks on the physical disk(s) consume 16 + 16 + 1 megabytes, then all three blocks, which comprise the file, consume a total of 33 megabytes.

EDIT: Even for “incompressible” files, ZFS’s inline compression (ZLE, LZ4, ZSTD) will essentially squish the 15 MiB’s worth of “padded” zeroes into nothing. Hence, the last block will consume 1 MiB, even if the recordsize is set to 16M.

Sharing my current experience, as I am running SCALE 24.04.2 as well and can see the option. In fact, I am using it right now for a brand new Plex media pool.

I believe this thread summarises the pros and cons nicely, and it jibes with what I am seeing:

  • the metadata usage will decrease
  • the (sequential) read speed might slightly ™ increase
  • the (sequential) write speed seems to take a latency hit if the dataset/pool is using compression (which probably 99.99% of datasets should be using?)

The last point is an interesting one - I think I am observing it right now as I am replicating the 16M dataset between two SCALE machines. You can see the hardware in my signature; this is basically doing the replication between the primary and backup storage server over a 10G fibre link on the same subnet (so no routing penalties). Here is the reporting graph from the backup server - note the oscillations in write speed. This becomes much less prominent when doing the replication with a 1M recordsize, and my explanation is that this is due to compression latency, as it simply takes time to compress the whole block. Interestingly, the ARC is completely unused; I would expect it to absorb at least some of the writes at higher speed, do the compression in memory, and then write at the pool speed. The media dataset compression algorithm is LZ4.


I have observed this copying the 16M dataset between two pools as well (on the primary storage). I was copying from my Plex pool (a single 20 TB disk) to the Tank pool (3x 2-way mirrors of 10+ TB disks), and the write speeds with ZSTD compression were appalling - 40-80 MB/s with high latencies between individual files. The latency between each copy operation was visibly noticeable in mc.

Once my pool replication finishes, I’ll see if I can have more graphs and more tests with and w/o compression.
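
In the meantime, for anyone who wants a crude comparison of their own, a rough sketch like this times sequential writes to each dataset (the mountpoints are placeholders; fio is the proper tool for real benchmarking):

```python
# Crude sequential-write timer: writes a few GiB to a path on each
# dataset and reports MiB/s. Rough sketch only; use fio for anything
# serious. The mountpoints below are placeholders.
import os
import time

MIB = 2**20
CHUNK = os.urandom(16 * MIB)   # random payload, like already-compressed media
TOTAL = 4 * 1024 * MIB         # 4 GiB per test

def write_test(path: str) -> float:
    t0 = time.perf_counter()
    with open(path, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(CHUNK)
            written += len(CHUNK)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - t0
    os.remove(path)
    return (TOTAL / MIB) / elapsed

for mountpoint in ("/mnt/tank/rec1m", "/mnt/tank/rec16m"):   # hypothetical datasets
    print(f"{mountpoint}: {write_test(mountpoint + '/bench.tmp'):.0f} MiB/s")
```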


I think I know why.

Not at my computer, but will reply later.

I did this on CORE, works great. Saved me a few hundred GB of space.


16MB recordsize has been there for quite some time on SCALE. I wouldn’t recommend it over 1MB unless you have clear benefits to gain from it, i.e. compression ratio. For bulk storage of incompressible files (i.e. video), 1MB is about ideal.

16MB comes with the caveat of quite big objects in kernel memory, which can result in memory pressure/fragmentation issues that will mostly cause the ARC to collapse during heavy I/O and the like… but since the ARC stores already-compressed records, this issue should be minimized if your data is highly compressible.


I will post some testing once I get my case.


This is one reason why a 16M recordsize (or anything above 4M) should probably be avoided. It’s still unclear if this poses an issue for modern versions of Linux and/or FreeBSD.


But I think the reason someone might see diminishing returns (or even degraded performance) with recordsizes above 4M is likely a simpler one: parallel processing.

As far as I understand with ZFS, computing hashes and (de)compression is a single-threaded per-block operation. However, ZFS can process multiple blocks in parallel, which essentially gives you the same performance benefits as outright multithreading.


Compare these scenarios.

Scenario A, 1M recordsize:
To read or write a 16 MiB file, you have 16 checksum operations. :rabbit2:

Scenario B, 2M recordsize:
To read or write a 16 MiB file, you have 8 checksum operations. :rabbit2:

Scenario C, 4M recordsize:
To read or write a 16 MiB file, you have 4 checksum operations. :rabbit2:

Scenario D, 16M recordsize:
To read or write a 16 MiB file, you have 1 checksum operation. :turtle:

The “sweet spot” for modern multi-core CPUs (and/or with “hyperthreading”) may indeed fall somewhere between a 1M and 4M recordsize, so that there are enough parallel operations, per processed file, to leverage the CPU more efficiently.

*The above would also apply to (de)compression and encryption.


If you have a 16 MiB file, it might sound “better” for it to be comprised of a single 16 MiB block, and hence you set the recordsize to 16M. However, it’s likely better to have it split into 4 or 8 blocks, to leverage your CPU’s multi-core and hyperthreading features, so that it can process the checksums, compression, and/or encryption with better performance.
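
As a very loose illustration of the idea (this is generic Python, not ZFS internals), hashing sixteen 1 MiB chunks in a thread pool versus one 16 MiB chunk shows how splitting work across blocks can exploit multiple cores:

```python
# Loose illustration only: hash 16 MiB as one chunk vs. sixteen 1 MiB
# chunks in a thread pool. Not ZFS code; it just shows the parallelism idea.
# hashlib releases the GIL for large buffers, so the threads really do run
# in parallel here.
import hashlib
import os
import time
from concurrent.futures import ThreadPoolExecutor

MIB = 2**20
data = os.urandom(16 * MIB)

def checksum(buf: bytes) -> str:
    return hashlib.sha256(buf).hexdigest()

# One 16 MiB "block"
t0 = time.perf_counter()
checksum(data)
print(f"1 x 16 MiB: {time.perf_counter() - t0:.3f} s")

# Sixteen 1 MiB "blocks" processed in parallel
chunks = [data[i:i + MIB] for i in range(0, len(data), MIB)]
t0 = time.perf_counter()
with ThreadPoolExecutor() as pool:
    list(pool.map(checksum, chunks))
print(f"16 x 1 MiB: {time.perf_counter() - t0:.3f} s")
```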


Someone like @HoneyBadger is going to jump in here and tell me how wrong I am. I can take it. I’m not insecure… :sob: :baby:


I did a quick test a month or two ago running 1 MiB vs. 4 MiB with synthetics and found 4 MiB introduced a bit of a performance hit vs. 1 MiB. It may just be a matter of other parts of the stack not being used to it, or some interaction that is antagonistic to caches… didn’t deep dive on it. 4 MiB is interesting because it is the default rsize/wsize for SMB mounts. I also tried bumping bsize from its default 1MiB to 4 MiB at the same time - didn’t see any advantage in my quick testing.

It’s good to see the results of some of y’all’s experiments. :slight_smile:


I also think if you are jumping around a large file… say seeking a video… zfs is going to have to read 16MB in order to verify a checksum before returning a frame… which may be much less than 16MB.


That’s actually an interesting point!

Though, I think much of it is mitigated with modern video players, such as MPV and VLC, which preload much of the video over the local network, even if you haven’t skipped ahead from the beginning.

So, in theory, if you “seek around” near the current playback position, it will bypass the ZFS read/checksum operations, as the seeking is contained within the client’s RAM. (The operations of “read from ARC or disk and compute checksums” were already done in the background on the server’s end before you tried seeking the video.) Those chunks of 16 megabytes were already pulled from disk/ARC (on the ZFS server), and sent to the client’s RAM. Now it’s the client using its RAM to seek with instantaneous performance.

I’m sure there is nuance, such as the media player, the configurations, and something else I am overlooking.


EDIT: Even so, this isn’t in defense of 16M recordsize. Even if I can get smooth video playback with 16M recordsize, I am still unlikely to set it. Upon further consideration, I’m thinking that 1M is probably the best bet (for reasons mentioned earlier in the thread), and 4M might be the absolute highest I’m willing to even entertain.

That was poorly written on my part.
I will try again.

If I have a 16MiB recordsize and write a 33MiB file, will it:
A: create two 16MiB blocks and one 1MiB block?
B: create three 16MiB chunks and compress the last block down to roughly 1MiB, because it is mostly only zeros?

Or, another way to ask: is the recordsize variable within a single file?
We know that it is variable across multiple files, since it is only a max value, right?

Thanks, that is a pretty good summary.
So same size for the file.

Think of “recordsize” as the policy.

Think of “block size” as honoring that policy for files that are equal to or larger than the recordsize. (There are exceptions for files that are smaller than the recordsize policy: their single block will be sized to the nearest power-of-two block size.)

Then there’s the “block on disk/ARC”, which is where compression can yield not only different physical sizes, but can also make the “padding” of zeroes in a file’s last block vanish.

It is the “block on disk/ARC” that can vary. In our above example, the third block is still technically 16 MiB, yet due to inline compression, it only consumes 1 MiB of space on the disk or ARC.
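
To make that mental model concrete, here is a toy sketch of the “policy” as described above (this is not how ZFS actually sizes blocks, just the description in this post turned into Python):

```python
# Toy model of the "recordsize is a policy" description above.
# Not how ZFS actually allocates; just this post's mental model in Python.
MIB = 2**20

def logical_blocks(file_size: int, recordsize: int = 16 * MIB) -> list[int]:
    """Return the logical block sizes for a file under the policy."""
    if file_size < recordsize:
        # Small-file exception from above: one block, rounded up to a power of two.
        block = 1
        while block < file_size:
            block *= 2
        return [block]
    full, tail = divmod(file_size, recordsize)
    return [recordsize] * full + ([recordsize] if tail else [])

print([b // MIB for b in logical_blocks(33 * MIB)])  # [16, 16, 16] logically...
# ...but the third block is mostly zero padding, so on disk/ARC it
# compresses down to roughly 1 MiB.
```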


However, when an application needs data from this block, it consumes 16 MiB of non-ARC RAM for the software’s working memory. While the ARC in RAM holds a compressed and encrypted version of this block, an application’s non-ARC working memory cannot understand ZFS encryption or compression.

I’m not sure if there’s more sophistication, such as an application requests “give me the last 1 MiB of the file”: will ZFS temporarily decrypt and decompress this block, only to access and serve the last 1 MiB of the file, and then discard the entire 16 MiB block of mostly “zeroes”, since it’s no longer needed? (Because the application has its own 1 MiB of requested data to work with? There is no longer a full 16 MiB’s worth of data in non-ARC RAM?)
