Benchmarking different compression/checksum algorithms prior to pool creation?

I’m about to recreate my pool on empty disks, and send/receive the original data across.

But before doing that, I’d like to test CPU time for various compression and checksum options on my specific hardware using a temp pool on the empty disks, rather than picking based on hearsay/guesswork.

That means testing the usual - read vs write, small vs large vs mixed blocks, 0 - 50 - 100% compressible (but not just zeros), and preventing a warm ARC or existing structures from polluting the results. (The CPU has AVX2 but not SHA extensions.)
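For the test data itself, this is roughly what I had in mind - just a sketch, and the paths/sizes are placeholders (source data deliberately kept off the test pool):

```sh
mkdir -p /tmp/bench

# ~0% compressible: pure random
dd if=/dev/urandom of=/tmp/bench/incompressible.bin bs=1M count=1024

# highly compressible, but not just zeros: a repeating text pattern
yes "highly repetitive filler line for the compressible test file" \
  | head -c 1G > /tmp/bench/compressible.bin

# roughly 50% compressible: alternate 1M of random with 1M of pattern
for i in $(seq 512); do
  dd if=/dev/urandom bs=1M count=1 status=none
  yes "filler" | head -c 1M
done > /tmp/bench/mixed.bin
```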

I can’t find a resource for doing this, but I’m sure it’s a question many have asked and one that some serious enthusiasts and users devote time to.

Are there any scripts or test methods available to an ordinary SCALE user which do this in one neat package - an already-existing script, perhaps? What do others do?

Alternatively, if I need to do it from scratch, how should I approach it so I time what I actually want to time, and don’t get misled by ZFS shortcutting things, efficiency pathways, non-algorithm time etc? Is it a huge job?

Instrumentation options are valid

While I’m not looking for extreme solutions, I’m not averse to narrowly instrumenting the R/W pipeline with perf or bpftrace either, if that’s honestly, in others’ view, the easiest way to measure exactly what I’m after (or close to it). I had to do that on CORE to optimise it cleanly, too. But I don’t know this system, so I’d need some outline help on what/how to instrument, the key lines of code to base it on, and what to look for in the output, if that’s what’s needed.
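To be concrete, something along these lines is what I mean - with the caveat that the exact symbol names are my guess and would need to be confirmed against the loaded zfs module first:

```sh
# list probe-able compress/checksum symbols; which of these exist
# depends on the kernel/zfs build on SCALE
bpftrace -l 'kprobe:*lz4*'
bpftrace -l 'kprobe:*zstd*'
bpftrace -l 'kprobe:*fletcher*'

# then time one of them per call; the symbol below is an assumption
# and needs to match whatever the listing above actually shows
bpftrace -e '
kprobe:lz4_compress_zfs            { @start[tid] = nsecs; }
kretprobe:lz4_compress_zfs /@start[tid]/ {
    @lz4_call_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
```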

System/pool info

If relevant, the pool was 40 TB full size, 14 TB deduped on CORE, and the system was designed for dedup workload and I/O: 2015 era Xeon/Supermicro, huge RAM (256 GB) with no other workloads/no apps or jails, many cores, fast mirrors, fast SSD special vdevs.
CPU is 8C/16T (or maybe 16C/16T?). Family 6, model 63 (Haswell/Broadwell-EP era), AVX2 but no SHA instructions, large 35 MB L3 cache, 3.2 GHz max.
It ran OK on CORE 13 U5.3, and can only run better on a clean install of SCALE 25.10 with modern parameters, careful tuning and fast dedup.
But it does mean that carefully selecting the compression, checksum and dedup algorithms before the first write is critical - more so than usual? - to reduce the risk of CPU starvation and allow efficient use.

AFAIK it really depends on your data and on how much/how fast you want to store and read it.

So I can only make assumptions based on my data. I have a dataset for Nextcloud. That data is Word stuff, combined with uncompressible data like dwg files and the like. Shared with normal users who do normal user things. I would say pretty average, like a Dropbox or OneDrive.
Changing and migrating from LZ4 to ZSTD changed the compression ratio from 1.01 to 1.1

Is that a zvol or a dataset? For compression, having a larger record size should result in better compression.
On the other hand, I could imagine that a higher record size leads to a little bit less dedup.
According to Veeam, 1MB is a sane default for optimal dedup rate.

On the other hand, if we are talking about only 40TB of data, I would seriously question if dedup is even worth it.

At the same time, since you are already using a special vdev (svdev), using that for dedup might work out well. Unfortunately I have not found much information about that setup, since it seems like everybody abandoned dedup to begin with.

IMHO the simplest way would be to just set up multiple test datasets and transfer a test amount of data to each.
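Something quick and dirty along these lines would be enough to compare ratios and wall-clock time; pool, dataset and path names here are placeholders:

```sh
# stage the sample data somewhere off the test pool first
for comp in lz4 zstd zstd-7 gzip-6; do
    zfs create -o compression="$comp" -o recordsize=1M testpool/"$comp"
    time cp -a /path/to/sample/. /mnt/testpool/"$comp"/
    zpool sync testpool                                  # flush dirty data before reading the ratio
    zfs get -H -o value compressratio testpool/"$comp"
    zfs destroy testpool/"$comp"
done
```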

It’s surprisingly hard to do benchmarking properly. Too many caches, shortcuts, and other factors. It’s a bit like IT security: if I don’t really know the subject in depth, I’m probably not competent to whip up a guess at how to test it, and absolutely shouldn’t rely on “just write data and time it” :laughing: :laughing: :laughing:

I would break it down into two smaller parts.

A: if it turns out that GZIP or ZSTD does not offer a significant compression improvement over LZ4 at a 1MB recordsize (still not sure if you use datasets or zvols), I would argue it is not worth it to begin with.

B: Dedup makes it way more complicated. Unless you are, for some strange reason, really dependent on running your 10x 2TB drives as mirrors that offer you 20TB of total storage for your 14TB of data, you might even be faster with two 20TB drives in a mirror but without the dedup :grin:

I don’t claim that I can advise you on your setup. I never used dedup in production. I can only tell you that without knowledge of your use case, your data and your hardware, nobody here will be able to help you - only make guesses, like I do.

But then again, check A first. If that turns out not to be true, we can stop there.

The only thing I could think of is TN-Bench.

I just did an exercise where I moved each of my datasets from LZ4 to ZSTD and hand tuned the recordsize for each dataset.

I changed the default to be 1M from 128K. For datasets of larger files, like for example videos, I made the recordsize 4M. I went so far as having my “software” dataset have sub datasets for “ISO” vs “drivers” and setting the record to 4M for ISOs. For the TimeMachine share I set the record size to 8M. For VMs I am running on the system I set it to 64K. To be clear, I did a lot of reading about how this works and then did some quick testing to confirm understanding.
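For reference, the tuning itself is just a handful of property changes; the dataset names below are mine, and the values are simply what I settled on rather than a recommendation:

```sh
zfs set recordsize=1M  tank/software
zfs set recordsize=4M  tank/software/ISO      # >1M may need the zfs_max_recordsize module parameter raised, depending on OpenZFS version
zfs set recordsize=8M  tank/timemachine
zfs set recordsize=64K tank/vm
# note: only data written after the change uses the new recordsize
```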

Recordsize does affect compression performance.

This is a good tool to report the file sizes:

Overall the system is faster in opening the shares via SMB than it was before. The compression ratio went up on most datasets, but not dramatically; from 1.0-1.1ish to 1.1-1.25ish.

Rsync time from my system to the NAS went down greatly.

I would not use dedup. It is not worth the CPU, RAM and IO required.

This is a little bit off topic, but record size is a max value, not a static value like volblocksize. Setting it to 64k instead of 1M does not really help with anything.
You can have a 1MB record size dataset and fill it with only 4k files, and each file will simply be stored as a single 4k record.
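You can see this for yourself on a 1M-recordsize dataset; the paths here are just an example:

```sh
dd if=/dev/urandom of=/mnt/tank/test/tiny.bin bs=4k count=1
zpool sync tank
du -h /mnt/tank/test/tiny.bin     # allocation stays in the KB range, not 1M
```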

Also, VMs normally should run on a zvol (with a static volblocksize) as RAW. AFAIK the only way to put VM disks in a dataset is by using QCOW2, which you should not really do, because that gives you CoW on top of CoW.

Sorry.

The dataset that I created called VM is set to 64K. The VMs are zvols under that.

Adjusting the recordsize for datasets with larger files reduces the number of blocks, which improves performance. There are a number of articles on that. Here are 3.

Slight misunderstanding - probably due to ambiguous terminology. 90 TB of rust configured as 3 way mirrors (plus special vdevs) = data file capacity 30 TB; 40 TB of data deduped down to 14 TB stored on it. Not sure if that helps, but apologies for the ambiguity.

@LarsR - Thanks, I saw TN-Bench. Is it any good for this kind of task?

@sah - This is gold, thanks! Because yes, exactly the same situation, different datasets for different kinds of file - saved VM and system images are huge files; photos, documents, drivers, etc. are smaller. I’ll look at that script, too! I hadn’t considered different recordsizes for different datasets; I was thinking one size for all. But dedup’s a given for my use case, so it’s a case of ensuring the hardware matches the need. The links are good though!

Is sampling with perf a fair and sensible way to test how long actual compression/checksum/hashing takes at a call level, if the pool is deleted and recreated between runs, and the test is substantial enough that sampling approaches averaging?
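Concretely, I had something like the below in mind while the write test runs, then checking what share of samples land in the compression/checksum symbols - the grep patterns are my guess at the relevant symbol names:

```sh
perf record -a -g -o bench.data -- sleep 60      # run alongside the test copy
perf report -i bench.data --no-children --stdio \
    | grep -Ei 'lz4|zstd|fletcher|sha256|blake3'
```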

How detailed do you really want to be on the compression stuff? There are tons of published tests, and the final TL;DR is that ZSTD is pretty good and you should just use it, as the overhead is small and compression helps with other parts of the system - how things are read, written and cached. There is a very small difference between LZ4 and ZSTD as far as CPU usage goes, but ZSTD almost always compresses better. ZSTD came out of Facebook and is used everywhere there.

Yeah, but zvols come with the default 16K volblocksize and are block storage.
This has nothing to do with your dataset or its recordsize.

So if you use RAW and zvols, your VMs are using a 16k volblocksize, which is block storage, not files with a 64k record size.
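You can check what your existing VM disks are actually using with something like:

```sh
zfs list -t volume -o name,volblocksize      # shows every zvol and its volblocksize
```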

No worries, although this still does not help that much, since we don’t know how many drives or how many slots you have. The recommendation still stands, don’t make use of dedup. For only 40TB down to 14TB it is not worth it. Make your pool larger instead.

This is another off topic, but don’t put files into VMs!
Like already said, VMs make use of block storage.
The default volblocksize is 16k.

This has huge downsides in terms of performance and compression chances.
If you offload files from VMs into datasets you get many advantages:

  • VMs are way smaller which makes backups a lot easier
  • You can directly access files on TrueNAS, because it is not in some VM disk
  • You can because of that easily backup the files to something like S3 or another NAS
  • You can use 1M record size for your files, which helps tremendously with compression chances and with metadata overhead on things like the ARC or svdev

So instead of what you have now (and I am just making wild guesses since there is still a lot of info missing) I would use:

  • 2TB SSDs in a 3 way mirror for VMs
  • Put the current 90TB drives in a RAIDZ2. Assuming you have 9x 10TB drives, that would result in 70TB usable storage, so no need for dedup.
  • In that pool, create datasets with 1MB record size. Put your files there. Access them over SMB or NFS (rough commands sketched after this list)
  • If you need better metadata read performance, add an L2ARC single SSD.
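Purely as an illustration of that layout - disk and pool names are made up, and on TrueNAS you would normally build the pools through the UI rather than the shell:

```sh
# in practice use /dev/disk/by-id paths, not sdX names
zpool create tank raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi      # 9-wide RAIDZ2 data pool
zpool create vmpool mirror nvme0n1 nvme1n1 nvme2n1                # 3-way SSD mirror for VMs
zfs create -o recordsize=1M -o compression=zstd tank/files        # 1M-recordsize file dataset
```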

If your main concern is VMs and not storage, and you don’t like VMs accessing data over SMB or NFS, Proxmox offers Virtio-FS. That way VMs can directly access datasets.

TrueNAS is a great NAS, not so great Hypervisor. Proxmox is a great Hypervisor, not so great NAS. To get the best of both worlds, use both.

Sorry I was not being 100% detailed here. I have a dataset called VM set to 64K for some experimental QCOW stuff that I do not run long term. The VMs I run longer term are zvols under that dataset. I should have been a bit more clear in what I wrote.