Unexpectedly large downside, small upside to bs=1M?

I’m in the process of setting up TrueNAS for multipurpose use. The majority of the content will be media, though I’ll also host SMB shares (with various/unknown file sizes), several jails, apps, and potentially a VM on the same system. I know SSDs are preferred for the higher IOPS that apps and VMs need, but I’m already way over budget and will have to make do until I can build a separate SSD pool later. As a stop-gap, I got three SSDs to use for metadata, setting up a “fusion” pool to try to increase IOPS for smaller files.

CPU: Xeon E-2334
OS: NVMe
RAM: 64GB
Pool:

  • Data: (5) Exos 20TB in RaidZ2 (53.49 TiB)
  • Metadata: (3) Samsung 870 EVO in a 3-way mirror

I’ve read online that VMs, Apps, and databases should have bs=128k, while SMB and Media should have bs=1M, which will improve performance for large files at a slight cost to space (which can be partially negated through file compression). Since I’m setting up a fusion pool with special metadata, my expectation was that I would configure various datasets as follows:

| Use case | Block Size | Special Metadata Small Block Size |
|---|---|---|
| VMs, Apps | bs=128k | bs=64k |
| SMB, Media | bs=1M | bs=512k |
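
For reference, a rough sketch of how those datasets might be created. The pool and dataset names below are placeholders; “block size” maps to the ZFS recordsize property, and the special-metadata cutoff is special_small_blocks, which routes blocks at or below that size to the special (SSD) vdev:

# Placeholder pool/dataset names -- adjust to your layout
# recordsize = "block size"; special_small_blocks = cutoff for what lands on the special vdev
zfs create -o recordsize=128K -o special_small_blocks=64K tank/apps
zfs create -o recordsize=1M -o special_small_blocks=512K tank/media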

Given that this system will have both large and small files, I wanted to use fio to compare performance across the different configurations. I ran four fio tests (random reads and writes at both bs=128k and bs=1M) against each dataset configuration:

# Test random reads of smaller files (bs=128k)
fio --name=random-read128k --ioengine=posixaio --rw=randread --bs=128k --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1
# Test random reads of larger files (bs=1M)
fio --name=random-read1M --ioengine=posixaio --rw=randread --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1
# Test random writes of smaller files (bs=128k)
fio --name=random-write128 --ioengine=posixaio --rw=randwrite --bs=128k --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1
# Test random writes of larger files (bs=1M)
fio --name=random-write1M --ioengine=posixaio --rw=randwrite --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1

While I did see small performance improvements between bs=128k and bs=1M, I was surprised by the huge performance decrease for small files with bs=1M. Results are below, with best speeds in bold and second best italicized.

| Dataset Block Size | Metadata | Special Small Block Size | fio randread bs=128k | fio randread bs=1M | fio randwrite bs=128k | fio randwrite bs=1M |
|---|---|---|---|---|---|---|
| 128k | None | N/A | 6085MiB/s | 5992MiB/s | **602MiB/s** | 558MiB/s |
| 128k | Yes | 0k | *8794MiB/s* | 8349MiB/s | *575MiB/s* | 557MiB/s |
| 128k | Yes | 64k | **9791MiB/s** | 8347MiB/s | 565MiB/s | 573MiB/s |
| 1M | None | N/A | 1768MiB/s | *9025MiB/s* | 253MiB/s | 633MiB/s |
| 1M | Yes | 128k | 1973MiB/s | 8876MiB/s | 291MiB/s | **659MiB/s** |
| 1M | Yes | 512k | 1994MiB/s | **9050MiB/s** | 249MiB/s | *653MiB/s* |

It’s possible that I need to tweak my tests to get a more accurate picture of how the different pools will perform with different file sizes. But it looks to me like I’d be better off setting all datasets to bs=128k with special metadata at 64k. That’s a minor performance drop for large files, but a significant improvement for smaller ones. (The media dataset is mostly video, but will also have some small files: photos, video metadata, etc.)
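
One thing that makes me less worried about getting this wrong up front (my understanding of ZFS behaviour, worth double-checking): recordsize can be changed on a dataset at any time, but it only applies to blocks written after the change, so existing files keep their current layout until rewritten. Hypothetical dataset name below:

# Hypothetical dataset name; only newly written data picks up the new recordsize
zfs set recordsize=128K tank/media
zfs get recordsize,special_small_blocks tank/media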

Does anyone see anything I’ve done wrong (bad fio tests, or misreading data)? Any other suggestions?

Thanks for your help!
~Dean

Media would normally be read sequentially.

If you’re reading 128KB from a random offset on a dataset with a 1MB block size, ZFS has to read a full, random 1MB record just to extract that 128KB (or two 1MB records, if the read straddles a record boundary).

Likewise, an unaligned random 1MB read actually has to read 2MB to get the first and second halves.

This is going to slow down your overall read speed since you’ll be reading much more from disk.

So, if your use case is randomly reading frames from your media, then yeah, maybe don’t use 1MB block sizes.

Alternatively, try some streaming copies to benchmark your media speed.
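
If it helps, a rough sketch of sequential equivalents of the tests above (same size/runtime flags as the original random runs; adjust to taste):

# Sequential read at 1M block size, mirroring the random-test parameters
fio --name=seq-read1M --ioengine=posixaio --rw=read --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based
# Sequential write at 1M block size
fio --name=seq-write1M --ioengine=posixaio --rw=write --bs=1M --size=16g --numjobs=4 --iodepth=8 --runtime=60 --time_based --end_fsync=1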


Thanks @Stux - that’s exactly what I needed! I re-ran the tests with sequential reads/writes for both 128k/64k and 1M/512k and saw the speed improvements I was expecting!

It’s good to feel more confident in the dataset configuration(s) before I start transferring data!