SMB performance of all-flash setup (1 Gbit/s?)

Hi,

In short, this is my situation with my prosumer-ish setup: RAIDZ1 on SCALE, 7 wide, 1.82 TiB (all flash, SATA, no dedicated HBA).

  • Writing to SMB share: ~1.1 Gbit/s
  • Reading from SMB share: ~2.4 Gbit/s
  • Interface capability on both ends, connected directly: 2.5 Gbit/s

Expectation: SMB write speed closer to interface capability

Read/write speeds show basically no variation between tests or during a test; it looks like a hard cap from start to finish.

iperf3 with TrueNAS as server, my PC as client:

Accepted connection from 192.168.0.98, port 58595
[  5] local 192.168.0.99 port 5201 connected to 192.168.0.98 port 58596
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   138 MBytes  1.16 Gbits/sec                  
[  5]   1.00-2.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   2.00-3.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   3.00-4.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   4.00-5.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   5.00-6.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   6.00-7.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   7.00-8.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   8.00-9.00   sec   139 MBytes  1.17 Gbits/sec                  
[  5]   9.00-10.00  sec   139 MBytes  1.17 Gbits/sec                  
[  5]  10.00-10.02  sec  2.99 MBytes  1.15 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.02  sec  1.36 GBytes  1.17 Gbits/sec                  receiver

iperf3 with my PC as server, TrueNAS as client:

Connecting to host 192.168.0.98, port 5201
[  5] local 192.168.0.99 port 58808 connected to 192.168.0.98 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   264 MBytes  2.21 Gbits/sec  145    503 KBytes       
[  5]   1.00-2.00   sec   284 MBytes  2.38 Gbits/sec    0    716 KBytes       
[  5]   2.00-3.00   sec   282 MBytes  2.37 Gbits/sec    0    776 KBytes       
[  5]   3.00-4.00   sec   282 MBytes  2.37 Gbits/sec    0    787 KBytes       
[  5]   4.00-5.00   sec   284 MBytes  2.38 Gbits/sec    0    790 KBytes       
[  5]   5.00-6.00   sec   282 MBytes  2.37 Gbits/sec    0    793 KBytes       
[  5]   6.00-7.00   sec   282 MBytes  2.37 Gbits/sec    0    797 KBytes       
[  5]   7.00-8.00   sec   284 MBytes  2.38 Gbits/sec    0    801 KBytes       
[  5]   8.00-9.00   sec   282 MBytes  2.37 Gbits/sec    0    803 KBytes       
[  5]   9.00-10.00  sec   282 MBytes  2.37 Gbits/sec    0    810 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.74 GBytes  2.36 Gbits/sec  145             sender
[  5]   0.00-10.00  sec  2.74 GBytes  2.35 Gbits/sec                  receiver

iperf Done.

This matches my experience when doing regular file transfers.
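
(For reference: these are single-stream runs with iperf3 defaults. Roughly the following, run from the client, covers both directions and parallel streams without swapping roles; -R reverses the direction and -P adds streams, which helps tell a per-stream cap from a link-level one. IPs as in the outputs above.)

# on the TrueNAS box: start a listener
iperf3 -s

# from the client: client -> server, the direction that matches SMB writes
iperf3 -c 192.168.0.99 -t 30

# same direction with 4 parallel streams, to check for a per-stream cap
iperf3 -c 192.168.0.99 -t 30 -P 4

# reversed: server -> client, the direction that matches SMB reads
iperf3 -c 192.168.0.99 -t 30 -R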

I ditched all hardware between server and client for testing, but adding my switch and the server's Intel X520 with 2x SFP+ via LACP changes nothing (I disconnected all other network cabling from the server).

Copying within the server (from boot pool to storage pool, for example) seems fine, I guess (probably skewed by CPU compression? anyway…):

copy:

admin@truenas[~]$ time ( cp /home/admin/test /mnt/sata-ssd-01/mix/test3 ; sync )
( cp /home/admin/test /mnt/sata-ssd-01/mix/test3; sync; )  0.00s user 4.24s system 37% cpu 11.162 total
admin@truenas[~]$ ls -lh /home/admin/test
-rw-r--r-- 1 admin admin 9.8G Jun 18 23:48 /home/admin/test

fio (no idea what I'm doing - any help is appreciated):

admin@truenas[/mnt/sata-ssd-01/mix]$ fio --filename=testthrough --direct=1 --rw=randrw --randrepeat=0 --rwmixread=100 --iodepth=128 --numjobs=12 --runtime=60 --group_reporting --name=4ktest --ioengine=psync --size=4G --bs=1MB
4ktest: (g=0): rw=randrw, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=128
...
fio-3.33
Starting 12 processes
4ktest: Laying out IO file (1 file / 4096MiB)
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
Jobs: 12 (f=12): [r(12)][-.-%][r=21.9GiB/s][r=22.5k IOPS][eta 00m:00s]
4ktest: (groupid=0, jobs=12): err= 0: pid=57402: Sat Jun 22 13:54:40 2024
  read: IOPS=22.5k, BW=22.0GiB/s (23.6GB/s)(48.0GiB/2184msec)
    clat (usec): min=66, max=20228, avg=522.09, stdev=381.78
     lat (usec): min=66, max=20228, avg=522.18, stdev=381.78
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  289], 10.00th=[  318], 20.00th=[  334],
     | 30.00th=[  351], 40.00th=[  388], 50.00th=[  594], 60.00th=[  619],
     | 70.00th=[  627], 80.00th=[  635], 90.00th=[  652], 95.00th=[  668],
     | 99.00th=[  807], 99.50th=[ 2409], 99.90th=[ 5997], 99.95th=[ 8291],
     | 99.99th=[ 9634]
   bw (  MiB/s): min=21502, max=23978, per=100.00%, avg=22612.05, stdev=90.23, samples=48
   iops        : min=21501, max=23975, avg=22610.00, stdev=90.19, samples=48
  lat (usec)   : 100=0.05%, 250=3.45%, 500=41.31%, 750=54.01%, 1000=0.35%
  lat (msec)   : 2=0.30%, 4=0.15%, 10=0.39%, 20=0.01%, 50=0.01%
  cpu          : usr=1.03%, sys=93.71%, ctx=1708, majf=4, minf=111
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=49152,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=22.0GiB/s (23.6GB/s), 22.0GiB/s-22.0GiB/s (23.6GB/s-23.6GB/s), io=48.0GiB (51.5GB), run=2184-2184msec
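
(In case my fio flags are off: randrw with rwmixread=100 is effectively a pure random-read test, psync caps the queue depth at 1 anyway, and the 22 GiB/s above is presumably mostly ARC. A rough sketch of sequential 1M write/read jobs that should be closer to a large SMB copy - the path and size are just what I would use on my pool:)

# sequential 1M writes, flushed at the end so the result includes the final sync
fio --name=seqwrite --directory=/mnt/sata-ssd-01/mix --rw=write --bs=1M --size=8G --numjobs=1 --ioengine=psync --end_fsync=1 --group_reporting

# sequential 1M reads (fio lays out its own test file first if needed)
fio --name=seqread --directory=/mnt/sata-ssd-01/mix --rw=read --bs=1M --size=8G --numjobs=1 --ioengine=psync --group_reporting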

I went through the Microsoft recommendations on SMB performance and (excluding SMB Multichannel, which I cannot enable with my client hardware as far as I can tell) none of those suggestions changed anything.
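
(For completeness, the negotiated SMB dialect, signing, and encryption can be checked on the server side, since forced signing or encryption costs real throughput. Roughly, from the TrueNAS shell - the output layout varies a bit between Samba versions:)

# list active SMB sessions with the negotiated protocol version, signing and encryption
sudo smbstatus

# dump the effective Samba config and look for anything forcing signing or encryption
testparm -s 2>/dev/null | grep -Ei 'signing|encrypt'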

I wondered if the onboard SATA controller is just not up to the task, since I do not use an HBA and there have been worrying reports on Reddit (at a much higher cap, though), but internal copies seem fine.

I wondered if any of the network hardware is just garbage, but I think I ruled that out on the server side by testing all of them. I do not know what to do about the probably less-than-ideal Realtek chip on the client side.

It does not look like a caching issue to me: test files were ~10 GB, free RAM is available, the disks should be faster than 1 Gbit/s on both sides, and the transfer speed is nearly constant.

It does not really look like a network issue to me either, since reads seem fine.

Questions:

  1. Is my expectation just unrealistic? Am I wrong to expect >2 Gbit/s write performance? Is this just ZFS being ZFS (i.e. not optimized for performance)?
  2. How can I find out if this is a network/client/server/SMB/ZFS issue?

I am feeling a little bit lost. Any help or directions would be appreciated!

Hardware and TrueNAS config:

Client: 
Intel Core i5-13600K
ASUS TUF GAMING B760M-PLUS D4
2x 16 GB DDR4 RAM
WD Red SN700 4 TB

Server:
TrueNAS SCALE Dragonfish-24.04.1.1

Intel Core i5-12500
ASUS Pro WS W680-ACE
2x 32 GB ECC RAM
Samsung SSD 980 PRO 500GB
8x WD_Red_SA500_2.5_2TB

Config:
- Storage:
Data VDEVs
1 x RAIDZ1 | 7 wide | 1.82 TiB
Metadata VDEVs
VDEVs not assigned
Log VDEVs
VDEVs not assigned
Cache VDEVs
VDEVs not assigned
Spare VDEVs
1 x 1.82 TiB
Dedup VDEVs
VDEVs not assigned
- 1 dataset with a child SMB share (Dataset preset: Generic, child dataset preset: SMB, Purpose: default share parameters)
- Network:
--- testing: 2.5 Gbit/s interface of server connected directly to 2.5 Gbit/s interface of client, static IPs
--- production: switch TP-Link SG3210X (2x SFP+ 10 Gbit/s, 8x RJ45 2.5 Gbit/s) connected to Intel X520 on server (LACP)

One factor might be the record size of your dataset.

Adjust the record size to match the use case. 1M with Zstd turned on (thank you, @winnielinnie) is a sweet spot for large image files, videos, and archives. The default 128k works better for smaller files, databases, etc.

I had the write speeds on my rust pool jump from about 400 MB/s to 800 MB/s for large files thanks to a record-size increase combined with an sVDEV upgrade. You're already all-flash, so an sVDEV is unlikely to help, but the record size may be hindering throughput.

So I'd create a new test dataset with a larger record size and see if that helps. It's buried in the Storage → Pool UI menu: triple dots to the right of your dataset listing in the pool.
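
(If you prefer the shell, the equivalent is roughly the following - dataset names are just examples, and a recordsize change only affects blocks written after the change. TrueNAS would rather you do this through the UI so the middleware knows about the dataset, hence the triple-dot menu.)

# new test dataset with 1M records and zstd
sudo zfs create -o recordsize=1M -o compression=zstd sata-ssd-01/rs-test

# or check / adjust an existing dataset
zfs get recordsize,compression sata-ssd-01/mix
sudo zfs set recordsize=1M sata-ssd-01/mix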

iperf bypasses ZFS and the storage devices. So your client is only able (at best) to transfer at 1.16 Gbps to the server.

This matches your "SMB write" observations.


Thank you so much for the response!

1M with Zstd turned on (thank you, @winnielinnie) is a sweet spot for large image files, videos, and archives. The default 128k works better for smaller files, databases, etc.

That is good to know!

It did not change anything, though, which makes sense given what was said about iperf in this topic. I will keep it in mind.

Thank you so much!

I somehow suspected that might be the case (after reading the man page).

So this is probably network-related. Since I can rule out the switch and any specific port on the server, I tested with my sturdiest cable (Cat 6a S/FTP, according to the marketing) with no difference. So the prime suspect right now is the Realtek chip on my client.

My next step would be to get a better NIC and test with that.

Would that make sense? Or do you know of any network configuration on the server or client that could show symptoms like this? The difference between read and write especially leaves me puzzled.
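
(In the meantime, something like the following can at least rule out obvious error counters or offload weirdness on the server's 2.5 GbE port - the interface name is a placeholder, ip -br link shows the real one:)

# negotiated speed and duplex
ethtool enp5s0 | grep -E 'Speed|Duplex'

# error/drop counters; re-run after a transfer and compare
ethtool -S enp5s0 | grep -Ei 'err|drop'

# offload features (TSO/GSO/GRO and friends)
ethtool -k enp5s0 | grep -Ei 'offload|segmentation'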


There are a lot of folk here who have an allergic reaction when they see 'Realtek' and 'Ethernet' in proximity. I suggest you try something known good like an Intel chipset SFP+ PCIe card or Thunderbolt interface.


I don't think this is true. As I understand it, record size is a max value, not a fixed value. So even if you set your record size to 1M and only have 128k files, it will behave the same as setting it to 128k, since all records would be 128k in size.

Yes, but databases that experience heavy "in-place" writes and modifications will suffer "write amplification" due to the CoW nature of ZFS.

So you are 100% correct about the files themselves.

Where the recordsize really matters is in the in-place write/modification pattern of the dataset.

A dataset that contains many small files (under 128K in size) that are mostly "write-once" will essentially yield the same results whether the recordsize is 1M, 512K, or 128K.

But things change depending on the level of in-place modifications.


My opinion: the default recordsize for a ZFS dataset under TrueNAS should be 1 MiB, if we assume that the majority of its use is as backup storage. (Write once, read many.)

At least for home users, I cannot imagine that most of them do in-place modifications to files stored on their TrueNAS server. And even if they did, many software applications employ their own "copy-on-write", rather than use true "in-place" modifications, which works against ZFS.


What is the difference for a DB writing to a dataset (is that a thing? Seriously asking) if the dataset recordsize is 128k or 1M?

I presume that if one part of the record content is modified, the whole thing has to be re-written, checksummed, metadata'd?

Thus, if you have many small files that are constantly changing, it makes sense to crank down the record size to minimize the writing / checksumming / etc. that has to happen as contents of each record change, though at the cost of far more metadata being needed / written, etc.

The much faster write performance I'm enjoying with large files today can be directly traced back to larger record sizes and very fast metadata (due to sVDEV).


DB software can write as small as a 4K page.

Imagine for every 4K modification to the DB file, ZFS has to read (and re-write) an entire 128K (or even 1M) block… just to change 4K worth of data.

EDIT: To be clear, none of the above (in regards to DB software) applies to most home users of TrueNAS, who are saving all types of files to be archived and backed up. (In such a case, yes, a 4K text file will be written as a single 4K block, even if the dataset has a recordsize of 1M.)
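
(Which is why, if someone really is hosting a database on a dataset, the usual advice is a small recordsize matched to the DB page size - a sketch, with pool/dataset names and the 16K value as examples; PostgreSQL uses 8K pages, InnoDB 16K:)

# example dataset for a database, recordsize matched to the DB page size
sudo zfs create -o recordsize=16K -o compression=lz4 tank/db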


Going off topic. I think the OP's issue lies in the Realtek onboard NIC. (Or maybe its driver under Linux?)

What is the client OS, @Kiwi ?


Going off topic.

Well, I don't mind if nobody else does. :slight_smile: Interesting read.

What is the client OS, @Kiwi ?

Windows 10.

Since all my Linux boxes only offer 1 Gbit/s, I will try a Live Ubuntu on the Windows machine tomorrow just for science.

By now I am inclined to believe this is completely client-side. For some reason, recent write tests dropped by around 30 MB/s to just under 1 Gbit/s - as before, very consistent, repeatable, and basically without variation during an iperf with default settings or a real-world copy. After a reboot of the client I am now consistently back to the values in the first post.

I will probably order an Intel X550-T2. (I have read the 10 Gig Networking Primer as well as the concerns therein and still want to try.)

This might not have anything to do with TrueNAS after all…
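
(Rough plan for the live-Ubuntu session - share name and credentials below are placeholders: repeat the iperf3 tests from Linux, then mount the share over SMB and push/pull a large file. Zeros compress to nothing on the ZFS side but still cross the wire in full over SMB, and the client page cache gets dropped before the read-back so it cannot serve the file locally.)

sudo apt install -y iperf3 cifs-utils

# network only
iperf3 -c 192.168.0.99 -t 30
iperf3 -c 192.168.0.99 -t 30 -R

# SMB write, then read-back
sudo mkdir -p /mnt/smbtest
sudo mount -t cifs //192.168.0.99/mix /mnt/smbtest -o username=admin
dd if=/dev/zero of=/mnt/smbtest/writetest bs=1M count=10240 conv=fsync status=progress
echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/mnt/smbtest/writetest of=/dev/null bs=1M status=progress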

Say the database is 40 MB.

The DB updates a row… which may be a single byte.

It has to do an in-place read/modify/write, or some atomic version of that; either way, it might be issuing 4K writes, but ZFS has to operate at the record size.

So, that results in more COW. Which is fine. And then snapshots snap that change.

Databases are more like block storage, and really, if you're hosting a serious database, the dataset should be configured more like you would configure block storage, i.e. smaller record size and simpler/faster storage.

Or at a minimum, use 128K, unless your files are generally large static files.


Yeah, but "the whole thing" is always the same size (8k as an example) no matter if the recordsize is set to 128k or 1M, because recordsize is a max value, right?

Yes, but this is block storage. This only applies to block storage or a zvol, right?
On a dataset, that 4k of DB data will be 4k no matter if the recordsize is 128k or 1M, because the recordsize is just a max value, right?

To be honest, I wasn't even aware that some databases support datasets; I always assumed that all DBs are hosted on block storage.

I sound like a broken record, but isn't recordsize a max value? :smile:
Maybe I am seriously misinformed :exploding_head:

Record size is the block size before compression.

Yes. It's the max value. Technically.

Each block in a file will always be the record size, except the last block, which will be the remainder.

Of course, if the file size is less than the record size, then the first block is the last block.

"Blocks" are units of storage for datasets as well. The nomenclature can get confusing.

Think of "recordsize" as the policy for the maximum size that a block can be.

Think of a "block" as the actual data itself as seen in RAM.

Think of the "block-on-disk" as the form (and size) of the block as stored on the physical storage medium.

Some examples:

If your dataset's "recordsize" policy is 1M, and you save a non-compressible 8K file, then it will be comprised of a single block that is 8K in size. It will be 8K in RAM when used by applications, and 8K stored on disk.

If your dataset's "recordsize" policy is 1M, and you save a non-compressible 980K file, then it will be comprised of a single block that is 1M (next power-of-two) in size. It will be 1M in RAM when used by applications, and 980K stored on disk (or 1M stored on disk if you're not using any inline compression).

If your dataset's "recordsize" policy is 1M, and you save a highly-compressible 980K file, then it will be comprised of a single block that is 1M (next power-of-two) in size. It will be 1M in RAM when used by applications, and perhaps 160K stored on disk. (Again, assuming inline compression is enabled.)

If your dataset's "recordsize" policy is 1M, and you save a non-compressible 4.5M file, then it will be comprised of five blocks that are 1M each in size. All blocks will be 1M in RAM when used by applications, and 1M each stored on disk, except for the last block, which will only be 512K if you've enabled any form of inline compression.

For any scenario above, whenever you do a true in-place modification, the entire block (as seen in RAM) must be read and then re-written. Consider the last example: Modifying in-place just a few KB in the middle of the file will require reading and re-writing 1MB worth of data.

Enabling even ZLE inline compression (at minimum) will remove any "padding" at the end of a file's last block. (ZLE is insanely fast, and compresses a sequence of zeros into nothing.) Therefore, the last block of a file might only be a few KB, while the remainder is "padding of zeros" up to the 1M recordsize policy. On the storage disk itself, this block only physically consumes a few KB, since any form of inline compression makes this padding a non-issue.
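
(For anyone who wants to see this on their own pool: the dataset properties show the recordsize policy and the overall effect of compression, and zdb can list a single file's blocks with their logical and physical sizes. Dataset and file names are examples; the object number is whatever ls -i prints for the file. All of this is read-only.)

zfs get recordsize,compression,compressratio,logicalused,used sata-ssd-01/mix

# per-file: ls -i gives the object number, zdb then shows each block's lsize (logical) and psize (on disk)
ls -i /mnt/sata-ssd-01/mix/somefile
sudo zdb -ddddd sata-ssd-01/mix <object-number>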


You sure about this bit?

I thought ZFS was smart enough not to bother padding blocks in RAM.

Perhaps in the ARC, but I don't believe this is true for the memory that an application uses. I don't think it's even technologically possible (or sensible) to have working data in a compressed format, since it eventually needs to be uncompressed for the application to use it.

A request for the data to be read could be pulled from the ARC (instead of the physical drive), but the application needs the unencrypted / uncompressed data itself to work with.

EDIT: As a thought experiment, take compression out of the picture. Only consider encryption.

Let's say a 1M block of data is encrypted on the disk. Let's even say that after so many reads, it remains in the ARC for subsequent reads by applications. The application cannot work with the encrypted data. It must be decrypted (in RAM) before it can be used, regardless of whether it was pulled from the physical disk or from the ARC.

I would assume the same applies for compression.

This is true, but also, the record size does not have much to do with how the app decides to size its read/write buffer.

I mean, maybe some can work out the underlying block size, but I think apps just tend to work with a fixed buffer size.

Essentially, the file system is interrogated via fread/fwrite locally. Those functions (and their various permutations and friends) take a pointer to a buffer and are asked to read into it, or write from it, a certain number of bytes.

Those calls may be coming from an iSCSI daemon, an SMB daemon, or an NFS daemon…

OR an application.

But essentially the ZFS driver will fill that in… from the ARC…


Maybe @HoneyBadger or Matt Ahrens can chime in? :innocent:
