I’m replicating from my TrueNAS to a TrueNAS server running at ServaRica in Canada.
I’m averaging well under 20MB/sec of throughput because the Replication task uses a single stream, so it takes over a day per TB to back up, even though I’m paying for many times that bandwidth. There doesn’t appear to be any option to increase the number of streams.
This seems like a huge lost opportunity for some impressive performance gains, especially if you have to restore from an offsite backup quickly: you REALLY want to get a lot of streams going in parallel.
Why is there no option to increase the number of streams or specify a bandwidth cap? I’m backing up many datasets so there should be no reason that the restore can’t be one stream per dataset, for example.
Is this just a priority feature issue, or am I missing something?
I didn’t see a control for # of streams in the configuration.
Also, there isn’t a control for conflict resolution either. If I didn’t set the destination to read-only and I modify the backup dataset, does replication keep the change, overwrite it, or ask? There doesn’t seem to be an option to set the behavior to keep, overwrite, or ask. So what does it do?
It overwrites it. The result of replication is that the destination dataset is an exact duplicate of the source dataset as of the time of the snapshot that was replicated. It doesn’t merge, and it doesn’t ask.
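If you want to make sure you never hit that case, you can mark the destination read-only yourself at the ZFS level. A minimal sketch (dataset and host names are placeholders):

```
# Guard against accidental edits on the backup side:
zfs set readonly=on tank/backup/mydata

# A forced receive behaves the same way: any local changes on the
# destination are rolled back to the last common snapshot before the
# new incremental snapshot is applied.
zfs send -i tank/mydata@snap1 tank/mydata@snap2 | \
    ssh backup-host zfs receive -F tank/backup/mydata
```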
As to your main question, I don’t believe it’s possible to send ZFS replication over multiple streams simultaneously. That isn’t a limit of TrueNAS; it’s a limit of ZFS. As I understand it, at least.
Yeah. Long shot, but it is an easy thing to test, and it certainly makes a difference on my backup system, which doesn’t support AES-NI (though the VPN does): the difference between 20MB/s and gigabit.
I find SSH tends to saturate a TCP connection given enough data and decent AES support.
The iperf tests would be the most interesting thing.
Should be plenty of data for the TCP stream to open its window and reach good speed.
I use iperf3 on both my TrueNAS servers (main and backup).
The limitation is simply how fast you can push bits down a single TCP/IP stream between remote sites. That per-stream limit is around 20MB/sec if the sites are far apart (like in another country), so the number of streams becomes the limiting factor for total throughput. iperf3 proves that. So does wget. So does scp. All show the same speeds between sites.
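For anyone who wants to repeat that measurement, the single-stream test is trivial to run (hostname is a placeholder):

```
# On the remote box, start the server:
iperf3 -s

# From the local box, run a single-stream test for 30 seconds:
iperf3 -c backup.example.com -t 30
```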
So when you do a ZFS replication to another site, isn’t each snapshot sent independently, in ANY order? Each snap could then go in a separate stream, which would speed things up by a factor of 5 or more. That’s quite huge when you are restoring data and need it fast.
My question is: was there some reason they use ONE stream and send the snapshots serially vs. sending all the snaps in parallel?
It just seems iX left a lot on the table here in terms of potential speed.
Here are the iperf3 results between my sites for one stream vs. five streams. Using -R gave the same numbers. The speed is quite variable from second to second.
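For reference, the multi-stream and reverse runs look something like this (hostname is a placeholder):

```
# Five parallel streams for 30 seconds:
iperf3 -c backup.example.com -P 5 -t 30

# Same test in reverse (server sends to client):
iperf3 -c backup.example.com -P 5 -t 30 -R
```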
Thanks for the overwrite clarification. That’s what I thought; the AI chatbots got it wrong.
As for the ZFS limitation of a single stream, that makes no sense to me. Each dataset is independent and can be restored independently. Is there some sort of “global lock” preventing more than one dataset from being modified at a time? That seems really counter-intuitive: it would imply you can only write to one dataset at a time.
So I’m skeptical it is a ZFS limitation. Am I missing something?
I believe with ZFS you can send separate streams (to/from separate datasets).
However, if you’re issuing a recursive replication, then it’s treated as a single stream, even though multiple datasets are involved. (A single “task”.)
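If you wanted to parallelize by hand, one stream per dataset, a rough sketch looks like this (pool, dataset, host, and snapshot names are placeholders; error handling omitted):

```
#!/bin/sh
# Send each dataset in its own stream, all in parallel.
# Assumes matching @base snapshots already exist on the destination.
for ds in data1 data2 data3; do
    zfs send -i "tank/${ds}@base" "tank/${ds}@today" | \
        ssh backup-host zfs receive -F "tank/backup/${ds}" &
done
wait   # block until every background send finishes
```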
I use ServaRica and mine sends much faster (I don’t have an exact number, but that’s slow!); my nightly replication runs about 4 minutes. Not sure about other countries, though. I would check the RTT. The TCP window size could be adjusted if need be to improve on that speed.
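A quick back-of-the-envelope shows why the window matters; the 100 ms RTT here is just an assumed figure for a long-haul link:

```
# Measure the actual RTT first (hostname is a placeholder):
ping -c 10 backup.example.com

# Bandwidth-delay product = bandwidth x RTT
# 1 Gbps = 125 MB/s; at 100 ms RTT:
#   125 MB/s x 0.1 s = 12.5 MB of data must be in flight
# If the TCP window tops out at ~2 MB, a single stream is capped near
#   2 MB / 0.1 s = 20 MB/s  -- which matches the numbers in this thread.
```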
It is a ZFS limitation. You’d need to talk to the ZFS folks if you want to understand the details and why. That said, I don’t think it’s your issue here. That speed is slow, too slow.
Both have 1Gbps or higher internet connections and are not loaded. I think you’ll rarely get above 20 to 30 MB/sec and it will vary from second to second.
Also FWIW, variability between different iperf3 streams when running them in parallel is not unexpected, particularly when there is a network bottleneck of some kind.
Not necessarily. Egressing and going through the internet and all of its magical pipes is its own separate thing. I wanted you to do that to test for bufferbloat, not for absolute ping times.
Depending on where the bottleneck is, that simple test can reveal a bottleneck that manifests under congestion: as load increases, ping times can grow much higher. You don’t have this problem.
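To be concrete, the bufferbloat check is just watching latency while the link is saturated (hostname is a placeholder):

```
# Terminal 1: saturate the link for 60 seconds
iperf3 -c backup.example.com -t 60

# Terminal 2: watch RTT at the same time; if ping times climb sharply
# under load compared to an idle link, the path is buffering excessively
ping backup.example.com
```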
3.3X faster now after tweaking lots of TCP networking parameters on Linux; no other change to anything else. I guess there is a lot of room for improvement in this area.
You’re not wrong, but you really are just solving a software bottleneck that’s being held up by single-core performance. If you don’t mind my asking, which kernel parameters did you tune in the TCP stack?
I will have to go back and look; I just kept reading articles and trying things. But I’m nearly saturating the link now, which shows the Linux kernel defaults have huge room for improvement.
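The usual suspects for a long fat pipe look something like this (illustrative values only; I’d have to dig up my exact list):

```
# /etc/sysctl.d/99-tcp-tuning.conf -- typical long-fat-pipe tunables
# (illustrative values; not confirmed to be the exact set changed here)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_congestion_control = bbr

# Apply with:
#   sysctl --system
```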