How to diagnose transfer speed

Hello all,

I’ve recently acquired two MicroServer Gen8s and upgraded them with a Xeon 1265 and 16GB of RAM (the max they can take).

I’ve created a pretty large NAS - 4 x 14TB - which I appreciate is beyond the recommended RAM-to-storage ratio. My network is only 1Gbps and they are going to be used as basic, single-user storage NAS boxes though, so I am being optimistic.

After rsync’ing my old NAS to the new one, I copied some content from an external SSD this morning and I’m seeing this little step down in incoming speed.

It seems to be consistent, that is, it stays there and won’t go back up.

The CPU on the Gen8 is running at about 10% on average. The files are coming from macOS and are mostly video files, so pretty large. I might have something else using the network, but not 200Mbit/s worth. Also, it was able to saturate the bandwidth at the beginning of the transfer.

Drives don’t seem to show a change in their behaviour. The little drop at 9:52 was probably me using the network for something else for a few minutes.

Any pointers on how to troubleshoot this sudden drop in speed?

Thanks!

What do you mean by external SSD? How was this SSD connected to TrueNAS? Or was it connected to the Mac?
What’s your pool layout for the 4 drives?

I would first use iperf3 to test the maximum possible network speed. iperf3 is available through the command line, and afaik there is also a macOS version.
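
A minimal sketch of the commands, assuming the default iperf3 port (5201) is reachable and using a placeholder for the NAS IP:

iperf3 -s                      # on the TrueNAS box: run as the server
iperf3 -c <ip-of-the-nas>      # on the Mac: measures Mac -> NAS throughput
iperf3 -c <ip-of-the-nas> -R   # optional: reverse the direction, NAS -> Mac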


Thank you for the pointer and sorry if I was not clear.

I was transferring from an external USB4 SSD connected to my Mac TO TrueNAS.

I ran iperf3 against the server, and against another identical one (just empty) which is sitting on the same switch. Both pools are RAIDZ1.

This is the server in question

This is the identical, empty server

The commands I ran were:

iperf3 -c 192.168.0.235        # single TCP stream to the NAS

iperf3 -c 192.168.0.235 -P 4   # same test with 4 parallel streams

Can you help me interpret the results?

Thanks again for the help!

Seems to be close to 900Mbps; looking good from a purely networking perspective IMO if you’re using 1 gig.

I’m confused about what the issue is: the interface graph is in gigabits per second and it is reasonably close to 1. Then you have disk I/O for sda at ~100 MiB/s, which is roughly 800Mb/s, which again is reasonably close to a gig link.

IMO you’re getting the expected speeds on your Gbps network when you factor in overhead. Are you worried that speeds dropped from >900Mbps to approx 800Mbps after 9:55pm?
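
For reference, the rough conversion behind that ~800Mb/s figure:

100 MiB/s × 8 = 800 Mibit/s ≈ 840 Mbit/s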


I assume so… it could have a bunch of reasons. Maybe at that time a lot of small files were transferred, which increased the overhead.


In each of the pictures, the top run shows speeds in the 200Mbit/s range.

So there seems to be a network limitation for this kind of transfer.

Are you worried that speeds dropped from >900Mbps to approx 800Mbps after 9:55pm?

Yes and it didn’t recover from there

Maybe at that time a lot of small files were transferred, which increased the overhead.

That was a video footage archive; 99% of the files were pretty large. And it seems that the box has been capped at 800Mbps since that time.

In each of the pictures, the top run shows speeds in the 200Mbit/s range.

The servers are in an outbuilding, and I know I’ve had issues with the network there before. Some adapters can run 1Gbps perfectly, others will jump up and down a bit.

iperf3 seems to confirm that whatever issue there is, it’s a network issue and it has nothing to do with TrueNAS.

I’ve just checked on a different box which is sitting on a “more reliable” section of my network, and speeds are more consistent there. I’ll move the TrueNAS server over and see if things change; I suppose they will.

Thanks so far!

Fair enough - that seems like much more of a pain in the ass to diagnose; it would be much easier to find and fix a dead or much slower link :frowning:

for sure.

When I wired the outbuilding some time ago I inadvertently went with a 35m outdoor-rated UTP cable, and it’s always been a bit dodgy. It works, but some adapters will run flat out at 1Gbps while others will jump up and down a bit.

A while back I took whatever device was giving me issues and tested it with a good, long cable run directly to my main switch in the house, and it liked that. Since then I’ve changed some bits and it seemed more stable.

I’d imagine the signal is borderline. I have ordered a Cat6a S/FTP cable which I hope will take care of the issue.

iperf3 is definitely the tool for diagnosing this issue - thanks for now!
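
In the meantime I can keep an eye on the NIC error counters, to see whether the marginal cable is actually producing errors (assuming the interface is eno1):

ip -s link show eno1                    # RX/TX error and drop counters
ethtool -S eno1 | grep -iE 'err|drop'   # per-driver counters, where the driver exposes them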

As confirmation that the issue is elsewhere, I started a replication today between the two boxes, which are connected to the same switch.

It’s clearly an issue with the rest of the network - those periodic drops are to be expected, I’d imagine?

But it’s clearly reaching full 1Gbit/s capacity with no hesitation.

Weird - anything interesting or out of the ordinary in the logs?

tail -n 500 /var/log/messages | less   # show the last 500 lines of the system log

Any chance of anything interesting in IPMI in terms of temps? Uhh - otherwise dunno; reboot the switch, reseat connections, retest?
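
If you only want the network-related lines, a rough filter like this might help (adjust the patterns to taste):

grep -iE 'link|eee|eth' /var/log/messages | tail -n 50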

You’re saying those periodic drops are not to be expected?

Temps are good; the boxes are in a cold-ish place. I’ve recently replaced the thermal compound everywhere, and the boxes are basically idle :slightly_smiling_face:

Here are the last few days, not much I think. Do you see anything?

Mar  2 21:00:08 Truenas-Main netdata[2791]: CONFIG: cannot load cloud config '/var/lib/netdata/cloud.d/cloud.conf'. Running with internal defaults.
Mar  2 21:28:10 Truenas-Main kernel: usb 2-1: USB disconnect, device number 2
Mar  2 22:02:47 Truenas-Main kernel: usb 3-1.3: USB disconnect, device number 3
Mar  2 22:03:15 Truenas-Main kernel: usb 3-1.3: new high-speed USB device number 4 using ehci-pci
Mar  2 22:03:15 Truenas-Main kernel: usb 3-1.3: New USB device found, idVendor=0424, idProduct=2660, bcdDevice= 8.01
Mar  2 22:03:15 Truenas-Main kernel: usb 3-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
Mar  2 22:03:15 Truenas-Main kernel: hub 3-1.3:1.0: USB hub found
Mar  2 22:03:15 Truenas-Main kernel: hub 3-1.3:1.0: 2 ports detected
Mar  3 01:53:57 Truenas-Main kernel: perf: interrupt took too long (2507 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Mar  3 03:32:19 Truenas-Main kernel: perf: interrupt took too long (3142 > 3133), lowering kernel.perf_event_max_sample_rate to 63500
Mar  3 05:49:13 Truenas-Main kernel: perf: interrupt took too long (3968 > 3927), lowering kernel.perf_event_max_sample_rate to 50250
Mar  3 15:13:02 Truenas-Main kernel: perf: interrupt took too long (4974 > 4960), lowering kernel.perf_event_max_sample_rate to 40000
Mar  5 03:19:33 Truenas-Main kernel: perf: interrupt took too long (7412 > 6217), lowering kernel.perf_event_max_sample_rate to 26750
Mar  6 15:20:56 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is down
Mar  6 15:21:06 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
Mar  6 15:21:06 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
Mar  6 15:21:06 Truenas-Main kernel: tg3 0000:03:00.0 eno1: EEE is enabled
Mar  6 15:27:17 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is down
Mar  6 15:27:26 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
Mar  6 15:27:26 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
Mar  6 15:27:26 Truenas-Main kernel: tg3 0000:03:00.0 eno1: EEE is enabled
Mar  6 15:29:45 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is down
Mar  6 15:29:54 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
Mar  6 15:29:54 Truenas-Main kernel: tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
Mar  6 15:29:54 Truenas-Main kernel: tg3 0000:03:00.0 eno1: EEE is enabled

Why is your link down for some seconds and then up again? Did you unplug and re-plug your network cable at that time? Also, depending on the NIC, EEE [1] might be an issue. It shouldn’t be, but it sometimes is. You can disable it this way:

ethtool --set-eee interface_name eee off

So in your case that would be

ethtool --set-eee eno1 eee off
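
To check the current EEE state (before and after the change), there is also the read-only counterpart:

ethtool --show-eee eno1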

  1. Energy Efficient Ethernet ↩︎


Why is your link down for some seconds and then up again? Did you unplug and re-plug your network cable at that time?

Yes that was me this afternoon testing some alternative routes.

I’ll try disabling EEE, thanks for the idea!

If it works, note that the setting might not survive a reboot. There was a post about this somewhere here…

Found it!

There you’ll find the init script they used, if you are unsure how to do that. You can ignore the rest, as they are talking about a different issue.
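
The short version, in case the link goes stale: on SCALE you can add that same command as a Post Init entry (System Settings > Advanced > Init/Shutdown Scripts, if I remember the path right), e.g.

ethtool --set-eee eno1 eee off   # run at post-init so the setting survives reboots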

Yeah, other than some USB disconnects and the link bouncing on March 6th, I can’t say I see anything crazy. It is weird that it keeps dropping so regularly, 5-7 times per 5-minute interval. Unless you have identical file sizes, I can’t really explain it.

Thanks @ProfessionalAmateur

@Fleshmauler

I thought maybe the Replication task would divide the load into big chunks. I’m not sure myself; I’ve seen network speed do that before on different systems and thought it was normal. I’ll diagnose more, but what matters is that besides those small pauses, the interfaces are running flat out at 1Gbps with no further hesitation. It’s an improvement on where we started from :slightly_smiling_face:

I’m wondering whether having HP iLO on the same main network interface could be the issue here. There is an option to have it on a dedicated port, but I chose the “shared” network instead.

Thanks for now.

I’d generally advise putting iLO on a dedicated NIC if possible, for security reasons but also for performance. Besides that, I can imagine it could have an impact, since iLO is of course creating some traffic. While this alone wouldn’t matter much, it could theoretically interrupt things or cause issues with MTU switching and so on, depending on the NIC and possibly many other factors. But I’ve never looked into it that deeply, since I always segment iLO from the storage network.

Let me know if you find anything out. Curious now.

I set iLO to its own port - which is currently unplugged - and restarted.

Same behaviour (plus some extra slowdowns, but I’d imagine that’s normal in the real world). The only difference in this replication is a new 500GB Time Machine backup, so a single large file. I’ll try disabling the power saving next.

Before disabling power saving, I ran another test with iperf3, between the two boxes this time.

The machines are wired to the same switch.

I’d say the links are healthy, but I see the same brief drops - which don’t really show up in the iperf3 output from what I can see (those are 30s segments though), so maybe I am chasing ghosts here. Though I am curious myself now.
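
To see whether those drops show up at all, I might try a longer iperf3 run with per-second reporting instead of 30s segments, something like:

iperf3 -c 192.168.0.235 -t 300 -i 1   # 5-minute run, report every second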