Strange networking things are afoot

Hi there,

I have a Supermicro 1U server with 8 x U.2 NVMe SSDs in 2 x RAIDZ1 | 4 wide | 6.99 TiB

The server has a 24-core AMD EPYC 7352 processor and 128 GB RAM.

It also has a 4-port 10GBASE-T card in it, with all 4 ports going to the same 24-port 10-gig switch (on 192.168.22.32, 192.168.23.32, 192.168.24.32, and 192.168.25.32, so that iSCSI round-robin works properly in VMware). Only the 192.168.22.32 address has a router associated with it, so it is the only one that can get out to the internet for patches.

I have 6 x VMware servers with a pair of 10-gig NICs in each, connected to the same 24-port 10-gig switch as the TrueNAS, which shares out a 32 TB datastore between the hosts.

This setup had worked fine for about 2 years. However, a few weeks ago I was in the datacentre doing some work and decided that the network cables were too long and might be causing airflow issues in the rack, so I went about swapping them one by one: first the 192.168.25 connection, then the .24 connection, then the .23 connection, and finally the .22 connection. All worked fine until I got to the .22 connection, where the NIC link light refused to come back on. I thought WTF and tried changing cables, ports, and switches, all to no avail. I even rebooted the TrueNAS and it didn’t help.

So, I ordered a new NIC. Sod’s law dictates that I could only get a 2-port 10-gig NIC in time for the Christmas break, so I went with that.

Tonight I went to the datacentre, powered down the TrueNAS box and inserted the new NIC. Powered everything back up and…

…the .22 NIC gives me a link light, but now the .23 NIC doesn’t.

At this point, given that I was a few patches behind current and I now had the .22 NIC back which gave me internet access, I updated TrueNAS and rebooted it. It consistently gave me the .22 NIC with a link light and working, and the .23 NIC without a link light…

…so I took the cable out of the .23 NIC and put it in one of the new NICs (the new NICs are an Intel 2-port X550-T2). The new NIC gives a link light. I do some rearranging of IP aliases on the installed NICs, and I then have 4 NICs in total that PING from my laptop AND give a link light on the switches and NICs themselves…BUT one of the NICs says “Link State Down” on the dashboard and is passing 0 bytes either way, however in the “Network” section it shows that the network link is up (but still passing 0 bytes).

Anyone know how to solve this?

Is the switch configured with VLANs?

Can you disable switch ports and bring them up one at a time… when does it break with simplest config?

No there’s no VLANs on the switch. It’s all flat.

It breaks straight away - no switch config.

Enable 1 port at a time… what’s the smallest config that fails?

Is it failing at:

Physical level - the cable/port
Mac Level - packets don’t get through switch
IP Level - no path

Ping from the NAS

Ping from the laptop is always from one network??
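To check those layers from the NAS side without touching the switch, here is a minimal Python sketch. It assumes a Linux-based TrueNAS (SCALE) where the kernel exposes per-interface state under /sys/class/net; the function name `link_report` is made up for illustration. It reads the physical carrier flag, the operational state, and the byte counters for one interface:

```python
from pathlib import Path


def link_report(iface, sysfs_root="/sys/class/net"):
    """Return carrier, operstate and byte counters for one interface.

    Assumes the Linux sysfs layout; sysfs_root is a parameter only so the
    function can be pointed at a test directory.
    """
    base = Path(sysfs_root) / iface

    def read(rel):
        # Reading 'carrier' fails on an administratively down interface,
        # and a file may simply be missing, so treat both as "unknown".
        try:
            return (base / rel).read_text().strip()
        except OSError:
            return None

    carrier = read("carrier")
    return {
        "iface": iface,
        "operstate": read("operstate"),          # e.g. "up" or "down"
        "carrier": carrier == "1" if carrier is not None else None,
        "rx_bytes": int(read("statistics/rx_bytes") or 0),
        "tx_bytes": int(read("statistics/tx_bytes") or 0),
    }
```

Calling `link_report(name)` for each name under /sys/class/net gives one line per port, so a port with carrier but zero traffic stands out from one with no carrier at all.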

You’ll need to clarify where you expect me to enable/disable 1 port (this is a production device serving 6 ESXi hosts btw). In TrueNAS? or elsewhere in hardware somehow? (this is a 4-port NIC so there’s no removing a single port from that).

As can be seen from my screenshots though, with all 4 ports enabled at the hardware level I get physical connectivity, I get MAC level, I get IP level as you can PING the interface, but TrueNAS seems to think otherwise in its internals somehow.

It looks like he meant the ports on the switch, disable all 4 and then activate one after the other to see at what point it breaks.

I’ll try that out of hours and report back since that would be a service disruption.

First to clarify the conditions… you have this, right?

  • 4 ports with IP addresses from 4 different subnets on the TrueNAS
  • all 4 ports are connected to the same 10g switch
  • the switch has no vlans configured (all uplink ports are with no tagging)

Now some questions.

Do you have access to the switch to validate the MAC/ARP tables and, if needed, to clear them?
(in such a mixed broadcast domain, the most probable issue is with the MAC table)

What type of switch are you using, and which ports are the 4 TrueNAS NICs connected to?
(some switches have port-group limitations that can be affected by the negotiation process)

What is the switch uptime?
(if it’s 2 years, touching the ports can produce “exciting” side effects)

Cheers

All these people telling you about vlans, Mac, ARP, and other higher level details that do not matter for the main issue!

Link light is completely independent from those settings!

No link light means a bad cable, a bad port, no common speed or failed auto-negotiation, or the interface being down in the OS.

Did you try the old cable that was working?

(Log in to the shell and get the output of: ip a).
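If the `ip a` output is long, a small Python sketch like the one below can condense it to one row per interface. It relies on the JSON output of iproute2 (`ip -j addr show`); the helper names and the interface names in the usage note are hypothetical, not taken from the poster’s system.

```python
import json
import subprocess


def summarize(ip_json_text):
    """Turn `ip -j addr show` JSON into (name, operstate, [ipv4/prefix]) rows."""
    rows = []
    for link in json.loads(ip_json_text):
        addrs = [
            f'{a.get("local")}/{a.get("prefixlen")}'
            for a in link.get("addr_info", [])
            if a.get("family") == "inet"      # IPv4 only, to keep rows short
        ]
        rows.append((link.get("ifname"), link.get("operstate"), addrs))
    return rows


def iface_for_ip(rows, ip):
    """Return (name, operstate) of the interface holding `ip`, or None."""
    for name, state, addrs in rows:
        if any(a.split("/")[0] == ip for a in addrs):
            return name, state
    return None


def read_live():
    """Run the real command; requires iproute2 on the machine."""
    out = subprocess.run(
        ["ip", "-j", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize(out)
```

For example, `iface_for_ip(read_live(), "192.168.23.32")` would return something like `("enp1s0f1", "UP")` (a made-up name), which can then be compared against the interface name the dashboard widget or graph is actually showing.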

It sounds like an STP/RSTP ‘port lockout’ on the switch side. Any physical port reconfiguration should cause the switch to automatically rebuild its MAC table… but that doesn’t always happen, depending on the switch software. A power cycle/reboot of the switch should be done to forcibly clear old entries in the switch MAC tables. That should resolve any STP/RSTP issues.

All the best…

Bill

For the record :wink: a missing link light can have multiple causes. A bad cable/transceiver is only one of them, and it can be identified easily if UDLD or BFD is used for advanced failure detection.

A few more possibilities to consider:

  • security violation on a protected port
  • incomplete or stuck negotiation
  • spanning tree driven port state (blocking or disabled)
  • various configuration mismatches
  • bug-related side effects (memory leaks, TCAM overload etc.)

The indicators for some of those are found in low-level details like the MAC and ARP tables, which are the logical next step once the port and the cable have already been replaced, especially when the problem does not follow the cable or the port in question (as in the described case).

yup, Bill - it could be… and it’s very possible in such a mixed broadcast domain with unclear topology

That’s why I asked whether access to the switch is available: a spanning tree issue is easy to identify with a few simple CLI commands, which depend on the switch type.

So I am going to have to get back to you on this issue as well, since I get “invalid password” when trying to log into the GUI with the known credentials, so it could benefit from a restart whatever the case. It’s a FS.com S5850-24XMG with no settings changed from out of the box. I believe (but can’t confirm of course) that this means it’s got no STP turned on.
It’s a good shout though and I appreciate you and bweinel chiming in on the switch as I would generally assume it’s okay. Especially since I have a second 8-port 10-gig switch daisychained off it and haven’t managed to get a connection to that to work either.

Actually, I just managed to log into the switch (wrong password saved in my password manager). STP is turned off globally.

I’m going to reboot the switch tonight though.

Before you “push the button” :slight_smile: it may be a good idea to collect some logs - your choice

show logging buffer
show interface {interface where TrueNAS is connected}
show interface status
show vlan brief
show interface switchport {interface where TrueNAS is connected}
show loopback-detect
show errdisable detect
show port-security address-table
show vlan-security
show port-block
show udld {interface where TrueNAS is connected}

The output can give you an idea of the issue before the reload and can serve as a baseline for comparison after the reload.

Cheers

OK I’m an idiot.

Have a look at the first of the two screenshots in my initial post, and at the interface names, and then see what’s happened.

For those of you as blind as I am, I’m graphing a different interface from the one that is showing the IP address, so no wonder it’s offline. Goddammit. All that troubleshooting for nothing. Thanks to everyone who took part though.
