NIC Errors in logs

I have a Chelsio T520-SO-CR that has been working fine for about 6 years. Since moving over to TrueNAS SCALE a couple of months ago, I’ve been getting these errors in /var/log/messages every few days:

Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: event severity: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 0, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 1, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.1
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 2, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.2
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 3, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.3
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 4, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0506, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.4
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5407
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 5, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0106, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.5
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5507
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 010000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:  Error 6, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0106, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.6
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5607
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   class_code: 0c0400
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.0:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.1:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.2:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.3:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.4:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: pci 0000:17:00.5:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: pci 0000:17:00.6:    [ 0] RxErr                  (First)
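For anyone reading a dump like this, here is a rough Python sketch (not from the original post) that condenses it into one line per PCI function. It decodes the `command` value using the standard PCI command register bit layout and tallies the AER bit names from the trailing lines. The `SAMPLE` text is a few lines copied from the log above; the function names `decode_command` and `summarize` are made up for illustration. Two things stand out when it is decoded: a `status` of 0x0010 has only the "capabilities list" bit set, so the command/status pair itself flags no fault, and the only AER bit reported is RxErr (Receiver Error), which is a physical-layer correctable error on the PCIe link.

```python
import re
from collections import Counter

# Names for the PCI command register bits used here (standard config-space layout).
COMMAND_BITS = {
    0: "I/O space",
    1: "memory space",
    2: "bus master",
    6: "parity error response",
    8: "SERR# enable",
    10: "INTx disable",
}

PCI_ADDR = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"
CMD_RE = re.compile(r"\[Hardware Error\]:\s+command: (0x[0-9a-f]{4}), status: 0x[0-9a-f]{4}")
ADDR_RE = re.compile(r"\[Hardware Error\]:\s+device_id: (" + PCI_ADDR + ")")
AER_RE = re.compile(r"kernel: \S+ (" + PCI_ADDR + r"):\s+\[\s*\d+\]\s+(\w+)")

# A few lines copied from the log excerpt above.
SAMPLE = """\
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   command: 0x0506, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   device_id: 0000:17:00.4
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]:   vendor_id: 0x1425, device_id: 0x5407
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.0:    [ 0] RxErr                  (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.4:    [ 0] RxErr                  (First)
"""


def decode_command(value):
    """Return the names of the command register bits set in value."""
    return [name for bit, name in COMMAND_BITS.items() if value & (1 << bit)]


def summarize(log_text):
    """Map each PCI address to its command value and count AER bit names."""
    functions = {}
    aer_bits = Counter()
    last_command = None
    for line in log_text.splitlines():
        match = CMD_RE.search(line)
        if match:
            # In this log format the command line precedes the address line.
            last_command = int(match.group(1), 16)
            continue
        match = ADDR_RE.search(line)
        if match:
            functions[match.group(1)] = last_command
            continue
        match = AER_RE.search(line)
        if match:
            aer_bits[match.group(2)] += 1
    return functions, aer_bits


if __name__ == "__main__":
    functions, aer_bits = summarize(SAMPLE)
    for address, command in functions.items():
        print(address, hex(command), ", ".join(decode_command(command)))
    print(dict(aer_bits))
```

To run it against the real log, pass the contents of /var/log/messages to `summarize()` instead of `SAMPLE`. Since every function of the card reports the same RxErr at the same timestamp, the summary points at the shared link rather than at any single function.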

The error does seem to interrupt network activity, as it occurred once while I was doing a file transfer and the transfer stopped.

It seems to be a bit random, often occurring while there is no heavy activity. I was able to trigger it using iperf3 once, though not reliably so.

I also tried moving the NIC to another PCIe slot, but this didn’t help.

Thoughts? NIC going bad? Something happening in Scale? (Running 24.10 release)

Any thoughts? Also wanted to add that iperf3 does actually seem to be a reliable trigger if I let it run for about 10 minutes. So this issue appears pretty sensitive to load on the NIC.
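One way to put a number on "sensitive to load" is to snapshot the kernel's per-device AER counters before and after an iperf3 run. The Python sketch below is only an illustration: it assumes the kernel exposes `aer_dev_correctable` under `/sys/bus/pci/devices/<address>/` with one `Name count` pair per line, and it is an open question whether those counters advance when the firmware reports the errors first (as the APEI lines above suggest happens here). The helper names `parse_aer_stats`, `read_correctable` and `delta` are hypothetical.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'Name count' lines into a dict, skipping anything else."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def read_correctable(address, sysfs_root="/sys/bus/pci/devices"):
    """Return the correctable-error counters for one PCI function, or None.

    None means the file is missing or unreadable, for example because the
    kernel does not expose AER statistics for this device.
    """
    path = Path(sysfs_root) / address / "aer_dev_correctable"
    try:
        return parse_aer_stats(path.read_text())
    except OSError:
        return None


def delta(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


if __name__ == "__main__":
    # Address taken from the log above; adjust for your own system.
    snapshot = read_correctable("0000:17:00.0")
    if snapshot is None:
        print("no AER statistics available for this device")
    else:
        print(snapshot)
```

Take one snapshot, run iperf3 for ten minutes or so, take a second snapshot, and pass both to `delta()`. If the counters do not move even though new errors show up in /var/log/messages, counting the log entries is the fallback.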

It could be the card overheating. What kind of cooling solution do you have? Is there ample airflow going over the card?

There is a 120mm fan mounted directly over the card blowing air at it, so it really shouldn’t be overheating.

If you’ve verified that the fan spins as expected, I’m drawing a blank as to other possible causes for this.

Good luck.

@neofusion Thanks. I appreciate the idea.

Several TrueNAS enterprise systems ship with Chelsio NICs, so the likelihood of it being a driver issue is probably low. You’ve already tried a different physical slot, so I’d suggest buying a new NIC, as the errors indicate this is a hardware issue.

Thanks @NickF1227 I’m leaning towards replacing it as well.

Any opinions on the Intel E810-XXVDA2 and its compatibility with Scale? At some point I’d like to upgrade my network to 25 GbE, so I might as well start now.

I’ve not used that NIC, but I can speak positively of the Mellanox ConnectX-4.

I have replaced the Chelsio NIC with an Intel E810-XXVDA2, and so far I am unable to reproduce the errors, even after transferring a few TB via iperf3. The Intel NIC is working just fine in Scale at 10 Gb/s. I do not have a way to test 25 Gb/s speeds at this point.
