I have a Chelsio T520-SO-CR that has been working fine for about 6 years. Since moving over to Scale a couple of months ago, I’ve been getting these errors in /var/log/messages every few days:
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: It has been corrected by h/w and requires no further action
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: event severity: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 0, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 1, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.1
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 2, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.2
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 3, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0102, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.3
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5007
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 4, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0506, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.4
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5407
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 020000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 5, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0106, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.5
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5507
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 010000
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: Error 6, type: corrected
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: section_type: PCIe error
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: port_type: 0, PCIe end point
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: version: 3.0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: command: 0x0106, status: 0x0010
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.6
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: slot: 0
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: secondary_bus: 0x00
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: vendor_id: 0x1425, device_id: 0x5607
Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: class_code: 0c0400
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.0: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.1: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.2: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.3: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.4: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: pci 0000:17:00.5: [ 0] RxErr (First)
Oct 30 15:52:31 nas kernel: pci 0000:17:00.6: [ 0] RxErr (First)
The error does seem to interrupt network activity: it occurred once during a file transfer, and the transfer stopped.
The timing seems fairly random, and it often occurs when there is no heavy activity. I was able to trigger it once with iperf3, but not reliably.
I also tried moving the NIC to another PCIe slot, but that didn’t help.
Thoughts? Is the NIC going bad? Is something happening in Scale? (Running the 24.10 release.)
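In case it helps pin down how often this happens and which functions are involved, here is a minimal sketch of a script that tallies the per-device error lines by timestamp. It assumes the syslog line format shown in the log above. The names `AER_LINE`, `summarize_aer`, and `SAMPLE` are made up for this example and are not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches the per-device lines at the end of each event, e.g.
# "Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.0: [ 0] RxErr (First)"
# Assumption: the log keeps this exact "timestamp host kernel: driver BDF: [n] Name" layout.
AER_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\S+\s+(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"\[\s*\d+\]\s+(?P<error>\w+)"
)


def summarize_aer(lines):
    """Group per-device error lines by timestamp: {stamp: {bdf: error_name}}."""
    events = defaultdict(dict)
    for line in lines:
        m = AER_LINE.match(line)
        if m:
            events[m["stamp"]][m["bdf"]] = m["error"]
    return dict(events)


# Three lines copied from the log above; the "[Hardware Error]" line is
# intentionally skipped by the pattern.
SAMPLE = [
    "Oct 30 15:52:31 nas kernel: {2}[Hardware Error]: device_id: 0000:17:00.0",
    "Oct 30 15:52:31 nas kernel: cxgb4 0000:17:00.0: [ 0] RxErr (First)",
    "Oct 30 15:52:31 nas kernel: pci 0000:17:00.5: [ 0] RxErr (First)",
]

for stamp, devices in summarize_aer(SAMPLE).items():
    names = ", ".join(sorted(set(devices.values())))
    print(f"{stamp}: {len(devices)} function(s) reported {names}")
```

To run it against the real log, an open file can be passed in place of `SAMPLE`, for example `summarize_aer(open("/var/log/messages", errors="replace"))`. One line of output per timestamp makes it easier to see whether events cluster around particular times or workloads.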