First off, this is a follow-up to my post on the pfSense forum, as I want to ensure I have not missed anything.
To elaborate, I am 99% sure that I have a faulty I350, but as this is my second unit after an initial RMA for the same issue, I want to rule out everything else first.
My system specs are as follows:
CPU: AMD 9700X
Mobo: MAG B850 TOMAHAWK MAX WIFI
Memory: 128GB
OS: TrueNAS SCALE 24.10.2.2
This is a virtualised pfSense router with all 4 ports of the I350 passed through. The system works flawlessly for around 6 to 9 hours, then starts throwing trillions of errors on all 4 interfaces at the same time and all connections are lost, but the VM is still up and responds on the console.
The previous card had the same issue, which makes me think it is a faulty unit, but I want to check I have not missed anything.
(I have tried uploading an image to show it but that is not allowed.)
I have swapped the card for a known working I340-T4 and that works flawlessly for weeks with no problems under the exact same config and settings.
Is anyone aware of any known issues, or something I may have missed, that could be causing this?
I have posted a few updates over on the Netgate forum for anyone who may be following this. If there are any of the images I have posted there that you feel would be relevant to include here for consistency, please let me know.
OK, so as suggested on the Netgate forum, I have removed the PCI passthrough for now and have the VM configured with virtual NICs attached directly to the host interfaces, with those interfaces simply left unconfigured on the host.
One issue I have, and this is likely more one for over here: with this approach, CARP addresses do not work via the virtual NIC, so I can’t really test much besides leaving it to just run.
I’m assuming this is due to the way CARP works. I do have Trust Guest Filters enabled on all the interfaces but that does not seem to help.
Is anyone here aware of a way I can get this working so I can at least debug this properly?
Edit: got it, I needed to add each interface as a bridge to allow it to work. Now I can test it properly.
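With the ports owned by the host instead of being passed through, their error counters should also be visible on the host side, which gives a second place to watch for the failure. Below is a minimal sketch of that idea, assuming the host is Linux and exposes the usual `/sys/class/net/<iface>/statistics` counter files. The function names, the choice of counters and any interface names you pass in are illustrative, not something from my actual setup.

```python
import time
from pathlib import Path

# Counter files assumed to exist under /sys/class/net/<iface>/statistics
# on a Linux host. Any that are missing are simply skipped.
COUNTERS = ("rx_errors", "tx_errors", "rx_crc_errors", "rx_dropped", "tx_dropped")


def read_counters(iface, base="/sys/class/net"):
    """Return {counter_name: value} for one interface."""
    stats_dir = Path(base) / iface / "statistics"
    values = {}
    for name in COUNTERS:
        path = stats_dir / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def counter_deltas(previous, current):
    """Return only the counters that increased between two samples."""
    return {
        name: current[name] - previous[name]
        for name in current
        if name in previous and current[name] > previous[name]
    }


def watch(ifaces, interval=60, samples=10, base="/sys/class/net"):
    """Print a timestamped line whenever an error counter grows."""
    last = {iface: read_counters(iface, base) for iface in ifaces}
    for _ in range(samples):
        time.sleep(interval)
        for iface in ifaces:
            now = read_counters(iface, base)
            grew = counter_deltas(last[iface], now)
            if grew:
                print(time.strftime("%Y-%m-%d %H:%M:%S"), iface, grew)
            last[iface] = now
```

Calling `watch([...])` with the host-side names of the four I350 ports and a large `samples` value would leave a timestamped record of when the counters start climbing, which can then be lined up against the pfSense logs.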
I have moved the card from a PCIe 3.0 x1 slot to a Gen 5 x16 slot just in case it’s a PCIe slot issue but, based on the above, I’m thinking it’s a second faulty NIC.
Replying so you’re not just talking to yourself.
Are you getting these cards from a good source? Two in a row sounds like a knockoff type problem. It’s just
Both cards were from the same source, with the second being a replacement after my initial RMA, so I’m thinking the company likely got them from the same location.
I have checked the cards physically and they seem legit, but you just never know with these cards.
I would post the company’s name, but they have been really good and understanding with this so I don’t think it would be fair on them. At the same time, I don’t want others going through the same!
The company themselves are refurbishers/resellers and have a pretty good reputation as a whole, but you never know where they got the original stock from.
I don’t think this is a hardware issue (put a fan blowing on the card to ensure it is not heat related).
But I do love non-electronic people giving their illogical theories!
The cheap cards use the same main components and simply save/skip on a few non-critical elements.
I have used the cheap 4 port Chinese cards and only one went bad, possibly due to the inrush current flaw, but it failed fully so I just exchanged it for another card, which never failed over the next 5-7 years!
Ensure hardware offloading is turned off.
Do high-speed transfers to see if you can make it fail on-command.
It’s currently at 4 days in the x16 slot, so it may genuinely be an incompatibility between the I350 and an x1 slot, whereas I340 cards are known to work in an x1 slot (I found this out today).
The unit is in a server case with five 80mm fans blowing at it in an air-conditioned 19 degree room. The card itself is cool and never gets warm.
I did also test a mixture of loads: completely idle besides a few pings for 12 hours or more, and constant speed tests and iperf runs at gigabit speed on at least 2 ports at a time, pushing hundreds of gigabytes through. Nothing was able to trigger it directly; it seemed to be random.
At the moment it is looking like it may be related to the way SR-IOV and the additional features on this card interact with the slot. It may simply be that it cannot negotiate down to x1 and still do everything it needs, so the firmware on the card goes haywire.
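For anyone wanting to see what the card actually negotiated in a given slot, here is a rough sketch of how that could be checked from the host. It assumes a Linux host that exposes the standard PCI sysfs attributes (`current_link_width`, `max_link_width`, `current_link_speed`, `max_link_speed`). The function names are made up for illustration, and the device address in the usage note is a placeholder, not my card's real address. A narrower link than the card's maximum is expected in an x1 slot and does not prove a fault by itself; this only confirms what the link trained to.

```python
from pathlib import Path


def read_link_status(bdf, base="/sys/bus/pci/devices"):
    """Read negotiated and maximum PCIe link parameters for one device.

    `bdf` is the PCI address as shown by lspci -D, e.g. a placeholder
    like "0000:01:00.0". The sysfs attribute names are assumed to be
    present, as they are on current Linux kernels.
    """
    dev = Path(base) / bdf

    def attr(name):
        return (dev / name).read_text().strip()

    return {
        "current_width": int(attr("current_link_width")),
        "max_width": int(attr("max_link_width")),
        "current_speed": attr("current_link_speed"),
        "max_speed": attr("max_link_speed"),
    }


def is_downtrained(status):
    """True if the link runs with fewer lanes than the device supports."""
    return status["current_width"] < status["max_width"]


def describe(status):
    """One-line human readable summary of the link state."""
    note = " (running below maximum width)" if is_downtrained(status) else ""
    return "x{} of x{} at {}{}".format(
        status["current_width"],
        status["max_width"],
        status["current_speed"],
        note,
    )
```

Running `print(describe(read_link_status("0000:01:00.0")))` with the real address of one of the I350 functions, once in each slot, would show whether the x1 slot is the only difference between the working and failing setups.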
I did not see anything when I was looking before ordering the card to suggest that this could be an issue and, to my knowledge, it should work fairly gracefully in an x1 slot, albeit bandwidth limited, but this does not seem to be the case for this model combined with an AM5 board.
For anyone interested in the exciting conclusions… it worked fine in the x16 slot for 2 weeks and is still in there now.
I put an I340-T4 in the x1 slot at the same time and left that running, and that has been perfectly fine as well.
It seems to be an incompatibility between the x1 slot and the I350 specifically, but I’m not sure why. In either case, the issue seems to be resolved.
It may be something specific to AM5 and the I350 in the x1 slot, or just the I350 and the x1 slot alone, but if anyone else for some reason tries the same, at least you know what symptoms manifest and what the cause was.