Do you have plenty of airflow over your HBA? These cards want and need high airflow over their heat sinks.
From the 9206-16e documentation, but I expect about the same for all their cards:
Minimum airflow:
— 100 linear feet per minute at 35 °C (95 °F) bay inlet temperature
— 150 linear feet per minute at 45 °C (113 °F) bay inlet temperature
— 200 linear feet per minute at 55 °C (131 °F) bay inlet temperature
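If you want to sanity-check a measured inlet temperature against those figures, here is a minimal Python sketch. The three data points come straight from the list above. The step-wise lookup (round up to the next listed temperature) and the name `min_airflow_lfm` are my own assumptions, not something the 9206-16e document specifies.

```python
# Thresholds copied from the 9206-16e figures quoted above:
# (bay inlet temperature in °C, minimum airflow in linear feet per minute)
AIRFLOW_TABLE = [(35, 100), (45, 150), (55, 200)]


def min_airflow_lfm(inlet_temp_c):
    """Return the minimum airflow (LFM) for a bay inlet temperature.

    Assumption: temperatures between listed points are rounded up to the
    next listed point. Returns None above the highest listed temperature.
    """
    for max_temp_c, lfm in AIRFLOW_TABLE:
        if inlet_temp_c <= max_temp_c:
            return lfm
    return None


if __name__ == "__main__":
    for temp_c in (30, 40, 55, 60):
        print(temp_c, min_airflow_lfm(temp_c))
```

A result of `None` just means the temperature is outside the range the document lists, so treat that as "too hot" rather than "no airflow needed".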
Set all of your RAM speeds/timings to “Automatic” or “Stock” speeds - this will include any manually adjusted latency/timing values.
Also consider the airflow/temperature question posed by @SmallBarky - try sudo storcli /c0 show all | grep -i temperature to see if it is reporting back, as the 9500 should be new enough to have a temperature sensor and readout.
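If you would rather log that reading periodically than run it by hand, the same command can be wrapped in a few lines of Python. This is only a sketch: it assumes `storcli` is on the PATH, that the script runs with root privileges, and that the controller is `/c0` as in the command above. The function names are made up for illustration, and no particular output format is assumed; it simply keeps the lines that mention "temperature", like `grep -i` does.

```python
import subprocess


def temperature_lines(text):
    """Keep lines mentioning 'temperature', case-insensitively (like grep -i)."""
    return [line for line in text.splitlines() if "temperature" in line.lower()]


def read_controller_temperature(controller="/c0"):
    """Run 'storcli <controller> show all' and return the matching lines."""
    result = subprocess.run(
        ["storcli", controller, "show", "all"],
        capture_output=True,
        text=True,
        check=True,
    )
    return temperature_lines(result.stdout)


if __name__ == "__main__":
    try:
        for line in read_controller_temperature():
            print(line)
    except (OSError, subprocess.CalledProcessError) as exc:
        # storcli missing, not executable, or it returned a non-zero exit code
        print("storcli did not run:", exc)
```

If nothing is printed, the controller may simply not be reporting a temperature, which is worth knowing in itself.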
That’s not the issue. The issue is the system locking up. If your memory is not at 100% health (as @HoneyBadger said, just 1 error is too many) then your system can behave in random and mysterious ways, especially if a lot of memory is being used, since it increases the likelihood of hitting a faulty area in the bad RAM.
If the BIOS settings do not resolve this, I would figure out which of the 4 sticks are bad and replace them.
You should be able to pass multiple memtest runs with 0 errors before booting back into TrueNAS. Having bad RAM can put new data at risk.
No errors during tests 8 and 9 in memtest after 4 passes…
System is booting as normal…
Restarting a file transfer > 150 GB…
aaand…
it’s working without a max limit in the settings
So the RAM is healthy and it’s working.
So stupid
But thank you guys!
I had to reset the BIOS settings and set all to “Auto”. The BIOS detects the RAM now as 3600 as it should be.
Never checked the settings before because it was “stock”
He would still have had issues with 4x sticks of ECC memory with the BIOS manually set to 5200.
When you have 4 sticks in these systems, they need to run at lower frequencies to maintain signal integrity… if you want full-speed memory with ECC, you need to go to big boy EPYC CPUs with registered ECC… and I don’t mean the 4004/4005 series, which are still little boy EPYC CPUs.
However, I’m still an advocate for ECC memory, even for home server systems you want to run 24/7…
Just a shame it’s sooo hard to source affordable DDR5 UDIMM ECC here in Australia, let alone the latest server-grade hardware.
I have seen bit flips reported in my IPMI logs in the past, hence why I’m an advocate.
Any DDR5 system has to downgrade speed when running at 2DPC (two DIMMs per channel).
The benefit of ECC in this case, besides a BIOS that is hopefully NOT designed towards overclocking, is that it would spam the logs with warnings and you’d know right away where the problem is.
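For anyone who wants to check those ECC counters directly instead of waiting for log spam, here is a rough Python sketch. It assumes a Linux-based system with the kernel EDAC driver loaded, which exposes `ce_count` (corrected errors) and `ue_count` (uncorrected errors) per memory controller under `/sys/devices/system/edac/mc`. Whether your platform populates these depends on the board and driver, so treat an empty result as "not reported" rather than "no errors". The function name is just for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from Linux EDAC sysfs."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Any non-zero corrected count that keeps climbing is the early warning you don't get with non-ECC sticks.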