A friend of mine convinced me to switch my home virtualized server from a desktop Intel platform to a server platform.
I bought an ASRock Rack ROMED6U-2L2T board and an EPYC 7303 (my build uses a Fractal Node 804 case, so I really need a microATX board).
I’m still evaluating it and have found no issues except one.
The BMC/IPMI event log gets flooded (~3-4 messages per minute) with “Thermal Trip - Asserted” events.
The event source is “Processor” but the actual sensor is “Unknown”. After digging deep into these events, I found they all come from sensor number 94h, which does not exist in the sensor list (i.e. ipmitool sdr list); that’s probably why it is shown as “Unknown”.
No alerts are triggered by the actual CPU_THERMTRIP discrete sensor (sensor number 93h). All other sensor states are green. Obviously the CPU isn’t overheating: at idle it is at 35 °C, and at 100% load it nears 50 °C. The “sensors” command (from lm-sensors) reports the same.
I tried two more CPUs (7282, 7203P) - same issue.
It looks like a BMC firmware bug to me (why would a non-existent sensor create events at all?), but it is hard to believe it would not have been fixed on a mainboard released 4 years ago. So maybe it is a M/B hardware fault?
Has anyone ever come across a similar issue?
Obviously I tried resetting everything to defaults, and even downgrading the BIOS and BMC firmware, with no improvement.
I tried to contact ASRock Rack support, but there has been no response so far, and the autoresponse advises contacting the seller first.
I could file a warranty claim with the seller, but I can only expect a refund because the three ROMED6U-2L2T boards they had sold out very quickly. And no other EPYC 7003 M/B on the market is microATX.
My concern is that this flood of messages may wear out the flash chip quickly. I could disable circular logging, but then no new messages are logged once the event log gets full (and it only takes a day to fill it).
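To put the flooding rate in numbers, here is a rough back-of-the-envelope sketch in Python. It assumes each event is stored as one 16-byte SEL record (the record size defined by the IPMI 2.0 specification); how this BMC actually persists records to flash is not known from this thread, so the byte figures say nothing certain about real flash wear. The function names are made up for illustration.

```python
# Rough estimate of how many events the flood produces per day.
# Assumption: one event == one 16-byte SEL record (IPMI 2.0 record size).
SEL_RECORD_BYTES = 16


def events_per_day(per_minute: float) -> int:
    """Number of events logged in 24 hours at a constant rate."""
    return round(per_minute * 60 * 24)


def sel_bytes_per_day(per_minute: float) -> int:
    """Raw SEL payload written per day, ignoring any flash write amplification."""
    return events_per_day(per_minute) * SEL_RECORD_BYTES


if __name__ == "__main__":
    for rate in (3, 4):  # the ~3-4 messages per minute reported above
        print(rate, events_per_day(rate), sel_bytes_per_day(rate))
```

At 3-4 events per minute this comes to 4320-5760 events per day. If the log really fills in about a day, that would suggest a capacity of a few thousand entries, but that is only an inference from the numbers in the post.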
ASRock Rack has been good to me, answering questions sometimes in less than 24 hours, sometimes in a few days.
Be very detailed with them. Also ask if there is a new BMC/BIOS, even if it is a beta version. I received a beta version of my BIOS which fixed my problem. Ten months later they released that BIOS to everyone.
It sounds like you went down the wrong path to get support; use this link and file a Support Request. They will want a lot of details, but I think you have it all.
Otherwise it sounds like you have done everything that I would do.
Good Luck!
P.S. I saw your posting on the ASRock Rack forum; sad that there were no responses.
I used exactly this link to request support. But I was not quite sure which region to use to report the issue: there were multiple choices like China, Asia (English), Asia (other), Germany, Netherlands, Brazil etc… Eventually I picked Germany (closest to me geographically), but I’m not sure it was a good choice.
I tried to be as precise as I could when describing the issue.
I only received an autoresponse saying that they would contact me shortly and recommending that I contact the dealer first for technical issues. That was a week ago and sadly there has been no response since then.
Which region did you choose when reporting your issues?
The lack of similar issue reports tells me it’s something rare, hence probably a hardware fault of the M/B. If I had to replace the M/B, all other available EPYC options are ATX M/Bs, which means a significant change to my entire setup.
I put a similar post on the STH forum but got no response there.
There is the following Micro-ATX EPYC 700x board: https://www.asrockrack.com/general/productdetail.asp?Model=ROMED8U-2T#Specifications
It is not quite the same as the ROMED6U-2L2T, as you lose 1 PCIe x16 slot and the 2 x 1Gbps Ethernet ports. But you gain 2 more memory channels & slots. One odd thing about the ROMED6U-2L2T board, though: it may be possible to have up to 17 SATA… That does not appear to be the case for the ROMED8U-2T.
Purely by luck I realized what’s happening, but I have no idea why!
Another colleague of mine asked me to check how Windows Server runs on EPYC, so I installed Windows Server 2022 on a separate SSD (next to Proxmox).
And out of pure curiosity I installed HWiNFO and checked the CPU temp sensors there.
It seems there are more CPU temp sensors than Linux/Proxmox reveals (via the lm-sensors “sensors” command) or than the BMC/IPMI reveals.
This “Thermal Trip” alert must be triggered by one of these sensors that the BMC does not disclose, called “CPU die average” (it is still able to trigger BMC/IPMI thermal events).
All of the CCD and IOD sensors fluctuate around 30-40 °C, while this “CPU die average” fluctuates around 70 °C at idle (!) and triggers these alerts whenever it goes above 70 °C.
It does not make any sense to me, especially since the cooler’s contact plate is only slightly warm to the touch (close to 30 °C, not 70). It looks like someone forgot to divide it by a factor of two.
When I set the CPU fan to 100% it stays below 70 °C at idle, but it easily hits 100 °C under 100% load.
Any idea why it may be so?
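The reason the reading looks wrong can be stated as a simple sanity check: a genuine average of a set of readings can never lie outside the range between their minimum and maximum. Below is a minimal Python sketch of that check. It assumes “CPU die average” is meant to be an average over the per-die sensors, which is not confirmed anywhere in this thread; the sample temperatures are hypothetical values picked from the 30-40 °C range described above, and the function name and tolerance are illustrative only.

```python
def average_is_plausible(average_c, component_temps_c, tolerance_c=2.0):
    """Return True if `average_c` could be an average of the component readings.

    A true mean always lies between the smallest and largest input, so a
    value outside that range (plus a small tolerance for sampling jitter)
    points to a scaling or reporting error rather than real heat.
    """
    low = min(component_temps_c) - tolerance_c
    high = max(component_temps_c) + tolerance_c
    return low <= average_c <= high


if __name__ == "__main__":
    # Hypothetical CCD/IOD readings in the 30-40 °C range seen at idle.
    ccd_and_iod = [31.0, 34.5, 38.0, 36.0]
    print(average_is_plausible(70.0, ccd_and_iod))  # False
    print(average_is_plausible(35.0, ccd_and_iod))  # True
```

A reported 70 °C fails this check against die sensors sitting at 30-40 °C, which is consistent with the suspicion of a scaling error, although it does not prove where the error is.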
I use a Noctua NH-U12S TR4-SP3 cooler (the U14 would be too tall for the Node 804) with a single push fan (no push-pull configuration yet). Initially I used Noctua thermal compound. As I said, I tested 3 different CPUs (7303, 7203P, 7282), and replacing the thermal compound on such a big IHS each time was a bit difficult and expensive, so I eventually went with a reusable Thermal Grizzly Carbonaut carbon pad (which did not make these alerts go away either).
It is hard to believe that all three CPUs were faulty.
It is also hard to believe it is an AGESA bug, because Rome and Milan use completely different AGESA builds (RomePI and MilanPI).
I wasn’t aware another mATX EPYC board from ASRock existed. Thanks!
I would probably go with this one, with its higher DIMM slot count and fewer PCIe slots, but it is/was not available for purchase here. ASRock or Supermicro ATX EPYC 7002/7003 boards are readily available though (e.g. ROMED8-2T/BCM).
It’s not so easy. I see no option to disable event logging completely. I can only switch IPMI event logging between circular and linear (= logging stops once the log space gets full).
Because that broken, hidden “CPU die average” temp sensor is not exposed in the IPMI interface, I cannot disable it (or at least change its thresholds) via IPMI CLI commands either.
It looks like a clear BMC firmware bug to me.
I Googled more, and it looks like the “average CPU die temp” sensor is broken in EPYC chips (it shows values that are far too high) and its readings should never be taken into account.
And that’s what the BMC firmware does… partially. This sensor is not exposed in the GUI or the IPMI interface, but it is continuously read by the firmware and is able to create thermal trip events once it goes over a (also hidden and unmodifiable) threshold of 70 °C.
I’ll try to open another support request with ASRock, this time in the US region.
Yes, I did. But that’s the root cause of the issue.
These events are triggered by sensor number 94h, which is neither visible anywhere in the BMC GUI nor exposed in the IPMI interface (e.g. ipmitool sdr elist).
In order to disable a sensor or change its thresholds via IPMI, the sensor must be exposed via the IPMI interface. One cannot edit sensors that do not exist in the SDR via ipmitool (unless maybe with some raw low-level commands?).
The closest number that exists in the SDR list (or GUI) is 93h (sensor CPU_THERMTRIP), but this sensor isn’t logging any events at all.
The events come from the non-existent sensor 94h, so they are displayed as Processor/Unknown in the GUI.
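On the raw-command idea: the IPMI 2.0 specification defines sensor commands under NetFn 0x04 (Sensor/Event) that take a sensor number directly, and ipmitool can send them with `ipmitool raw <netfn> <cmd> [data]`. The Python sketch below only builds the read-only command lines for sensor 94h and prints them; it does not run anything. Whether this BMC answers for a sensor that is missing from the SDR is unknown, so treat this as something to try, not a known fix. The helper name is hypothetical.

```python
import shlex

# Command codes from the IPMI 2.0 specification, Sensor Device commands.
NETFN_SENSOR_EVENT = 0x04
CMD_GET_SENSOR_THRESHOLDS = 0x27
CMD_GET_SENSOR_EVENT_ENABLE = 0x29
CMD_GET_SENSOR_READING = 0x2D


def raw_sensor_command(cmd, sensor_number):
    """Build the argv for `ipmitool raw` addressing one sensor by number."""
    if not 0 <= sensor_number <= 0xFF:
        raise ValueError("sensor number must fit in one byte")
    return [
        "ipmitool",
        "raw",
        f"0x{NETFN_SENSOR_EVENT:02x}",
        f"0x{cmd:02x}",
        f"0x{sensor_number:02x}",
    ]


if __name__ == "__main__":
    # Print (do not execute) read-only probes for the hidden sensor 94h.
    for cmd in (
        CMD_GET_SENSOR_READING,
        CMD_GET_SENSOR_THRESHOLDS,
        CMD_GET_SENSOR_EVENT_ENABLE,
    ):
        print(shlex.join(raw_sensor_command(cmd, 0x94)))
```

The sketch sticks to the "Get" commands on purpose. The matching "Set" commands exist in the specification too, but writing to an undocumented sensor is riskier and is left out here.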
Here are the example dumps of the event log:
ID   | TimeStamp            | Sensor Name | Sensor Type | Description
=====|======================|=============|=============|========================
3639 | 07/15/2024, 07:12:16 | Unknown     | Processor   | Thermal Trip - Asserted
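For anyone who wants to count or time these events from an exported log, here is a small Python sketch that parses rows in the pipe-separated layout shown above. It assumes exactly the five columns from this dump and a month/day/year timestamp; other firmware versions may export a different layout. The function names are illustrative.

```python
from collections import Counter
from datetime import datetime


def parse_event_line(line):
    """Parse one 'ID | TimeStamp | Sensor Name | Sensor Type | Description' row.

    Returns a dict, or None for header, separator and malformed lines.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5 or not fields[0].isdigit():
        return None
    event_id, stamp, sensor_name, sensor_type, description = fields
    return {
        "id": int(event_id),
        "time": datetime.strptime(stamp, "%m/%d/%Y, %H:%M:%S"),
        "sensor_name": sensor_name,
        "sensor_type": sensor_type,
        "description": description,
    }


def count_by_sensor(lines):
    """Count events per (sensor type, sensor name, description)."""
    counts = Counter()
    for line in lines:
        event = parse_event_line(line)
        if event is not None:
            key = (event["sensor_type"], event["sensor_name"], event["description"])
            counts[key] += 1
    return counts
```

Feeding the whole exported log to `count_by_sensor` should make it easy to confirm that the Processor/Unknown thermal trip entries dominate the log.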
I was scratching my head over what sensor 94h was, because even the Linux “sensors” command could not display any CPU sensor with abnormal temps.
Only HWiNFO on Windows could display a CPU temp sensor with abnormal values, and I quickly found a correlation between this sensor (“CPU die average”) going over 70 °C and events from the “unknown” Processor sensor 94h being logged by the BMC.
I reported yet another issue to ASRock Rack support (now using the English/USA region) with an even greater level of detail; hopefully they will reply now. My previous case has been unanswered for over a week.
If you don’t mind, can you post the output of ipmitool sdr elist all?
It probably won’t show up there, but it is slightly more comprehensive than the default (without all).
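Once that output is captured to a file, checking it for a given sensor number can be scripted. The Python sketch below assumes the usual `ipmitool sdr elist` line layout, where the second pipe-separated field is the sensor number in the form `93h`; if a BMC prints something different, the pattern would need adjusting. The function names are hypothetical.

```python
import re

# Second column of `sdr elist` output is assumed to look like "93h".
SENSOR_NUMBER = re.compile(r"^[0-9A-Fa-f]{2}h$")


def listed_sensor_numbers(elist_text):
    """Map sensor number -> sensor name for every parsable elist line."""
    sensors = {}
    for line in elist_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and SENSOR_NUMBER.match(fields[1]):
            sensors[int(fields[1][:-1], 16)] = fields[0]
    return sensors


def is_listed(elist_text, sensor_number):
    """True if the given sensor number appears in the elist output."""
    return sensor_number in listed_sensor_numbers(elist_text)
```

With the saved output in a string `text`, `is_listed(text, 0x94)` answers whether sensor 94h shows up when `all` is added.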
No, I could not find one. Everything is working fine except for these Thermal Trip alerts streaming into the IPMI log. I opened a few tickets with ASRock support about it, even in different regions. All remained unanswered: I only received autoresponses saying that they got my ticket and would respond within a few days, and that the primary contact for all issues is the point of sale (!).
From that perspective (quality of support) I would not recommend ASRock Rack products.