Support multiple Temperature Sensors in NVMe monitoring

Problem/Justification
Certain NVMe drives, such as Samsung SSD 970 EVO Plus, report Temperature, Temperature Sensor 1, and Temperature Sensor 2. Temperature is the same as Temperature Sensor 1, but Temperature Sensor 2 is the controller, which is more susceptible to heat issues.

Impact
Adding the Temperature Sensor data to Reporting/Netdata/Temperature Alerts will allow system administrators a better view of issues related to the controller.

User Story
Temperature Alerts would trigger on either Temperature Sensor, and Netdata and Reporting would provide history and graphing for each sensor instead of the single Temperature value.

Unfortunately, this is not the case with all NVMe drives. On my HP drive, the hotter reading is Sensor 1.

The goal should be to set an alarm threshold well below the maximum value and use that. My drive's allowed maximum temperature is 87C, but my alarm temperature is 55C. The temperature hovers around 45C when the system is really active.

I would say this feature request would be better proposed as having TrueNAS read both sensors and alarm on the higher of the two values.
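A minimal sketch of that "alarm on the highest value" idea, in Python. It assumes the readings have already been collected by some other means. The function names, the argument layout, and the default 55C threshold (borrowed from my own alarm setting above) are illustrative only and are not taken from TrueNAS or Multi-Report.

```python
from typing import List, Optional


def hottest_reading(current_c: Optional[int], sensors_c: List[int]) -> Optional[int]:
    """Return the highest reported temperature, or None if nothing was reported.

    Assumption: a value of 0 or None means the drive does not report that sensor.
    """
    candidates = [t for t in [current_c, *sensors_c] if t]  # drops None and 0
    return max(candidates) if candidates else None


def should_alarm(current_c: Optional[int], sensors_c: List[int],
                 alarm_temp_c: int = 55) -> bool:
    """True when the hottest available reading reaches the alarm threshold."""
    hottest = hottest_reading(current_c, sensors_c)
    return hottest is not None and hottest >= alarm_temp_c
```

With this approach it does not matter which sensor a given manufacturer maps to the controller: `should_alarm(38, [38, 71])` trips the alarm, while a drive that reports no extra sensors falls back to the single temperature value.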

In the meantime, you can use my little script Multi-Report to monitor these temperatures. It isn't user configurable to select either sensor; however, I could modify it to read both sensors if present and alarm on the highest value. Hmm, an idea, and not a bad one. The new version of Multi-Report is in beta and I am making fine adjustments right now.

Cheers

Your suggestion makes sense to me.

We will need to wait for full NVMe support and then see how TrueNAS supports this. Maybe it will become the highest of the two sensors. If not, we can revisit this again later. Right now 'smartd' in smartmontools 7.4 does not support NVMe drives very well. There is an update in smartmontools 7.5, but it could be a while before that is released. I am also looking right now into incorporating reading both sensors into Multi-Report 3.1.

After looking into it, the NVMe drive reports all 3 values: "current temperature" in addition to the "Sensor 1" and "Sensor 2" temperature values.

The manufacturer decides which temperature value to use when reporting the "current temperature" value. I will leave the script as-is for now and think about a way I could add this without too much change to the script. It sounds like an easy change, and in one way it is; however, that one small change affects several other areas of the script. "It snowballs" would be a good way to say it. If I do make a change, I highly doubt it will be in version 3.1, as it is in final testing now and any changes I make "should" be only to fix any bug I created. But I still like the idea; I just need to figure out how to display it consistently.

As for what TrueNAS will do, I strongly suspect they will also use the one value "current temperature".

Thanks for looking into it. Those snowballs can get pretty big pretty fast!

Done. The snowball did happen. At first I added it to my CORE setup, which has an NVMe drive that does report Sensor 1 and Sensor 2. It worked immediately. Then I tested on my SCALE system, where none of my NVMe drives have an entry for Sensor 1 or Sensor 2, and, needless to say, got a few error messages. 30 minutes later it was all fixed and working on both CORE and SCALE with my NVMe drives. If the values are present, they will be shown below the normal reported temperature; if they are not present (or are zero), you will not see any difference from today's report style. Overall it actually was an easy change. I do not track that data, meaning it will not be available in the statistical data file. Tracking it would add a bit of complexity and is not worth the rework at this point in time.
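For anyone curious, the display rule described above can be sketched roughly like this in Python. This is not the Multi-Report code; the function name, the list-of-sensors argument, and the output wording are all assumptions made for illustration.

```python
from typing import List


def temperature_lines(current_c: int, sensors_c: List[int]) -> List[str]:
    """Build the report lines for one drive.

    Sensor lines are added below the normal temperature only when the
    sensor value is present and non-zero; otherwise the output is unchanged.
    """
    lines = [f"Temperature: {current_c}C"]
    for index, value in enumerate(sensors_c, start=1):
        if value:  # a missing list entry or a zero means "not reported"
            lines.append(f"  Sensor {index}: {value}C")
    return lines
```

A drive with no sensor entries (like my SCALE drives) is passed as an empty list and produces only the single temperature line, which avoids the kind of errors I hit on the first attempt.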

It already does. I get alerts when my Samsung NVMe exceeds 70C for Sensor 2, even though Sensor 1 stays below 40C.

In the Reporting page, however, it only shows the readings for Sensor 1. In fact, it doesn't even call it "Sensor 1". It just calls it "Disk Temperature".

So when I get an alert that "nvme0 exceeded 70C", according to the GUI the drive is supposedly under 40C. :smile: The only way to check the temperature for Sensor 2 is to use nvmecontrol on the command line.
