In our environment, two TrueNAS Core servers have been running for years without any hardware changes. Recently, both servers have repeatedly reported high temperatures of 50 °C to 69 °C on the two mirrored boot disks (NVMe SSDs) in each server. The alerts occur at a random time between 6 pm and 10 pm and last about 10 minutes.
Any idea what the root cause could be, or how to go about debugging this? Thank you!
Hmm, I honestly can’t explain what could cause a temperature increase (apart from SSD activity)…
Just shooting in the dark:
either the sensors are having their “five-minute moment”, always around the same time in the evening…
or the SSD controllers are doing internal things that don’t show up externally as activity (read/write).
In the storage dashboard you can enable/disable auto trim for a pool… but this is a boot pool, and I’m 99% certain that you can’t enable auto trim for boot pools, so I think that is out of the window too.
I know that if you go to System → Boot → Stats/Settings there is a default scrub interval of 7 days. I’d double-check that it didn’t somehow get set to something else.
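If you'd rather check from the shell, the time the last boot-pool scrub finished is on the `scan:` line of `zpool status`. Here is a rough Python sketch that pulls it out so you can compare it with the 6 pm to 10 pm window. The function name is made up for illustration, and I'm assuming the usual OpenZFS wording of the `scan:` line and an English locale for the date parsing.

```python
import re
from datetime import datetime

# Assumed wording of the OpenZFS "scan:" line, e.g.
#   scan: scrub repaired 0B in 00:00:12 with 0 errors on Sun Jan  5 03:45:13 2025
SCAN_RE = re.compile(r"scan:\s+scrub repaired .*? on (.+)$", re.MULTILINE)

def last_scrub_finished(zpool_status_text):
    """Return when the last scrub finished as a datetime, or None if not found."""
    match = SCAN_RE.search(zpool_status_text)
    if not match:
        return None
    # Whitespace in the format matches runs of spaces, so "Jan  5" parses fine.
    return datetime.strptime(match.group(1).strip(), "%a %b %d %H:%M:%S %Y")
```

Feed it the text of `zpool status <your boot pool>` (the boot pool is usually called `boot-pool` or `freenas-boot`, but check yours). If the finish time lands in the evening window on the days you got alerts, the scrub is a suspect; if not, you can rule it out.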
Edit: Any chance something is blowing hot air on them during that time? If not I guess the solution could be to slap a heatsink on them, otherwise I’m out of ideas on this mystery.
Also, I noticed you’re on Core, and some of my steps are for Scale… Sorry if the screenshots don’t match the Core UI; I’m still half sure that these are reasonably similar to what they used to be on Core.
You mentioned that “monitoring via SNMP does not show any disk read/write activity”.
Have you also checked the TrueNAS Core internal UI (“Reporting”)?
As far as I know, there can be differences between what SNMP reads and what FreeBSD/TrueNAS reads internally. They don’t always expose the same sensors. Also, the TrueNAS UI smooths its data over time, so short spikes or brief internal activity may not be visible there at all.
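Since both SNMP and the UI graphs may miss short spikes, one option is to poll the drives directly every few seconds during the evening window and log the raw values. A minimal Python sketch, assuming smartmontools 7.0 or newer (for the `-j` JSON output) and that the boot disks show up as `/dev/nvme0` and `/dev/nvme1`; the function names, the CSV path and the 10-second interval are all just placeholders:

```python
import json
import subprocess
import time
from datetime import datetime

def parse_temperature(smartctl_json):
    """Return the current temperature in °C from `smartctl -j -A` output, or None."""
    try:
        data = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None
    return data.get("temperature", {}).get("current")

def read_temperature(device):
    # smartctl uses non-zero exit codes for warnings too, so the return code
    # is not checked here; we only look at the JSON it printed.
    result = subprocess.run(
        ["smartctl", "-j", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return parse_temperature(result.stdout)

def log_temperatures(devices, interval_s=10, path="nvme_temps.csv"):
    """Append one 'timestamp,device,temperature' line per device, forever."""
    with open(path, "a") as f:
        while True:
            now = datetime.now().isoformat(timespec="seconds")
            for dev in devices:
                f.write(f"{now},{dev},{read_temperature(dev)}\n")
            f.flush()
            time.sleep(interval_s)
```

Run it as root with something like `log_temperatures(["/dev/nvme0", "/dev/nvme1"])` and stop it with Ctrl+C. With 10-second samples, a 10-minute spike should show up clearly, and you can line the timestamps up against cron jobs, scrubs, or anything else scheduled in the evening.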
As @Alister already mentioned, I’m also leaning toward the theory that some SSD-internal housekeeping processes are running (garbage collection, wear leveling, block erase, consolidation, etc.). These run entirely inside the SSD controller, so no external read/write activity shows up.
When exactly a controller decides to run these tasks is not really transparent. They can occur in rough 24-hour cycles, based on power-on-hour counters, TRIM events, or simply during periods of relative inactivity.
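One way to test the "rough 24-hour cycle" idea is to take whatever temperature samples you have logged, group the hot ones into episodes, and look at the time between episode starts. A small Python sketch under my own assumptions: samples are `(datetime, temp_c)` pairs with numeric temperatures, 50 °C is the threshold (taken from the alerts described above), and hot samples less than 2 minutes apart belong to the same episode.

```python
from datetime import datetime, timedelta

def find_episodes(samples, threshold_c=50, max_gap=timedelta(minutes=2)):
    """Group samples at or above threshold_c into (start, end, peak) episodes."""
    episodes = []
    current = None
    for ts, temp in sorted(samples):
        if temp < threshold_c:
            continue
        if current and ts - current[1] <= max_gap:
            # Still within the same episode: extend it and track the peak.
            current = (current[0], ts, max(current[2], temp))
        else:
            if current:
                episodes.append(current)
            current = (ts, ts, temp)
    if current:
        episodes.append(current)
    return episodes

def gaps_between(episodes):
    """Time from the start of each episode to the start of the next one."""
    return [later[0] - earlier[0] for earlier, later in zip(episodes, episodes[1:])]
```

If the gaps cluster tightly around 24 hours, a scheduled job or a timer inside the drive looks more likely; if they wander anywhere between 6 pm and 10 pm, idle-triggered housekeeping or something environmental is the better guess.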
If that is indeed the cause, the real question is why this only started happening after years… or whether it has simply gone unnoticed until now.