Boot disks report high temperature at a random time between 6pm and 10pm

In our environment, two TrueNAS Core servers have been running for years with no hardware changes. Recently, both servers have repeatedly reported high temperatures of 50 °C to 69 °C on the two mirrored boot disks (NVMe SSDs) of each server, at a random time between 6pm and 10pm, with each episode lasting about 10 minutes.

Any idea what the root cause could be, or how to go about debugging this? Thank you!

No software changes either?
Are only the temperatures spiking?
No other signs of increased activity on those SSDs?

No software changes either.

No other signs of increased activity on the SSDs.

No increase in server CPU utilization during the temperature spikes.

Only the temperatures are spiking.

Hmm, I honestly can’t explain what could cause a temperature increase apart from SSD activity.

Just shooting in the dark:
either the sensors are having their “five-minute moment”, always around the same time in the evening,
or the SSD controllers are doing internal work that doesn’t show up externally as read/write activity.

Thanks for suggesting the possible causes.

Exactly as you mentioned, monitoring via SNMP does not show any increase in disk reads/writes during the temperature spikes.

Any chance that you have a SMART test or scrub scheduled during this time?

It could be a TRIM - that wouldn’t show up in disk activity as the OS isn’t interacting with the drive(s)

How full are the drives?

I’d check the model and firmware versions, and any SMART data. Could it be that the drives are wear leveling, or running out of cells?
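For anyone who wants to pull those fields out in one go, here is a minimal sketch that summarizes the JSON that `smartctl -j -a` prints for an NVMe drive. The key names (`model_name`, `firmware_version`, `nvme_smart_health_information_log` and its members) are what I recall from smartmontools 7.x, so treat them as assumptions and compare against your real output. The sample values are made up and not from the servers in this thread.

```python
import json


def summarize_nvme_health(smartctl_json: str) -> dict:
    """Extract model, firmware and thermal/wear fields from smartctl JSON text.

    Missing keys come back as None instead of raising, because not every
    drive or smartmontools version reports every field.
    """
    data = json.loads(smartctl_json)
    log = data.get("nvme_smart_health_information_log", {})
    return {
        "model": data.get("model_name"),
        "firmware": data.get("firmware_version"),
        "temperature_c": log.get("temperature"),
        "percentage_used": log.get("percentage_used"),
        "available_spare": log.get("available_spare"),
        # Minutes spent above the drive's warning / critical thresholds.
        "warning_temp_minutes": log.get("warning_temp_time"),
        "critical_temp_minutes": log.get("critical_comp_time"),
        "media_errors": log.get("media_errors"),
    }


# Hypothetical sample, shaped like smartctl output but with invented values.
SAMPLE = json.dumps({
    "model_name": "ExampleDrive",
    "firmware_version": "1.0",
    "nvme_smart_health_information_log": {
        "temperature": 69,
        "percentage_used": 3,
        "available_spare": 100,
        "warning_temp_time": 120,
        "critical_comp_time": 0,
        "media_errors": 0,
    },
})

if __name__ == "__main__":
    for key, value in summarize_nvme_health(SAMPLE).items():
        print(f"{key}: {value}")
```

To feed it real data, something like `smartctl -j -a /dev/nvme0 > nvme0.json` should work (the device name is an assumption). If the warning-temperature minutes keep climbing from one evening to the next, the drive itself agrees that it got hot, which would argue against a pure monitoring glitch.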

Also, you don’t need SSDs for boot; SATA works fine.

Thanks for suggesting these possibilities. Neither a scrub nor a SMART test is scheduled during this time.

Thanks for bringing up this possibility. The boot pool uses only 7 GiB of its 222 GiB. No automatic SMART task is scheduled.

How can I list scheduled TRIM tasks, and how can I check TRIM task logs, status and duration? Thanks!

In the storage dashboard you can enable/disable auto trim for a pool… but this is a boot pool & I’m 99% certain that you can’t enable auto trim for boot pools, so I think that is out of the window too.

I know that if you go to System → Boot → Stats/Settings there is a default scrub option of 7 days. I’d double check on that to see that it didn’t randomly get set to something else?

Beyond that, I’m out of ideas.

Edit: Any chance something is blowing hot air on them during that time? :stuck_out_tongue: If not I guess the solution could be to slap a heatsink on them, otherwise I’m out of ideas on this mystery.

Also I noticed you’re on Core - some of my steps are for Scale… Sorry if screenshots don’t match Core UI, I’m still half sure that these are reasonably similar to what they used to be on Core.

You mentioned that “monitoring via SNMP does not show any disk read/write activity”.
Have you also checked the TrueNAS Core internal UI (“Reporting”)?

As far as I know, there can be differences between what SNMP reads and what FreeBSD/TrueNAS reads internally. They don’t always expose the same sensors. Also, the TrueNAS UI smooths its data over time, so short spikes or brief internal activity may not be visible there at all.
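To make the smoothing point concrete, here is a small worked example. It assumes one temperature reading per minute and a single 10-minute excursion from 40 °C to 69 °C. The window sizes are purely illustrative and are not the actual averaging intervals used by the TrueNAS reporting backend.

```python
def bucket_means(samples, window):
    """Average consecutive, non-overlapping buckets of `window` samples."""
    means = []
    for start in range(0, len(samples), window):
        bucket = samples[start:start + window]
        means.append(sum(bucket) / len(bucket))
    return means


# Two hours of per-minute readings: 40 C baseline, 69 C from minute 20 to 29.
raw = [40] * 120
raw[20:30] = [69] * 10

five_minute = bucket_means(raw, 5)
hourly = bucket_means(raw, 60)

if __name__ == "__main__":
    print("raw peak:          ", max(raw))
    print("5-minute avg peak: ", max(five_minute))
    print("hourly avg peak:   ", round(max(hourly), 1))
```

With these made-up numbers the raw and 5-minute views both peak at 69, while the hourly average peaks at about 44.8, which would stay under a 50 °C alert threshold. So a graph built on coarse averages can look flat while a poller that sees individual readings raises alarms.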

As @Alister already mentioned, I’m also leaning to the theory that some SSD-internal housekeeping processes are running (garbage collection, wear leveling, block erase, consolidation, etc.). These run entirely inside the SSD controller, so no external read/write activity shows up.

When exactly a controller decides to run these tasks is not really transparent. They can occur in rough 24-hour cycles, based on power-on-hour counters, TRIM events, or simply during periods of relative inactivity.
If that is indeed the cause, the real question is why this only started happening after years… or whether it has simply gone unnoticed until now.
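One way to test the "rough 24-hour cycle" idea is to take the spike start times from the monitoring alerts and look at the gaps between them. The sketch below does that. The timestamps are hypothetical placeholders, not data from this thread, and the function names are my own.

```python
from datetime import datetime
from statistics import mean, pstdev


def interval_hours(starts):
    """Hours between consecutive spike start times, in chronological order."""
    ordered = sorted(starts)
    return [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(ordered, ordered[1:])]


def start_hours(starts):
    """Time of day of each spike as a fractional hour (18.5 means 6:30pm)."""
    return [t.hour + t.minute / 60 for t in starts]


# Hypothetical spike start times; replace with the real alert timestamps.
starts = [
    datetime(2024, 1, 1, 18, 10),
    datetime(2024, 1, 2, 20, 40),
    datetime(2024, 1, 3, 19, 5),
    datetime(2024, 1, 4, 21, 30),
]

if __name__ == "__main__":
    gaps = interval_hours(starts)
    print("gaps (h):", [round(g, 2) for g in gaps])
    print("mean gap:", round(mean(gaps), 2), "spread:", round(pstdev(gaps), 2))
    print("start hours:", [round(h, 2) for h in start_hours(starts)])
```

As a rough reading: gaps that sit tightly around 24 hours would point at something clock-driven (a cron job, a poller, an HVAC schedule), a steady drift would fit a counter inside the drive better, and a wide spread that still stays inside the evening window would fit neither neatly. It is also worth running this per server, since both machines spiking at the same moments would suggest a shared external cause.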

Thanks for pointing out the potential causes.

On our systems, autotrim is not turned on for boot-pool, as checked with the command:

zpool get autotrim 

Double-checking System → Boot → Stats/Settings confirms that the scrub interval is still the default of 7 days.

The Scale screenshots are similar to Core. Thanks for taking the time to capture and post them here!

Thanks again for your response.

Reporting on the TrueNAS Core internal UI does not show the high temperature spikes captured by monitoring via SNMP.

Thanks for recommending that we look into SSD-internal housekeeping processes and dig into why this only started happening after years.