Watchdog not working as expected

Hi all, new member here - thanks to the admins for approving my membership. I’ve been reading lots of forum content since I started my TrueNAS journey a couple of weeks ago, so many thanks to all those contributors. I have some IT experience but am a relative beginner in the Linux space.

My system:
TrueNAS Scale Dragonfish-24.04.2.2 as bare-metal host (not using virtualisation)
Motherboard - Supermicro X11SSH-LN4F
Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz
32GB matched DDR4 ECC 2400 RAM in 4 DIMMS
2 x 8TB WD Red NAS drives in mirrored pool
1 x 290GB drive as single drive pool for temp use

My problem:
After a few weeks of enjoying networked storage and SMB shares for a separate stand-alone Plex Media Server without any issues, I’ve recently been experiencing TrueNAS and the underlying Debian OS locking up, with loss of the SMB shares, no access to the web GUI and the console not responding, although the IPMI GUI is still functional through its own NIC.

A check of /var/log/messages (for TrueNAS) and /var/log/syslog (for Debian) doesn’t produce any clues for the failures. If anyone can point me at other useful logs, I can continue my fault finding, but, as my system is about as vanilla as you can get, I’ll run some memory and CPU soak tests. In the meantime, I was interested in trying to get the IPMI Watchdog working, as the Supermicro website/manual suggests it should.
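
In case it helps others with the same fault finding, I’ve also been looking at the systemd journal from the previous boot; I’m assuming journal persistence is enabled, which I’m not certain of on SCALE:

journalctl -b -1 -p warning -e   # warnings and errors from the previous boot, if the journal persists
journalctl -k -b -1              # kernel messages from the previous boot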

My understanding is there are two parts: a hardware Watchdog component, enabled in the BIOS and which should respect a board jumper (JWD1), and a software component in the Debian OS, accessed via ipmitool.

From the descriptions I’ve read on here and other IPMI-related posts, I understood that enabling the Watchdog in BIOS will result in the system restarting once the 5-minute Watchdog timer runs down, and that concurs with my checks after booting up.

I understood that ipmitool commands within the Debian shell, either at the console or from the TrueNAS system options, could interrogate, report on and reset this BIOS Watchdog timer. From the command help, I can see that there are three main functions that I could get working:
ipmitool mc watchdog off - turns off the timer
ipmitool mc watchdog reset - resets it to the default or custom time setting
ipmitool mc watchdog get - reports on the status of the timer

I think there’s also an ipmitool mc watchdog set command with multiple options, but I couldn’t get the syntax to work; however, the get command gives useful output:

user@truenas[~]$ ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x04)
Watchdog Timer Is: Stopped
Watchdog Timer Logging: On
Watchdog Timer Action: Power Cycle (0x03)
Pre-timeout interrupt: None
Pre-timeout interval: 0 seconds
Timer Expiration Flags: (0x10)
* SMS/OS
Initial Countdown: 300.0 sec
Present Countdown: 0.0 sec

However, the get report above does not reflect the BIOS timer when the Watchdog is enabled in the BIOS, and the BIOS Watchdog happily restarts the system 5 mins after boot-up regardless of what is set by ipmitool.

I was expecting the get report to display the current BIOS timer countdown as the default 5 mins less the boot-up time, e.g. around 150 secs, because my system takes around 150 secs to boot, but the ipmitool get command only shows the timer as set/reset by ipmitool itself.

It also seems that the BIOS timer takes priority and there’s no way to extend or reset it from ipmitool. Likewise, if the Watchdog is disabled in BIOS and I use ipmitool to set a timer, it will carry out the configured action when its timer runs out regardless of the BIOS setting; in my case I’ve kept the default 300 secs and set it to power cycle.
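
My working assumption from this is that once the ipmitool timer is armed, something has to keep resetting it before it expires or the configured action will fire. A minimal sketch of what I mean (run as root; the 60-second interval is just my guess and only needs to be comfortably shorter than the timeout):

# keep restarting the countdown so the power-cycle action never fires
while true; do
    ipmitool mc watchdog reset   # restart the countdown from the initial value
    sleep 60
done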

As I couldn’t get the ipmitool mc watchdog set command to work, I used the raw command to choose SMS/OS for the Timer Use and Power Cycle for my action.
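
For completeness, this is roughly what I mean by the raw command. My reading of the IPMI spec is that Set Watchdog Timer is NetFn 0x06, command 0x24, followed by six data bytes, and that Reset Watchdog Timer (0x22) then starts the countdown - so please treat the byte values below as an illustration of what I did rather than gospel:

# Set Watchdog Timer: use = SMS/OS (0x04), action = power cycle (0x03),
# no pre-timeout, don't clear expiration flags, countdown 0x0BB8 = 3000 x 100 ms = 300 sec
ipmitool raw 0x06 0x24 0x04 0x03 0x00 0x00 0xb8 0x0b

# Reset Watchdog Timer - starts (or restarts) the countdown
ipmitool raw 0x06 0x22

# Get Watchdog Timer - same information as "ipmitool mc watchdog get"
ipmitool raw 0x06 0x25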

I think the watchdog and IPMI features on these Supermicro boards are fantastic, and accessing the IPMI web GUI separately from the host OS is really useful, but I can’t understand why I can’t get the watchdog to work as I understand it should.

I’ve searched through this forum for Watchdog queries and other fora for IPMI/Watchdog questions, but haven’t found anyone that describes the same understanding of how it should work as I do, so I’m wondering if my understanding is wrong?

If anyone can help, it would also help me better understand the capabilities and limitations of the Debian and IPMI features.

Very many thanks

I’ve just noted that this may be better in the Hardware forum; if admins prefer and can advise how, I’ll happily move it there.

What is your boot drive?

Hi there, it’s a 2.5" SSD, capacity 128GB.

Regards

Connected to a SATA port?

It is connected to a SATA port; all my drives use the SATA ports. Is there something about the boot drive being an SSD on SATA that causes the watchdog a problem? I’m just trying to understand your line of thought on this.
Many thanks

I have experienced exactly these same symptoms, however mine is due to my boot drive being on a USB SSD - so I just wanted to check that this was not the cause of your issues (which it clearly isn’t).

What was happening on my system is that the USB drive was disconnecting, losing me access to the boot-pool, and because the boot-pool’s failure mode is set to wait until the pool reappears (rather than the other options of crashing the O/S or failing the I/O with an error), the O/S hangs waiting for you to reconnect the boot pool.

The way I diagnosed this was to connect a monitor and look at what messages were on the console when it hung. (If you have lost access to the boot pool, they won’t be written to syslog and so won’t be visible after you have rebooted.)
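
For what it’s worth, I believe the setting involved is the ZFS failmode property on the pool, which can be checked with:

zpool get failmode boot-pool   # "wait" (the default) blocks I/O until the device returns; "continue" and "panic" are the alternatives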

I am not a user of IPMI or watchdog, but here is what I gleaned from a bit of research:

  1. ls -l /dev and see whether there is a directory or file called watchdog in there. If there is, then the Debian watchdog module has loaded (which is necessary). (On my Scale system it is loaded even though my hardware doesn’t have a watchdog function.)
  2. ls -l /dev and see whether there are directories or files called ipmi* in there. If there are, then Debian has recognised the hardware and loaded a driver.
  3. I have not found anything on how to configure a watchdog daemon in Scale, but it appears that it may be as simple as:
    • Enable Watchdog in BIOS with (say) a 3 minute timeout.
    • Create a Cron job to run once per minute like “echo Still up > /dev/watchdog” (see the rough sketch below).
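
A rough sketch of those checks and the cron idea, untested on my side, and with the caveat that what happens when /dev/watchdog is closed depends on the driver’s magic-close/nowayout behaviour, so treat it as a starting point only:

ls -l /dev/watchdog* /dev/ipmi*   # confirm the watchdog and IPMI device nodes exist
lsmod | grep -E 'ipmi|wdt'        # confirm the relevant kernel modules are loaded

# example crontab entry (crontab -e, as root) to pet the watchdog every minute:
* * * * * echo "Still up" > /dev/watchdog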

Thank you, yes I can see why a USB drive as a boot drive might create that same type of problem - your description makes sense.

I will take a look at your watchdog checks/suggestions, but as watchdog is a sticking plaster solution, I’ve been trying to investigate the root cause and although it’s early days, and at the risk of tempting fate, my TrueNAS box has now been up for over 48hrs.

If this thread helps anyone else: I found a reference to the BIOS C-states setting(s) during my research, and at some point in this investigative journey I had reset all BIOS settings to factory defaults, which presumably enabled that option if it had previously been disabled.

I’d read, and possibly misunderstood, that there are several levels of C-state sleep for the CPU and depending on how far it goes down the levels, there’s a chance it can’t undo each step to wake up.
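
For anyone following the same path, this is how I’ve been checking what the OS actually does with C-states; the paths are the standard Linux cpuidle sysfs entries, so I’m assuming they exist on the SCALE kernel:

grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name    # which C-states the kernel exposes for CPU 0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage   # how many times each state has been entered

I’ve also read that deep states can be capped from the kernel command line with e.g. intel_idle.max_cstate=1 as an alternative to the BIOS switch, but I haven’t tried that myself.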

The CPU going into a deep hibernation mode concurs with the symptoms I’ve seen: the PC’s IPMI interface remains alive and all MB temp sensors seem to be happy, but the OS doesn’t respond, isn’t pingable and the TrueNAS Web GUI doesn’t respond.

I’d noted that the console continues displaying the last messages and the list of TrueNAS console options (e.g. network adapter config, TrueNAS and system shells, etc.), but the Supermicro boards use a BMC chip (ASPEED) to generate, and presumably refresh, the VGA output with whatever the last screen contents were, and the keyboard doesn’t do anything.

So, disabling C-states may be a good call; it also presumably means the NAS HDDs continue to operate rather than spinning down, which should reduce the wear and tear of restarting.

As a safety net, I have, however, set up a shell process to write the first 10 lines from the top command to a new log file every minute, so if/when the server hangs, I should have a record of CPU and process activity up to just before the crash.
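
In case it helps anyone, something along these lines is what I mean; the log path is just my own example and would need to sit somewhere that stays writable:

# crontab entry (crontab -e, as root): append a timestamped snapshot of the top of "top" output once a minute
* * * * * { date; top -b -n 1 | head -n 10; } >> /mnt/tank/logs/top_snapshot.log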

My next considerations are: did disabling the C-states setting actually resolve my problem, or would my shell script running every minute stop the CPU from ever considering going to sleep anyway, i.e. might I have overruled the C-states setting with my process?

I’m going to leave it as is and, if it doesn’t hang after a week, I’ll first stop the shell script and run for a while, then re-enable C-states, run the script again and see what the outcome is.

I think my final test might be to replace the CPU with a lower order Xeon and see if the processor itself might have been faulty.

I also have a second Supermicro X10SLH MB which I’m planning to use as a data replication server away from the house, so I could try my theories on that as a sandbox while the main server is happy to continue running.

In summary: these older Supermicro motherboards are really good value for money - ECC memory for NAS write assurance, multiple Ethernet adapters and a remotely accessible IPMI interface for checking/monitoring board sensors make them great candidates for a HomeLab/TrueNAS/OPNsense server. The BIOS features are understandably more comprehensive than those of desktop/gaming MBs, and maybe there’s a compromise C-state level that I can enable to optimise energy use without essentially putting the system into an unwakeable sleep, and which keeps the drive platters spinning, possibly slower, in some kind of low-energy mode.

Ideally I need to find good Supermicro training videos or material to help me understand what all these settings are - does anyone on the forum know if any are freely available in the public domain?

Thanks again, for now.

Yes - it definitely could be C states.