I can see from the dashboard that TN crashed at about 0300 this morning. It has recovered (one of my five apps didn’t restart successfully and a USB stick didn’t remount properly).
I had some email alerts, the first one at midnight. (Note that CRUZER128 is a SanDisk USB stick which I plugged in a few days ago to see if it would suit some experiments with Docker etc., but it has remained unused and unneeded.)
The first email at 00:00:
TrueNAS @ truenas
New alerts:
Pool CRUZER128 state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
I then had six successful backups to a second NAS elsewhere; these happened at 03:00 this morning as per the usual schedule.
The next email at 03:18:
TrueNAS @ truenas
New alert:
Pool CRUZER128 state is OFFLINE: None
The following alert has been cleared:
Pool CRUZER128 state is ONLINE: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
I have just run @joeschmuck's Multi-Report (it runs every Friday and it was all OK, but I thought I should check, seeing as it is no longer Friday!) and it again reports that all is well.
Other than that, I don’t know where to begin to find out why this happened. It is the first unplanned restart in three or four years of using TN.
I haven’t tried anything else, not wanting to mess up any TN logs etc. so I would very much welcome some guidance.
Thanks in advance for any simpleton-level guidance. If it helps, I would like to file a bug report if relevant to do so.
Successfully exported/disconnected CRUZER128. All data on that pool was destroyed
I hope to find out what caused the crash & reboot. I’d presume that a simple USB stick failure (in some manner) shouldn’t have taken the whole TrueNAS appliance with it, so I presume there’s another cause which either upset the USB stick and also caused a reboot, or the USB stick aspect is a coincidence (which seems unlikely).
Nah, it shouldn’t take down the whole system, but it’s unclear what the problem was. I don’t know if there’s some sort of odd hardware failure with it that could have caused a kernel panic.
The error could be the result of something else that caused the reboot.
Still, I’d start with removing unused hardware. Hopefully that resolves it, and if not we’ll have it ruled out…
USB drives (of whatever type) can disconnect, and then (with the default failmode setting of wait) your TrueNAS system hangs. The alternative settings for failmode are panic, which you probably don’t want, and continue, which is probably what you do want.
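For anyone wanting to check this on their own pool, here is a minimal Python sketch wrapping the standard zpool get and zpool set commands for the failmode property. The helper names (get_failmode_cmd, set_failmode_cmd, read_failmode) are made up for illustration. It assumes the pool is still imported, that zpool is on the PATH, and that it is run as root from the shell. Changing pool properties outside the TrueNAS UI is at your own risk.

```python
import subprocess

# The three values the ZFS failmode pool property accepts.
VALID_FAILMODES = ("wait", "continue", "panic")


def get_failmode_cmd(pool):
    """Build the command that prints only the failmode value of a pool."""
    # -H: no header line; -o value: print only the value column.
    return ["zpool", "get", "-H", "-o", "value", "failmode", pool]


def set_failmode_cmd(pool, mode):
    """Build the command that changes the failmode of a pool."""
    if mode not in VALID_FAILMODES:
        raise ValueError(f"failmode must be one of {VALID_FAILMODES}, got {mode!r}")
    return ["zpool", "set", f"failmode={mode}", pool]


def read_failmode(pool, run=subprocess.run):
    """Run zpool and return the current failmode as a string.

    The 'run' parameter exists only so the call can be replaced in a dry run.
    """
    result = run(get_failmode_cmd(pool), capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Dry run: show the command lines without touching any pool.
print(" ".join(get_failmode_cmd("CRUZER128")))
print(" ".join(set_failmode_cmd("CRUZER128", "continue")))
```

As written, the script only prints the two command lines. Calling read_failmode("CRUZER128") would actually invoke zpool and fail if that pool is no longer imported.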
It is unclear from the details you provided whether the system rebooted or simply recovered the USB connection, but I assume that you have checked the uptime and found it was a reboot. This is IMO likely to have been done by a hardware watchdog timer noting that the system was unresponsive and rebooting it.
It’s interesting to learn what can happen with USB problems. I won’t be using sticks to experiment with again! (I knew it was a bad idea and it’s not something I have tried before.) Thanks for this overview.
What other details ought I to have provided? I thought the uptime and the change in CPU burden coinciding would be sufficient detail but where could I find something which is better proof of a reboot?
Perhaps I should have also posted this screenshot:
edit: looking at /var/log/messages with less, I could see entries being written every few seconds, up until this gap:
Mar 9 03:12:44 truenas kernel: br-629e16e0258b: port 1(vethc9e5e67) entered disabled state
Mar 9 03:17:14 truenas syslog-ng[3302]: syslog-ng starting up; version='3.38.1'
I presume that those missing four and a half minutes coincide with the crash & reboot (but unfortunately there’s no info in-between, of course, to show what caused it).
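Rather than scrolling through the file by eye, the same gap can be found with a short script. This is only a sketch: it assumes the classic syslog timestamp format shown in the two lines above (month, day, time, no year), so the year is a placeholder I supply, and the 60-second threshold and the function name find_gaps are my own choices.

```python
from datetime import datetime, timedelta


def find_gaps(lines, threshold=timedelta(seconds=60), year=2024):
    """Yield (previous, current, gap) for consecutive entries further apart than threshold.

    The log lines carry no year, so 'year' is an assumed placeholder that is
    only needed to build datetime objects for the subtraction.
    """
    previous = None
    for line in lines:
        parts = line.split(None, 3)  # month, day, time, rest of line
        if len(parts) < 3:
            continue
        stamp = " ".join(parts[:3])
        try:
            current = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current


# Demo using the two log lines quoted above.
sample = [
    "Mar 9 03:12:44 truenas kernel: br-629e16e0258b: port 1(vethc9e5e67) entered disabled state",
    "Mar 9 03:17:14 truenas syslog-ng[3302]: syslog-ng starting up; version='3.38.1'",
]
for before, after, gap in find_gaps(sample):
    print(f"{before:%H:%M:%S} -> {after:%H:%M:%S} (gap {gap})")
```

To scan the real log, pass an open file instead of the sample list, e.g. find_gaps(open("/var/log/messages", errors="replace")). On the two lines above it reports the single gap of four and a half minutes.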
Thanks for these ideas. The only aspect which has changed is me using that CRUZER128 USB stick … the other h/w is unaltered (and the various official apps are presumably “docker isolated” and very unlikely to be responsible).
I think the best experiment, for now, is to leave things alone, not use any USB sticks (it was a bad idea) and see if it crashes again. I think it will be OK.
Are there any other logs which I ought to investigate? I suppose it is unrealistic to hope to find a log entry which says "USB failure - I'm going to crash soon" or similar “smoking gun” evidence!