How to debug TrueNAS Scale when it becomes unresponsive from time to time

ChristBKK · May 29, 2024, 2:20am

Hi there,

I have been using TrueNAS Scale for the last few months, and after some time I got used to it. Normally everything runs smoothly and I have no problems.

Lately, and to be honest maybe ever since I started using it, TrueNAS Scale becomes unresponsive from time to time. I can’t connect to the WebGUI anymore, the wired Ethernet connection no longer shows as “active” on the router, and even when I connect a monitor and keyboard, the monitor doesn’t show anything (it doesn’t get a signal).

This happens irregularly: sometimes I get an uptime of 3 days, sometimes as much as 14 days. My longest uptime streak was around 2 weeks.

I first thought it was something with my router, but as I don’t get any response from the monitor/keyboard either, I now think it has to do with the system/hardware itself.

My question now: how can I debug this and see “older” log files from before the system freezes? Are they saved somewhere? I found some commands, but they only show the most recent log.

I am really a bit out of ideas about what to do; in the end only a power cycle gets the computer running again. It’s all new hardware and everything runs smoothly until it freezes. Temps are all okay; I check them frequently.

Thanks a lot if you have any idea how to debug a problem like that.

ABain · May 29, 2024, 10:29am

What version are you running?

ChristBKK · May 29, 2024, 10:46am

TrueNAS-SCALE-23.10.2

ABain · May 29, 2024, 10:52am

Have you checked the logs in the debug file? Settings > Advanced > Save debug

ChristBKK · May 29, 2024, 11:15am

Thanks a lot, that is what I was searching for.

I opened the error log and it shows errors from when the system froze.

Now the question is what happened:

May 29 08:22:14 truenas kernel: ixgbe 0000:04:00.0: Adapter removed
May 29 08:22:15 truenas kernel: ixgbe 0000:04:00.1: Adapter removed
May 29 09:10:00 truenas kernel: hid-generic 0003:1532:028D.0004: No inputs registered, leaving
May 29 09:10:01 truenas kernel: Error: Driver ‘pcspkr’ is already registered, aborting…
May 29 09:10:01 truenas kernel:
May 29 09:10:01 truenas kernel: NVRM: The NVIDIA GeForce GT 710 GPU installed in this system is
NVRM: supported through the NVIDIA 470.xx Legacy drivers. Please
NVRM: visit Unix Drivers | NVIDIA for more
NVRM: information. The 535.54.03 NVIDIA driver will ignore
NVRM: this GPU. Continuing probe…
May 29 09:10:29 truenas blkmapd[2602]: open pipe file /run/rpc_pipefs/nfs/blocklayout failed: No such file or directory
May 29 09:10:29 truenas systemd[1]: Failed to start nslcd.service - LSB: LDAP connection daemon.
May 29 09:10:32 truenas libvirtd[2934]: invalid argument: cannot find architecture arm
May 29 09:10:32 truenas haproxy[4946]: backend be_20 has no server available!
May 29 09:10:33 truenas haproxy[4946]: backend be_32 has no server available!
May 29 09:10:33 truenas haproxy[5267]: backend be_20 has no server available!
May 29 09:11:00 truenas kernel: NVRM: The NVIDIA GeForce GT 710 GPU installed in this system is
NVRM: supported through the NVIDIA 470.xx Legacy drivers. Please
NVRM: visit Unix Drivers | NVIDIA for more
NVRM: information. The 535.54.03 NVIDIA driver will ignore
NVRM: this GPU. Continuing probe…
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:22 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:24 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:24 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available

ChristBKK · May 29, 2024, 11:21am

I think it’s actually this one, as the rest came from my attempt to connect my display to the server.

May 29 08:22:14 - truenas kernel: ixgbe 0000:04:00.0: Adapter removed

So it seems this is connected to the NIC which I installed. I guess I have to start searching there or use the internal LAN again, which worked without problems. I got myself an Intel X520 dual-port NIC, but I guess it either has problems or is the cheap Chinese variant everyone is warning about.
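For anyone who wants to check their own error log for the same symptom, here is a minimal Python sketch that picks out kernel lines of the form shown above. The regular expression and the function name `find_adapter_removals` are illustrative assumptions, not part of TrueNAS; the sample lines are copied from the log excerpt in this thread.

```python
import re

# Matches classic syslog lines such as:
#   May 29 08:22:14 truenas kernel: ixgbe 0000:04:00.0: Adapter removed
# The pattern is an assumption based on the excerpt in this thread.
LINE_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) kernel: (?P<driver>\S+) (?P<pci>[0-9a-f:.]+): Adapter removed"
)


def find_adapter_removals(lines):
    """Return (timestamp, driver, pci_address) for each 'Adapter removed' line."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            hits.append((match.group("stamp"), match.group("driver"), match.group("pci")))
    return hits


# Sample input taken from the error log posted above.
sample = [
    "May 29 08:22:14 truenas kernel: ixgbe 0000:04:00.0: Adapter removed",
    "May 29 08:22:15 truenas kernel: ixgbe 0000:04:00.1: Adapter removed",
    "May 29 09:10:32 truenas haproxy[4946]: backend be_20 has no server available!",
]

for stamp, driver, pci in find_adapter_removals(sample):
    print(f"{stamp}: driver {driver} lost the device at PCI address {pci}")
```

To run it against a real log, pass an open file object instead of `sample`; the path of the error log inside the debug archive is not stated in this thread, so it is left out here.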

@ABain again thanks a lot I was exactly searching that error log to debug and find some clues

dasdreHmomenT · September 9, 2024, 6:56pm

Hi everyone,

I had the same symptoms for the second time a couple of days ago: Server going dark, no network activity, no output on the DP, not reachable, all the reporting stops there when I look back. Only way to get it back is a hard reset.
As I said, this was the second time, the first time was a couple of weeks earlier. Aside from that the server was running smoothly since mid July.

Unlike @ChristBKK, I did not find anything in the logs: the last message in the messages log is from 15 hours earlier, and there is nothing at all from that day in kern.log.
In the error log, the last entry is over nine hours before the system went dark at approximately 15:30:

Sep  6 02:26:42 wutzi systemd[1]: Failed to start apt-daily.service - Daily apt download activities.
Sep  6 06:00:43 wutzi systemd[1]: Failed to start apt-daily-upgrade.service - Daily apt upgrade and clean activities.

In syslog the last message before going dark is:

Sep  6 15:17:01 wutzi CRON[1702789]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep  6 15:20:07 wutzi systemd[1]: Starting sysstat-collect.service - system activity accounting

So, I don’t actually see anything wrong there.
Does anyone have an idea where to look next?
Did you find the cause of your problem, @ChristBKK ?
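One way to pin down when a machine went dark is to look for the longest silence between two consecutive syslog lines. The following is a minimal Python sketch of that idea. Classic syslog timestamps carry no year, so the `year` parameter is an assumption the caller has to supply, and a log that spans New Year would confuse it. The third sample line is hypothetical and only stands in for the first message after the hard reset.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines, year):
    """Return (gap, line_before, line_after) for the longest silence, or None."""
    prev_time = prev_line = None
    best = None
    for raw in lines:
        line = raw.rstrip("\n")
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # continuation lines without a timestamp are skipped
        if prev_time is not None:
            gap = stamp - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return best


sample = [
    "Sep  6 15:17:01 wutzi CRON[1702789]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)",
    "Sep  6 15:20:07 wutzi systemd[1]: Starting sysstat-collect.service - system activity accounting",
    "Sep  6 18:00:00 wutzi kernel: hypothetical first line after the reset",
]

result = largest_gap(sample, 2024)
if result is not None:
    gap, before, after = result
    print(f"Longest silence: {gap}")
    print(f"  last line before: {before}")
    print(f"  first line after: {after}")
```

The line reported as "last line before" is the most recent thing the system managed to log, which narrows down the time window even when, as here, it does not name a cause.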

The system is a repurposed Lenovo ThinkServer with a Xeon E3-1225 v6, 64GB ECC RAM, an LSI 9211-8i flashed to IT mode (originally a Dell PERC H310), 3 SSDs (2 in a mirror for apps and scratch, 1 as the boot disk) and 7 2.5-inch HDDs (5 in a raidz2, 2 in a mirror).

Kind regards,
ht

dasdreHmomenT · September 15, 2024, 6:39pm

Hi again,

it happened again, not even one week after the last time.
And again I could not find anything in the logs that explains what happened. I have turned the log level up to “debug”, but only afterwards. Hopefully that logs something of interest if it happens again.

Has anyone else seen something like this, the system just going completely dark but without turning off or logging any errors?

Kind regards,
ht

i8degrees · December 28, 2024, 1:17am

Hi,

After reading through your issue and noticing the particular version that you reported, a similar, perhaps even the same, event that occurred with my install some time ago came flooding back to my mind. As I am pulling this from memory, forgive me for the details that are surely still locked away somewhere, but without further ado…

(Paraphrased) ixgbe adapter removed
May 29 09:11:27 truenas kernel: IPVS: rr: UDP 172.17.0.10:53 - no destination available

I recall the last line above repeatedly appearing in my logs. The end result was that I would lose connection to my TrueNAS applications (k3s, Docker). In addition, I wasn’t able to configure any bridges, such as for VMs, or even use the sandbox feature via JailMaker; this was the final deal breaker for me. Oh, and I think that my Ethernet adapter was using the r8169 driver, or perhaps even r8125.

I believe that I ultimately had to downgrade my kernel to version 6.1-ish, or perhaps even 5.15.x. I cannot recall the version precisely, but I know it was older than the default in 23.10.2. Your mileage will likely differ, as we probably have entirely different hardware setups.

I cannot say how happy I was to eventually find a version of TrueNAS Scale that allowed me to stop building my own kernel via the git repos provided by TrueNAS! I believe that I held onto my custom kernel all the way until 24.x – with several 23.x updates before, such as 23.10.10 and so on.

All this madness prompted me to migrate my Applications from the official GUI method back to Docker runtime. (Now I am almost done having relocated all of it to another VM!)

Honestly, if you continue to have this issue (or perhaps you did fix it?), do whatever you can to update to 24.x ASAP. Rolling your own kernel is not for the faint of heart. I have many years of experience with that, so it wasn’t the end of the world for me, but I really wasn’t expecting to ever need to do this to begin with! Anyhow, just my two cents.

Oh, lastly… What does your log file at /var/log/middlewared.log show? Personally, I have found most of my woes to relate back to this subsystem in one way or another. Often when the system becomes unresponsive, I can tie it back to that service because of the CPU time spiking like crazy from there! Generally stuck in a recursion loop…
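If you want a quick look at the end of that file without scrolling through all of it, a small Python sketch like the following prints the last lines. The helper name `tail` and the default of 50 lines are arbitrary choices for illustration; the only thing taken from this post is the path /var/log/middlewared.log, and the sketch makes no assumption about the format of the lines inside it.

```python
from collections import deque
from pathlib import Path


def tail(path, count=50):
    """Return the last `count` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as handle:
        # deque with maxlen keeps only the most recent `count` lines in memory
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


log_path = Path("/var/log/middlewared.log")  # path as mentioned in this post
if log_path.exists():
    for line in tail(log_path, 50):
        print(line)
else:
    print(f"{log_path} not found on this machine")
```

After a freeze and hard reset, the interesting part is usually the block of lines just before the first entries of the new boot, so a larger `count` may be needed.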

I wasn’t aware of the log file path during the time that I ran the custom kernel -_- such is life! You live and learn.

Anywho, hopefully my comments help somebody out there…

ChristBKK · December 28, 2024, 1:35am

I switched back to the internal motherboard NIC (Realtek) and took out the X520 NIC. No problems anymore, and uptimes of over 120 days (I update the server sometimes, which resets the uptime).

Honestly, the X520 is just not working reliably with TrueNAS and I accepted that. You have to go with the newer versions of that NIC, imo.

Okedokey · December 28, 2024, 7:59am

LambSauce · January 10, 2025, 12:23pm

Hi everyone.

I had a similar issue, also on TrueNAS-SCALE-23.10.2.
The system became unresponsive, with no network traffic, and was inaccessible until reboot.
The weird part is that the uptime counter would reset every 4-5 days, but I didn’t get an alert that the system had rebooted. If I rebooted the system manually it did send out the email, so I don’t know if it even restarted or not.
1-2 months later I got about 10 checksum errors on all drives and I was like “ohh crap”.
I tried everything and narrowed it down to the CPU (Ryzen 5 3600). I replaced it (with a Ryzen 5 PRO 4650G) and have had no problems for a month now.
The weird part is that the old CPU works perfectly fine in Windows after hours of stress testing, so I have no idea what’s wrong with it.

dasdreHmomenT · March 8, 2025, 12:39pm

Just an update from my part: Disabling C-States in BIOS resolved the problem (sadly not before I tried various other things that cost money).
I don’t know why, but at least on my repurposed Lenovo with the Xeon E3-1225 v6 and C-States enabled, it would just sporadically go dark, mostly during or after some heavy I/O. That led me to believe the proprietary PSU was too weak (which it probably is, but there is no cost-conscious way to exchange it).
I have also updated the BIOS/microcode (which is ridiculously non-trivial for Lenovos if you don’t have Windows) as it was a bit behind, but I haven’t yet tested with C-States re-enabled. Maybe some time.
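To see which idle states the running Linux kernel actually exposes, the cpuidle entries in sysfs can be read. Below is a minimal Python sketch, assuming the standard Linux layout under /sys/devices/system/cpu/cpu0/cpuidle; the function name `list_cstates` is made up for illustration. What the kernel reports here does not have to match the BIOS setting one to one, so treat the output as a hint rather than as proof.

```python
from pathlib import Path


def list_cstates(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (directory, state_name, disabled) for each cpuidle state of one CPU."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []  # no cpuidle support exposed (or not a Linux system)
    # Keep only stateN directories and sort them numerically (state2 before state10).
    dirs = [p for p in base.glob("state*") if p.name[5:].isdigit()]
    dirs.sort(key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in dirs:
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states


for directory, name, disabled in list_cstates():
    status = "disabled" if disabled else "enabled"
    print(f"{directory}: {name} ({status})")
```

If only a shallow state such as POLL or C1 is listed, deeper states are probably not in use; a longer list suggests C-States are still active for that core.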