Unresponsive system

Hello everyone,

Hoping someone can help me here. My system randomly becomes unresponsive, and I’m not sure whether the cause is an actual hardware fault or a mismatch between the kernel and the underlying hardware.
All stress tests (CPU and memory) pass, even after multiple runs of different durations.

What I can see in the logs right before the system becomes unresponsive:

Dec 01 01:57:17 truenas dhclient[2805]: No DHCPOFFERS received.
Dec 01 01:57:17 truenas dhclient[2805]: No working leases in persistent database - sleeping.
Dec 01 02:00:00 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 8
Dec 01 02:00:01 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Dec 01 02:00:01 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Dec 01 02:00:01 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Dec 01 02:00:06 truenas dhclient[2834]: DHCPREQUEST for 192.168.0.133 on enp7s0 to 192.168.0.1 port 67
Dec 01 02:00:06 truenas dhclient[2834]: DHCPACK of 192.168.0.133 from 192.168.0.1
Dec 01 02:00:06 truenas dhclient[2834]: bound to 192.168.0.133 -- renewal in 1257 seconds.
Dec 01 02:00:08 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 18
Dec 01 02:00:26 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 17
Dec 01 02:00:43 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 17
Dec 01 02:01:00 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 1
Dec 01 02:01:01 truenas dhclient[2792]: No DHCPOFFERS received.
Dec 01 02:01:01 truenas dhclient[2792]: No working leases in persistent database - sleeping.
Dec 01 02:01:59 truenas kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 01 02:01:59 truenas kernel: rcu:         2-...!: (2 GPs behind) idle=7808/0/0x0 softirq=826928/826928 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         3-...!: (1 GPs behind) idle=d5e8/0/0x0 softirq=800749/800750 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         5-...!: (7 GPs behind) idle=8708/0/0x0 softirq=849775/849775 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         6-...!: (1 GPs behind) idle=b1b0/0/0x0 softirq=976752/976752 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         7-...!: (2 GPs behind) idle=5548/0/0x0 softirq=840242/840242 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         10-...!: (0 ticks this GP) idle=2d08/0/0x0 softirq=776747/776747 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         11-...!: (2 GPs behind) idle=55d0/0/0x0 softirq=759279/759279 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         13-...!: (0 ticks this GP) idle=d448/0/0x0 softirq=830465/830465 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         14-...!: (1 GPs behind) idle=4f30/0/0x0 softirq=882860/882860 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         15-...!: (0 ticks this GP) idle=c160/0/0x0 softirq=879266/879266 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         (detected by 8, t=5254 jiffies, g=4610813, q=448 ncpus=16)
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 2:
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 3:
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 5:
Dec 01 02:01:59 truenas kernel: ------------[ cut here ]------------
Dec 01 02:01:59 truenas kernel: WARNING: CPU: 12 PID: 339994 at kernel/time/hrtimer.c:1050 hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: Modules linked in: squashfs(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) nf_c>
Dec 01 02:01:59 truenas kernel:  watchdog(E) button(E) sg(E) loop(E) drm(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) hid_generic(E) >
Dec 01 02:01:59 truenas kernel: CPU: 12 PID: 339994 Comm: Kestrel Timer Tainted: P           OE      6.6.44-production+truenas #1
Dec 01 02:01:59 truenas kernel: Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming K7/AX370-Gaming K7, BIOS F53d 09/02/2024
Dec 01 02:01:59 truenas kernel: RIP: 0010:hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: Code: 7f 48 0f 4d f1 49 39 f0 4d 0f 4c c1 48 01 ca 4c 39 d1 49 0f 4c ca 4c 89 47 18 48 39 ca 49 0f 4c d1 48 89 57 20 e9 c5 25 a8 00 <0f> 0b e9 be 25 a>
Dec 01 02:01:59 truenas kernel: RSP: 0018:ffffb4d00527fb50 EFLAGS: 00010002
Dec 01 02:01:59 truenas kernel: RAX: 0000000000000000 RBX: ffff94b05aea12d8 RCX: 0000000000000018
Dec 01 02:01:59 truenas kernel: RDX: 0000000005f5e100 RSI: 00002dd513c3974f RDI: ffff94b05aea1318
Dec 01 02:01:59 truenas kernel: RBP: ffff94b05aea1318 R08: 00002ddf09831410 R09: 0000000000000001
Dec 01 02:01:59 truenas kernel: R10: 0000000000000000 R11: 00000009f5bf7cc1 R12: 0000000005f5e100
Dec 01 02:01:59 truenas kernel: R13: ffff94b05aea12d8 R14: ffffffffffffd770 R15: 0000000000000000
Dec 01 02:01:59 truenas kernel: FS:  00007f5b47ca9b30(0000) GS:ffff94b75ed00000(0000) knlGS:0000000000000000
Dec 01 02:01:59 truenas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 02:01:59 truenas kernel: CR2: 000000c0004cb010 CR3: 00000001fd044000 CR4: 00000000003506e0
Dec 01 02:01:59 truenas kernel: Call Trace:
Dec 01 02:01:59 truenas kernel:  <TASK>
Dec 01 02:01:59 truenas kernel:  ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel:  ? __warn+0x81/0x130
Dec 01 02:01:59 truenas kernel:  ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel:  ? report_bug+0x171/0x1a0
Dec 01 02:01:59 truenas kernel:  ? handle_bug+0x41/0x70
Dec 01 02:01:59 truenas kernel:  ? exc_invalid_op+0x17/0x70
Dec 01 02:01:59 truenas kernel:  ? asm_exc_invalid_op+0x1a/0x20
Dec 01 02:01:59 truenas kernel:  ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel:  start_cfs_bandwidth.part.0+0x33/0x50
Dec 01 02:01:59 truenas kernel:  __account_cfs_rq_runtime+0x8b/0x100
Dec 01 02:01:59 truenas kernel:  dequeue_entity+0x38/0x3f0
Dec 01 02:01:59 truenas kernel:  ? srso_return_thunk+0x5/0x5f
Dec 01 02:01:59 truenas kernel:  dequeue_task_fair+0xc2/0x3f0
Dec 01 02:01:59 truenas kernel:  __schedule+0x5c1/0x950
Dec 01 02:01:59 truenas kernel:  ? srso_return_thunk+0x5/0x5f
Dec 01 02:01:59 truenas kernel:  ? hrtimer_start_range_ns+0x246/0x350
Dec 01 02:01:59 truenas kernel:  schedule+0x5b/0xa0
Dec 01 02:01:59 truenas kernel:  futex_wait_queue+0x64/0x90
Dec 01 02:01:59 truenas kernel:  futex_wait+0x189/0x270
Dec 01 02:01:59 truenas kernel:  ? __pfx_hrtimer_wakeup+0x10/0x10
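
To make the stall report above easier to read at a glance, here is a small Python sketch that extracts the stalled CPUs and the summary line from an excerpt like this one. The function name and the regular expressions are my own and were written only against the lines shown here, so other kernel versions may format these messages differently.

```python
import re

# Per-CPU stall line, e.g. "rcu:   2-...!: (2 GPs behind) idle=7808/0/0x0 ..."
STALL_LINE = re.compile(r"rcu:\s+(\d+)-\S+: \(([^)]*)\)")
# Summary line, e.g. "(detected by 8, t=5254 jiffies, g=..., q=448 ncpus=16)"
SUMMARY_LINE = re.compile(r"detected by (\d+), t=(\d+) jiffies.*ncpus=(\d+)")


def summarize_rcu_stall(log_text):
    """Return ({cpu: state}, (detecting_cpu, jiffies, ncpus) or None)."""
    stalled = {}
    summary = None
    for line in log_text.splitlines():
        match = STALL_LINE.search(line)
        if match:
            stalled[int(match.group(1))] = match.group(2)
            continue
        match = SUMMARY_LINE.search(line)
        if match:
            summary = tuple(int(value) for value in match.groups())
    return stalled, summary


# A few lines copied from the log above (not the full excerpt).
SAMPLE = """\
Dec 01 02:01:59 truenas kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 01 02:01:59 truenas kernel: rcu:         2-...!: (2 GPs behind) idle=7808/0/0x0 softirq=826928/826928 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         3-...!: (1 GPs behind) idle=d5e8/0/0x0 softirq=800749/800750 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         5-...!: (7 GPs behind) idle=8708/0/0x0 softirq=849775/849775 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu:         (detected by 8, t=5254 jiffies, g=4610813, q=448 ncpus=16)
"""

stalled_cpus, stall_summary = summarize_rcu_stall(SAMPLE)
print(sorted(stalled_cpus), stall_summary)
```

Run against the full excerpt, this shows that 10 of the 16 CPUs were reported as stalled at the same moment.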

This is on TrueNAS SCALE on bare metal. The crashes were more frequent when I ran Proxmox (I tried several kernel versions) with TrueNAS virtualized. The Proxmox logs were:

Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: Machine check events logged
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: baa0000000060135
Nov 30 09:25:56 prx02 kernel: fbcon: Taking over console
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 2d030800 IPID b000000000 
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1732955151 SOCKET 0 APIC d microcode 8001138
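
The Bank 0 status value in that machine check can be partly decoded by hand. Below is a minimal Python sketch that reads only the flag bits 63 to 57 and the low 16-bit error code, which are laid out the same way across x86 machine-check status registers. The function and table names are my own, and the vendor-specific fields (including the extended error code) are deliberately not interpreted.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Return the set flag names (high bit first) and the 16-bit MCA error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "mca_error_code": status & 0xFFFF}


# Status value taken from the "Bank 0" line in the Proxmox log.
decoded = decode_mci_status(0xBAA0000000060135)
print(decoded["flags"], hex(decoded["mca_error_code"]))
```

For this value the UC and PCC bits are set, so the hardware itself reported an uncorrected error. That does not identify the root cause on its own, but it points away from a purely software problem.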

System specs:
TrueNAS version: ElectricEel-24.10.0.2
Motherboard: Gigabyte AX370-Gaming K7 (on the latest BIOS, F53d)
CPU: Ryzen 7 1700X
RAM: Corsair Vengeance LPX DDR4 RAM 32GB
PSU: 650W gold rated
Peripherals: PCIe SATA controller, 2.5GbE Intel NIC, 3x 500GB SSDs, 1x 4TB HDD

Let me know if more information is needed!

Thanks in advance!

1st gen Ryzen CPUs need some BIOS tweaks to work reliably. Go into the BIOS and change the following settings.
For older BIOS versions, disable:
ErP Ready
AMD Cool&Quiet
Global C-states

For newer BIOS versions:
set Power Supply Idle Control to Typical Current Idle
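
If you want to compare what the kernel sees before and after changing these settings, the Linux cpuidle sysfs directory lists the idle states exposed for each CPU. Here is a rough Python sketch that reads them. The function name is hypothetical, and I am assuming the firmware settings change which states show up here, which may not hold on every board (the power supply idle option in particular may not change this list at all).

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return (state_dir, name, disabled) for each idle state of one CPU."""
    states = []
    for state_dir in sorted(Path(cpu_dir, "cpuidle").glob("state[0-9]*")):
        name = (state_dir / "name").read_text().strip()
        disable_file = state_dir / "disable"
        # "1" in the disable file means the state was switched off at runtime.
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    # Prints nothing if the kernel exposes no cpuidle directory for cpu0.
    for entry in list_cpuidle_states():
        print(entry)
```

Fewer or shallower states after the change would suggest the setting took effect, but an unchanged list does not prove it was ignored.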

Thanks for your quick response! I did just that.
Let’s see how stable it runs! :crossed_fingers: