Hello everyone,
Hoping someone can help me here. My system randomly becomes unresponsive, and I’m not sure whether the cause is an actual hardware fault or an incompatibility between the kernel and the underlying hardware.
All stress tests (CPU and memory) pass, even after multiple runs of different durations.
This is what I see in the logs right before the system becomes unresponsive:
Dec 01 01:57:17 truenas dhclient[2805]: No DHCPOFFERS received.
Dec 01 01:57:17 truenas dhclient[2805]: No working leases in persistent database - sleeping.
Dec 01 02:00:00 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 8
Dec 01 02:00:01 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Dec 01 02:00:01 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Dec 01 02:00:01 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Dec 01 02:00:06 truenas dhclient[2834]: DHCPREQUEST for 192.168.0.133 on enp7s0 to 192.168.0.1 port 67
Dec 01 02:00:06 truenas dhclient[2834]: DHCPACK of 192.168.0.133 from 192.168.0.1
Dec 01 02:00:06 truenas dhclient[2834]: bound to 192.168.0.133 -- renewal in 1257 seconds.
Dec 01 02:00:08 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 18
Dec 01 02:00:26 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 17
Dec 01 02:00:43 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 17
Dec 01 02:01:00 truenas dhclient[2792]: DHCPDISCOVER on enp5s0 to 255.255.255.255 port 67 interval 1
Dec 01 02:01:01 truenas dhclient[2792]: No DHCPOFFERS received.
Dec 01 02:01:01 truenas dhclient[2792]: No working leases in persistent database - sleeping.
Dec 01 02:01:59 truenas kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Dec 01 02:01:59 truenas kernel: rcu: 2-...!: (2 GPs behind) idle=7808/0/0x0 softirq=826928/826928 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 3-...!: (1 GPs behind) idle=d5e8/0/0x0 softirq=800749/800750 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 5-...!: (7 GPs behind) idle=8708/0/0x0 softirq=849775/849775 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 6-...!: (1 GPs behind) idle=b1b0/0/0x0 softirq=976752/976752 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 7-...!: (2 GPs behind) idle=5548/0/0x0 softirq=840242/840242 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 10-...!: (0 ticks this GP) idle=2d08/0/0x0 softirq=776747/776747 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 11-...!: (2 GPs behind) idle=55d0/0/0x0 softirq=759279/759279 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 13-...!: (0 ticks this GP) idle=d448/0/0x0 softirq=830465/830465 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 14-...!: (1 GPs behind) idle=4f30/0/0x0 softirq=882860/882860 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: 15-...!: (0 ticks this GP) idle=c160/0/0x0 softirq=879266/879266 fqs=0 (false positive?)
Dec 01 02:01:59 truenas kernel: rcu: (detected by 8, t=5254 jiffies, g=4610813, q=448 ncpus=16)
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 2:
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 3:
Dec 01 02:01:59 truenas kernel: Sending NMI from CPU 8 to CPUs 5:
Dec 01 02:01:59 truenas kernel: ------------[ cut here ]------------
Dec 01 02:01:59 truenas kernel: WARNING: CPU: 12 PID: 339994 at kernel/time/hrtimer.c:1050 hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: Modules linked in: squashfs(E) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) nf_c>
Dec 01 02:01:59 truenas kernel: watchdog(E) button(E) sg(E) loop(E) drm(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) hid_generic(E) >
Dec 01 02:01:59 truenas kernel: CPU: 12 PID: 339994 Comm: Kestrel Timer Tainted: P OE 6.6.44-production+truenas #1
Dec 01 02:01:59 truenas kernel: Hardware name: Gigabyte Technology Co., Ltd. AX370-Gaming K7/AX370-Gaming K7, BIOS F53d 09/02/2024
Dec 01 02:01:59 truenas kernel: RIP: 0010:hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: Code: 7f 48 0f 4d f1 49 39 f0 4d 0f 4c c1 48 01 ca 4c 39 d1 49 0f 4c ca 4c 89 47 18 48 39 ca 49 0f 4c d1 48 89 57 20 e9 c5 25 a8 00 <0f> 0b e9 be 25 a>
Dec 01 02:01:59 truenas kernel: RSP: 0018:ffffb4d00527fb50 EFLAGS: 00010002
Dec 01 02:01:59 truenas kernel: RAX: 0000000000000000 RBX: ffff94b05aea12d8 RCX: 0000000000000018
Dec 01 02:01:59 truenas kernel: RDX: 0000000005f5e100 RSI: 00002dd513c3974f RDI: ffff94b05aea1318
Dec 01 02:01:59 truenas kernel: RBP: ffff94b05aea1318 R08: 00002ddf09831410 R09: 0000000000000001
Dec 01 02:01:59 truenas kernel: R10: 0000000000000000 R11: 00000009f5bf7cc1 R12: 0000000005f5e100
Dec 01 02:01:59 truenas kernel: R13: ffff94b05aea12d8 R14: ffffffffffffd770 R15: 0000000000000000
Dec 01 02:01:59 truenas kernel: FS: 00007f5b47ca9b30(0000) GS:ffff94b75ed00000(0000) knlGS:0000000000000000
Dec 01 02:01:59 truenas kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 01 02:01:59 truenas kernel: CR2: 000000c0004cb010 CR3: 00000001fd044000 CR4: 00000000003506e0
Dec 01 02:01:59 truenas kernel: Call Trace:
Dec 01 02:01:59 truenas kernel: <TASK>
Dec 01 02:01:59 truenas kernel: ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: ? __warn+0x81/0x130
Dec 01 02:01:59 truenas kernel: ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: ? report_bug+0x171/0x1a0
Dec 01 02:01:59 truenas kernel: ? handle_bug+0x41/0x70
Dec 01 02:01:59 truenas kernel: ? exc_invalid_op+0x17/0x70
Dec 01 02:01:59 truenas kernel: ? asm_exc_invalid_op+0x1a/0x20
Dec 01 02:01:59 truenas kernel: ? hrtimer_forward+0x7b/0xc0
Dec 01 02:01:59 truenas kernel: start_cfs_bandwidth.part.0+0x33/0x50
Dec 01 02:01:59 truenas kernel: __account_cfs_rq_runtime+0x8b/0x100
Dec 01 02:01:59 truenas kernel: dequeue_entity+0x38/0x3f0
Dec 01 02:01:59 truenas kernel: ? srso_return_thunk+0x5/0x5f
Dec 01 02:01:59 truenas kernel: dequeue_task_fair+0xc2/0x3f0
Dec 01 02:01:59 truenas kernel: __schedule+0x5c1/0x950
Dec 01 02:01:59 truenas kernel: ? srso_return_thunk+0x5/0x5f
Dec 01 02:01:59 truenas kernel: ? hrtimer_start_range_ns+0x246/0x350
Dec 01 02:01:59 truenas kernel: schedule+0x5b/0xa0
Dec 01 02:01:59 truenas kernel: futex_wait_queue+0x64/0x90
Dec 01 02:01:59 truenas kernel: futex_wait+0x189/0x270
Dec 01 02:01:59 truenas kernel: ? __pfx_hrtimer_wakeup+0x10/0x10
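To make the stall report above easier to read, here is a rough Python sketch that pulls the stalled CPU numbers and the detecting CPU out of the `rcu:` lines. It assumes journal output in exactly the format shown; the function name `summarize_stall` and the regular expressions are my own and are not part of any tool.

```python
import re

# Per-CPU stall line, e.g. "rcu: 2-...!: (2 GPs behind) ..."
STALL_RE = re.compile(r"rcu:\s+(\d+)-\S*:\s+\(")
# Summary line, e.g. "rcu: (detected by 8, t=5254 jiffies, ... ncpus=16)"
SUMMARY_RE = re.compile(r"rcu:\s+\(detected by (\d+),.*\bncpus=(\d+)\)")


def summarize_stall(lines):
    """Collect stalled CPU ids, the detecting CPU, and the CPU count."""
    stalled, detector, ncpus = [], None, None
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalled.append(int(m.group(1)))
            continue
        m = SUMMARY_RE.search(line)
        if m:
            detector, ncpus = int(m.group(1)), int(m.group(2))
    return {"stalled": stalled, "detected_by": detector, "ncpus": ncpus}


# Three lines copied from the log above as a small sample.
sample = [
    "Dec 01 02:01:59 truenas kernel: rcu: 2-...!: (2 GPs behind) idle=7808/0/0x0 softirq=826928/826928 fqs=0 (false positive?)",
    "Dec 01 02:01:59 truenas kernel: rcu: 10-...!: (0 ticks this GP) idle=2d08/0/0x0 softirq=776747/776747 fqs=0 (false positive?)",
    "Dec 01 02:01:59 truenas kernel: rcu: (detected by 8, t=5254 jiffies, g=4610813, q=448 ncpus=16)",
]
summary = summarize_stall(sample)
print(summary)
```

In the full excerpt, 10 of the 16 CPUs are reported as stalled (2, 3, 5, 6, 7, 10, 11, 13, 14, 15), with CPU 8 as the detector.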
This is on TrueNAS SCALE running on bare metal. The crashes were more frequent when I ran Proxmox (I tried several kernel versions) with TrueNAS virtualized. The Proxmox logs showed:
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: Machine check events logged
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: baa0000000060135
Nov 30 09:25:56 prx02 kernel: fbcon: Taking over console
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 2d030800 IPID b000000000
Nov 30 09:25:56 prx02 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1732955151 SOCKET 0 APIC d microcode 8001138
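In case it helps with reading the Bank 0 status value, here is a minimal Python sketch that decodes only the architectural flag bits (63 down to 57) and the low 16-bit error code of a machine-check status word. The name `decode_status` is my own, and the remaining bits are model-specific and deliberately left undecoded.

```python
# Architectural machine-check status flags, highest bit first.
FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register contents are valid
    (58, "ADDRV"),  # ADDR register contents are valid
    (57, "PCC"),    # processor context may be corrupt
]


def decode_status(status: int) -> dict:
    """Return the set flag names and the 16-bit MCA error code."""
    set_flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    return {"flags": set_flags, "error_code": status & 0xFFFF}


# Status value from the "Bank 0" line in the Proxmox log above.
decoded = decode_status(0xBAA0000000060135)
print(decoded["flags"], hex(decoded["error_code"]))
```

For the value above, this reports VAL, UC, EN, MISCV and PCC as set, so the event was logged as an uncorrected error with the processor context flagged as possibly corrupt.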
System specs:
TrueNAS version: ElectricEel-24.10.0.2
Motherboard: Gigabyte AX370-Gaming K7 (on the latest BIOS, F53d)
CPU: AMD Ryzen 7 1700X
RAM: Corsair Vengeance LPX DDR4, 32 GB
PSU: 650 W, Gold rated
Peripherals: PCIe SATA controller, 2.5GbE Intel NIC, 3x 500 GB SSDs, 1x 4 TB HDD
Let me know if more information is needed!
Thanks in advance!