Hi there,
I am reaching out to the community today to seek help with a very strange behaviour I have recently been observing on my TrueNAS SCALE bare-metal system.
This box has been running for the past 4 years with no issue whatsoever, starting with TrueNAS CORE on a brand-new box that I assembled during the pandemic. I migrated to SCALE when it was released and never looked back.
Hardware
Supermicro X11SCH
Intel(R) Xeon(R) E-2234 CPU @ 3.60GHz
64 GB memory
Intel 82599ES 10-Gigabit SFI/SFP+ NIC, one port connected and configured as br0 to allow access to the host from a VM (following Setting Up a Network Bridge | TrueNAS Documentation Hub; a quick bridge check is shown right after this list)
2x 120 GB SSDs in a MIRROR boot pool
4x 6 TB disks in a RAIDZ1 data pool
2x 2 TB SSDs in a MIRROR high-performance data pool
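For completeness, this is roughly how I verify the bridge from the TrueNAS shell (a minimal sketch using standard iproute2 commands; the interface names are from my setup):
# Show the bridge and its member port
ip link show br0
bridge link show
# Confirm the host IP sits on the bridge, not on the physical NIC
ip -br addr show br0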
Software
Dragonfish-24.04.2.5
1 virtual machine with the following specs:
Virtual CPUs: 2
Cores: 2
Threads: 2
Memory Size: 16 GiB
Boot Loader Type: UEFI
System Clock: LOCAL
1x AHCI zvol (120 GB)
1x VirtIO NIC attached to br0
1x 1920x1080 display
1x CD-ROM drive with the OS installation media
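For reference, the full VM definition can also be dumped from the TrueNAS shell with libvirt's standard tools. This is only a sketch: virsh list and dumpxml are standard libvirt commands, but the socket path below is my assumption of where SCALE keeps it, and <vm-name> stands in for the actual VM name.
# List defined VMs (socket path is an assumption for TrueNAS SCALE; adjust if it differs)
virsh -c "qemu+unix:///system?socket=/run/truenas_libvirt/libvirt-sock" list --all
# Dump the full libvirt XML for the docker-host VM
virsh -c "qemu+unix:///system?socket=/run/truenas_libvirt/libvirt-sock" dumpxml <vm-name>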
On the issue:
The VM is running Fedora CoreOS version 41.20241027.3.0
Linux coreos 6.11.5-300.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 22 20:11:15 UTC 2024 x86_64 GNU/Linux
The sole purpose of the VM is to act as a Docker host; the OS runs nothing besides Docker.
It runs 36 containers, including Portainer.
The containers have various purposes and performance needs: some very low (mrtg, acme), some a little higher (FreshRSS, phpMyAdmin, alpine-cron, MariaDB), and a few more demanding (GitLab, Nextcloud, Mastodon, Transmission, borgmatic, Syncthing).
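In case it helps correlate load with the freezes, this is the kind of snapshot I take inside the VM while it is still responsive (standard Docker CLI; the log path is just an example from my setup):
# One-shot snapshot of per-container CPU, memory and I/O usage
docker stats --no-stream
# Same snapshot logged every 30 s, to see what was busy just before a freeze
while true; do date; docker stats --no-stream; sleep 30; done >> /var/home/core/docker-stats.log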
This very VM has been running for the better part of the past 4 years, first on CORE then on SCALE, with no issue whatsoever. Starting approximately 3 weeks ago, the VM began freezing at random times, usually when a higher-performance container was doing something, but not necessarily. It has happened while borgmatic was running its backup check and while Syncthing was running a consistency check, but also during less demanding tasks such as adding a torrent to Transmission (as happened today when I wanted to download an Ubuntu ISO).
What happens is not that easy to describe, but I would qualify it as a “freeze” in the sense that the VM becomes unresponsive: the display attached to it, when opened in the TrueNAS interface, is frozen, and its services no longer respond. Sometimes a few services will still answer, but mostly nothing can be done anymore: no ability to SSH into the VM, and Portainer will sometimes still respond but cannot stop or start any container.
I have tried both Host Passthrough and Host Model CPU modes for the VM. What I noticed today, though, is that I was able to get some kind of debug trace in Host Model, whereas in Host Passthrough there was no noticeable information anywhere.
By reproducing the issue today in Host Model, while already connected to the VM via SSH, I was able to grab the following when I added my torrent to Transmission (the capture commands are shown after the traces):
dmesg
Nov 17 15:41:58 coreos kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 78s! [kworker/u16:7:101418]
Nov 17 15:41:58 coreos kernel: CPU#2 Utilization every 4s during lockup:
Nov 17 15:41:59 coreos kernel: #1: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #2: 100% system, 1% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #3: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #4: 100% system, 0% softirq, 0% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: #5: 100% system, 0% softirq, 1% hardirq, 0% idle
Nov 17 15:41:59 coreos kernel: Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
Nov 17 15:41:59 coreos kernel: ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
Nov 17 15:41:59 coreos kernel: CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Tainted: G L 6.11.5-300.fc41.x86_64 #1
Nov 17 15:41:59 coreos kernel: Tainted: [L]=SOFTLOCKUP
Nov 17 15:41:59 coreos kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Nov 17 15:41:59 coreos kernel: Workqueue: writeback wb_workfn (flush-0:279)
Nov 17 15:41:59 coreos kernel: RIP: 0010:nfs_page_group_unlock+0x28/0x40 [nfs]
Nov 17 15:41:59 coreos kernel: Code: 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf 48 8b 43 38 <f6> c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9 e2 79 f1 dd
Nov 17 15:41:59 coreos kernel: RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
Nov 17 15:41:59 coreos kernel: RAX: 0000000000000027 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
Nov 17 15:42:00 coreos kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
Nov 17 15:42:00 coreos kernel: RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
Nov 17 15:42:00 coreos kernel: R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
Nov 17 15:42:00 coreos kernel: R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
Nov 17 15:42:00 coreos kernel: FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
Nov 17 15:42:00 coreos kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 15:42:00 coreos kernel: CR2: 00003c020063e000 CR3: 000000028a42a001 CR4: 0000000000370ef0
Nov 17 15:42:00 coreos kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 17 15:42:00 coreos kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 17 15:42:00 coreos kernel: Call Trace:
Nov 17 15:42:00 coreos kernel: <IRQ>
Nov 17 15:42:08 coreos kernel: ? watchdog_timer_fn.cold+0x233/0x311
Nov 17 15:42:08 coreos kernel: ? __pfx_watchdog_timer_fn+0x10/0x10
Nov 17 15:42:11 coreos kernel: ? __hrtimer_run_queues+0x113/0x280
Nov 17 15:42:15 coreos kernel: ? kvm_clock_get_cycles+0x18/0x30
Nov 17 15:42:19 coreos kernel: ? hrtimer_interrupt+0xfa/0x210
Nov 17 15:42:21 coreos kernel: ? __sysvec_apic_timer_interrupt+0x52/0x100
Nov 17 15:42:25 coreos kernel: ? sysvec_apic_timer_interrupt+0x6c/0x90
Nov 17 15:42:44 coreos kernel: </IRQ>
Nov 17 15:43:02 coreos kernel: <TASK>
journalctl
[ 8520.270364] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u16:7:101418]
[ 8520.271777] CPU#2 Utilization every 4s during lockup:
[ 8520.272344] #1: 84% system, 16% softirq, 1% hardirq, 0% idle
[ 8520.273036] #2: 98% system, 2% softirq, 1% hardirq, 0% idle
[ 8520.273717] #3: 100% system, 1% softirq, 0% hardirq, 0% idle
[ 8520.274405] #4: 100% system, 0% softirq, 1% hardirq, 0% idle
[ 8520.275130] #5: 100% system, 1% softirq, 1% hardirq, 0% idle
[ 8520.275821] Modules linked in: bluetooth rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace veth xt_nat xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat bridge stp llc nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype nft_compat nf_tables rfkill sunrpc intel_rapl_msr intel_rapl_common intel_pmc_core intel_vsec pmt_telemetry pmt_class kvm_intel kvm rapl virtio_balloon i2c_piix4 qxl drm_ttm_helper i2c_smbus ttm joydev nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci overlay erofs netfs loop xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 virtio_net net_failover failover ata_generic pata_acpi nvme_tcp nvme_fabrics nvme_keyring nvme_core serio_raw nvme_auth be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua
[ 8520.275890] ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg virtio_console
[ 8520.275901] CPU: 2 UID: 0 PID: 101418 Comm: kworker/u16:7 Not tainted 6.11.5-300.fc41.x86_64 #1
[ 8520.275904] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
[ 8520.275908] Workqueue: writeback wb_workfn (flush-0:279)
[ 8520.275918] RIP: 0010:nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.275955] Code: 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 53 48 89 fb 48 8b 7f 50 48 39 df 74 05 e8 65 ff ff ff 48 8d 7b 38 f0 80 63 38 bf <48> 8b 43 38 f6 c4 10 75 06 5b c3 cc cc cc cc be 06 00 00 00 5b e9
[ 8520.275957] RSP: 0018:ffffb8d1c9e4f820 EFLAGS: 00000206
[ 8520.275961] RAX: 0000000000000001 RBX: ffff97ae029a0800 RCX: ffffffffffffffe0
[ 8520.275963] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff97ae029a0838
[ 8520.275964] RBP: ffff97ae029a0838 R08: 0000000000000018 R09: 0000000000000000
[ 8520.275966] R10: 0000000000000000 R11: 000000000003dc8c R12: ffff97ae029a0800
[ 8520.275967] R13: fffff1c5c01a0000 R14: 0000000000000001 R15: 0000000000000000
[ 8520.275969] FS: 0000000000000000(0000) GS:ffff97b22a500000(0000) knlGS:0000000000000000
[ 8520.275970] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8520.275972] CR2: 00003c020063e000 CR3: 0000000341ad2004 CR4: 0000000000370ef0
[ 8520.275975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8520.275976] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8520.275978] Call Trace:
[ 8520.275980] <IRQ>
[ 8520.275985] ? watchdog_timer_fn.cold+0x233/0x311
[ 8520.275990] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 8520.275993] ? __hrtimer_run_queues+0x113/0x280
[ 8520.275996] ? kvm_clock_get_cycles+0x18/0x30
[ 8520.275999] ? hrtimer_interrupt+0xfa/0x210
[ 8520.276002] ? __sysvec_apic_timer_interrupt+0x52/0x100
[ 8520.276005] ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 8520.276008] </IRQ>
[ 8520.276009] <TASK>
[ 8520.276010] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 8520.276014] ? nfs_page_group_unlock+0x24/0x40 [nfs]
[ 8520.276045] nfs_lock_and_join_requests+0x1b3/0x270 [nfs]
[ 8520.276082] nfs_page_async_flush+0x1b/0x1f0 [nfs]
[ 8520.276111] ? __pfx_nfs_writepages_callback+0x10/0x10 [nfs]
[ 8520.276141] nfs_writepages_callback+0x30/0x50 [nfs]
[ 8520.276168] write_cache_pages+0x65/0xc0
[ 8520.276172] nfs_writepages+0x162/0x2e0 [nfs]
[ 8520.276200] ? __dev_queue_xmit+0x274/0xe80
[ 8520.276203] ? selinux_ip_postroute+0x114/0x430
[ 8520.276209] do_writepages+0x7e/0x280
[ 8520.276212] __writeback_single_inode+0x41/0x340
[ 8520.276216] writeback_sb_inodes+0x21c/0x4f0
[ 8520.276223] __writeback_inodes_wb+0x4c/0xf0
[ 8520.276226] wb_writeback+0x193/0x310
[ 8520.276229] wb_workfn+0x2af/0x450
[ 8520.276233] process_one_work+0x176/0x330
[ 8520.276237] worker_thread+0x252/0x390
[ 8520.276240] ? __pfx_worker_thread+0x10/0x10
[ 8520.276243] kthread+0xcf/0x100
[ 8520.276245] ? __pfx_kthread+0x10/0x10
[ 8520.276247] ret_from_fork+0x31/0x50
[ 8520.276250] ? __pfx_kthread+0x10/0x10
[ 8520.276252] ret_from_fork_asm+0x1a/0x30
[ 8520.276256] </TASK>
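For reference, this is roughly how the traces above were captured from the SSH session (dmesg and journalctl as mentioned; the exact flags are simply what I typically use):
# Follow kernel messages live while reproducing the freeze
sudo dmesg --follow
# Kernel messages from the current boot via the journal
sudo journalctl -k -b 0 --no-pager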
I have been trying to troubleshoot this during the better part of my free time (which is scarce) over the last few weeks, but so far I have not been successful.
Any hint, idea, or other direction would be greatly appreciated.
Edit: formatted for better reading