After swapping my motherboard and CPU, my server freezes at night.
The timeline looks like this:
The server is running. At around 1 am the reporting for synced time and some other things stops, but there are no errors in syslog.
At 1:48 am the server logs an error from the scheduler.
After that it can't be reached. The video output still works and the cursor blinks, BUT I can't type anything, although Num Lock still works.
It is confusing.
Apr 9 01:54:35 truenas kernel: BUG: scheduling while atomic: swapper/2/0/0x00000000
Apr 9 01:54:35 truenas kernel: Modules linked in: iptable_filter(E) iptable_nat(E) wireguard(E) libchacha20poly1305(E) chacha_x86_64(E) poly1305_x86_64(E) curve25519_x86_64(E) libcurve25519_generic(E) libchacha(E) ip6_udp_tunnel(E) udp_tunnel(E) tun(E) nvidia_uvm(POE) xt_nat(E) xt_tcpudp(E) veth(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) nft_compat(E) nf_tables(E) libcrc32c(E) crc32c_generic(E) nfnetlink(E) br_netfilter(E) bridge(E) stp(E) llc(E) nvme_fabrics(E) sunrpc(E) binfmt_misc(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) nvidia_drm(POE) nvidia_modeset(POE) drm_kms_helper(E) video(E) nvidia(POE) overlay(E) squashfs(E) ib_core(E) intel_rapl_msr(E) intel_rapl_common(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) ghash_clmulni_intel(E) sha512_ssse3(E) sha256_ssse3(E) sha1_ssse(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) snd_intel_dspcfg(E) snd_hda_codec(E) aesni_intel(E)
Apr 9 01:54:35 truenas kernel: crypto_simd(E) snd_hda_core(E) cryptd(E) snd_hwdep(E) rapl(E) snd_pcm(E) snd_timer(E) snd(E) wmi_bmof(E) mxm_wmi(E) sp5100_tco(E) soundcore(E) ccp(E) pcspkr(E) k10temp(E) watchdog(E) evdev(E) sg(E) button(E) loop(E) drm(E) efi_pstore(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) hid_semitek(E) hid_generic(E) usbhid(E) hid(E) sd_mod(E) nvme(E) ahci(E) ahciem(E) nvme_core(E) xhci_pci(E) libahci(E) mpt3sas(E) t10_pi(E) raid_class(E) xhci_hcd(E) crc32_pclmul(E) libata(E) scsi_transport_sas(E) crc32c_intel(E) crc64_rocksoft(E) igb(E) crc64(E) i2c_piix4(E) i2c_algo_bit(E) usbcore(E) crc_t10dif(E) dca(E) scsi_mod(E) crct10dif_generic(E) crct10dif_pclmul(E) scsi_common(E) usb_common(E) crct10dif_common(E) gpio_amdpt(E) wmi(E) gpio_generic(E)
Apr 9 01:54:35 truenas kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: P OE 6.6.44-production+truenas #1
Apr 9 01:54:35 truenas kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470 Gaming K4, BIOS P10.31 08/22/2024
Apr 9 01:54:35 truenas kernel: Call Trace:
Apr 9 01:54:35 truenas kernel:
Apr 9 01:54:35 truenas kernel: dump_stack_lvl+0x47/0x60
Apr 9 01:54:35 truenas kernel: __schedule_bug+0x56/0x70
Apr 9 01:54:35 truenas kernel: __schedule+0x877/0x950
Apr 9 01:54:35 truenas kernel: ? cpuidle_enter_state+0xbd/0x440
Apr 9 01:54:35 truenas kernel: schedule_idle+0x2a/0x40
Apr 9 01:54:35 truenas kernel: do_idle+0x16e/0x270
Apr 9 01:54:35 truenas kernel: cpu_startup_entry+0x2a/0x30
Apr 9 01:54:35 truenas kernel: start_secondary+0x11e/0x140
Apr 9 01:54:35 truenas kernel: secondary_startup_64_no_verify+0x18f/0x19b
Apr 9 01:54:35 truenas kernel:
The picture shows the error I get. It says something about CPU sleep, but I have turned off everything that is recommended for AMD CPUs.
That is Deep Sleep, Global C States and PSU Idle Current. However, I have enabled ECO Mode to limit the CPU to 65 W, which shouldn't activate sleep states? The BIOS only describes it as a current limit.
I have a 3600X, the X470 K4 motherboard and 32 GB of RAM, which I tested for longer than the uptime before this crash.
Turn off Eco Mode and see what happens. Leave the rest of the settings you mentioned turned off.
Just FYI: in Netstat I have seen that the CPU was in the C1 state at that time.
This is really the reason why I only build servers with enterprise gear. I used to have these random freezes once every couple of months with my old build that I could never chase down the root cause of.
Needless to say, I haven’t experienced a random system freeze in over a decade of 24/7 operation since going full enterprise gear. It’s more expensive in upfront costs, but the peace of mind from not having to chase random crashes/errors out of the blue is priceless. It more than made up the cost I paid for all the mental costs it saved me.
Idle Current should be set to typical, not off. This assumes an updated BIOS.
After that a natural step would be memtest86 overnight.
Yes, typical is what I meant, that was just a brain typo. I ran memtest for 2 passes, which took 6 hours with 0 errors, so I would assume that during the roughly 3 hours the server was actually on before crashing, the memory was not the problem.
So I changed an option called Data Fabric C States (DF CStates for short) somewhere in the AMD CBS subsection of the BIOS, as well as the L1 and L2 streamers. Tomorrow we will see if it survives the night.
So the behaviour changed: now there is no cpu_schedule error, it just crashes without any error in syslog. So now I can maybe try to reseat the RAM and stress it again.
Update:
I have now reseated the memory, but the memory tester screen shows 32 GB of memory at a speed of only 8.6 GB/s. These are four 8 GB 2133 sticks. I googled around and they should be at around 17 GB/s, so I am now more confused.
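For reference, here is a rough sketch of where a figure like 17 GB/s comes from. It assumes the usual 64-bit (8-byte) DDR4 channel width and computes the theoretical peak only. The speed the memory tester prints is a measured value, how it is measured is not stated in this thread, and the channel configuration of this board is not known here, so the two numbers are not directly comparable. The function name is made up for illustration.

```python
def ddr_peak_gb_per_s(mega_transfers_per_s, channels=1, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    Assumption: each channel is 64 bits wide, as is standard for DDR4 DIMMs.
    """
    bytes_per_transfer = bus_width_bits // 8  # a 64-bit channel moves 8 bytes per transfer
    return mega_transfers_per_s * 1_000_000 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    single = ddr_peak_gb_per_s(2133)
    dual = ddr_peak_gb_per_s(2133, channels=2)
    print(f"DDR4-2133 single channel peak: {single:.1f} GB/s")
    print(f"DDR4-2133 dual channel peak:   {dual:.1f} GB/s")
    print(f"8.6 GB/s is {8.6 / single:.0%} of the single-channel peak")
```

So the 17 GB/s from googling matches the single-channel theoretical peak for 2133 sticks. A measured result well below the theoretical number is not by itself proof of a fault.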
Did you recently reset the CMOS? If so, it most likely reverted to a slower, more conservative setting.
I did not, but I guess this is my last problem right now. I have checked the memory completely with no errors. Right now I get kernel panics, and I can only see them on the console, which then automatically reboots after 10 seconds. I can't find them in the syslogs or anywhere else, even when downloading the DEBUG folder from the UI. Is there any place a kernel panic is saved?
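One place that may be worth checking is the kernel's pstore filesystem, since efi_pstore shows up in the module list of the log above. Below is a minimal sketch that just lists whatever records are there. It assumes pstore is mounted at the usual /sys/fs/pstore and that the firmware actually stored something during the panic, neither of which is confirmed in this thread. The function name is made up for illustration, and reading that directory normally needs root.

```python
from pathlib import Path


def list_pstore_records(pstore_dir="/sys/fs/pstore"):
    """Return (name, size_in_bytes) for each record file, sorted by name.

    Returns an empty list if the directory is missing or not readable.
    """
    root = Path(pstore_dir)
    try:
        records = [(p.name, p.stat().st_size) for p in root.iterdir() if p.is_file()]
    except (FileNotFoundError, NotADirectoryError, PermissionError):
        return []
    return sorted(records)


if __name__ == "__main__":
    found = list_pstore_records()
    if not found:
        print("no pstore records found (or not readable without root)")
    for name, size in found:
        print(f"{name}\t{size} bytes")
```

If the list is empty after a panic, this approach simply does not help on that system.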
So guys, I took a slow-motion video of my console to read the log for the kernel panic. It still says cpu enter idle, but also a lot of
exc_page_fault
asm_exc_page_fault
and
early_xen_write
I have found a tool that is available in TrueNAS called
cpupower
With it you can set idle states. I am trying this now and will report my findings tomorrow.
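To double check what a tool like this actually changed, the idle states can also be read straight from sysfs. The sketch below is read-only and assumes the standard Linux cpuidle layout under /sys/devices/system/cpu, where each state directory has a `name` and a `disable` file. The function name is made up for illustration.

```python
from pathlib import Path


def read_idle_states(cpu_root="/sys/devices/system/cpu"):
    """Return (cpu, state, name, disabled) tuples for every cpuidle state found."""
    states = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*"
    for state_dir in sorted(Path(cpu_root).glob(pattern)):
        name = (state_dir / "name").read_text().strip()
        # The "disable" file reads 0 while the state is enabled.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        cpu = state_dir.parent.parent.name  # e.g. "cpu2"
        states.append((cpu, state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    for cpu, state, name, disabled in read_idle_states():
        print(f"{cpu} {state} {name}: {'disabled' if disabled else 'enabled'}")
```

Writing to the `disable` files needs root and does not survive a reboot, so this only confirms the current state.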
So after trying pretty much everything short of swapping the CPU or motherboard, I am not going to work on this any more right now. It's just headaches and I am frustrated.
What I have tried so far:
- Testing the RAM
- Removing PCIe cards
- Changing BIOS settings to those recommended for AMD and even beyond
- Disabling C states in Linux itself
- Downgrading the BIOS to an older version that had no S3/S4 state enhancements, but to no avail
- (Edit) Reinstalling TrueNAS itself
So I am quitting for now, but maybe in a couple of weeks I will try a different CPU.
If you have more ideas please tell me, but until then I guess
Update:
I tried another motherboard (X570 chipset) with a different AMD CPU and different RAM (I just transplanted another computer in there), which worked fine for over a day.
I then swapped in the CPU from my failing X470 CPU and motherboard combo, which also worked fine for over 14 hours.
So now I know that the CPU is good, and the RAM from the other system is known good. I swapped that CPU and RAM into the problematic X470 motherboard, and it is still not working.
What I have seen is that on the X470 motherboard the ZFS ARC goes beyond the 50 % mark of my RAM, which I have never seen on any other motherboard. I have now limited it to 16 GB manually by editing
/sys/module/zfs/parameters/zfs_arc_max
with no difference, so I think this motherboard is just toast?
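For anyone repeating the ARC limit: the parameter normally lives under /sys/module/zfs/parameters/zfs_arc_max and takes a value in bytes. Here is a minimal sketch of the conversion and the write, assuming 16 GB was meant as 16 GiB. The helper names are made up for illustration, the write needs root, and a value set this way does not persist across a reboot.

```python
from pathlib import Path

GIB = 1024 ** 3  # bytes in one GiB
DEFAULT_PARAM = "/sys/module/zfs/parameters/zfs_arc_max"


def arc_max_bytes(gib):
    """Convert a limit in GiB to the byte value the parameter expects."""
    return int(gib * GIB)


def set_arc_max(gib, param_path=DEFAULT_PARAM):
    """Write the limit to the parameter file and return the value written."""
    value = arc_max_bytes(gib)
    Path(param_path).write_text(f"{value}\n")
    return value


def read_arc_max(param_path=DEFAULT_PARAM):
    """Read the current limit in bytes (0 means the built-in default)."""
    return int(Path(param_path).read_text().strip())


if __name__ == "__main__":
    # Only prints the number; call set_arc_max(16) as root to apply it.
    print(arc_max_bytes(16))
```

Reading the file back with `read_arc_max()` after writing is a quick way to confirm the value was accepted.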
Could be. I guess the next test is the motherboard.