Unstable since moving from Core

I moved from Core and things seemed fine for a few hours, but since then it has been constant freezes and hangs. It could be a coincidence that it started with the move from Core, but now I am seeing CPU errors (usually core 6) that I never saw under FreeBSD.

Any hints on what may have caused the CPU issue when it was completely stable under Core? Like 6 months without rebooting. I already updated the pools, but I may try going back to Core. I can wipe/restore everything if I need to; it's just going to take a long while.

I also ordered a couple of new CPUs since it's a bit older and a pair of E5 v2s were only 20 dollars, but it seems strange that it was completely fine under Core until I moved to Scale. Any hints or things to look at? I did try just disabling that core under Linux, following a post I found, but that didn't help.

Thanks

What kind of errors are you seeing? Logs or screenshots?

Nothing in the logs but MCE errors complaining about CPU 6. It often just completely hangs with no messages, and only a hard reboot will get it going.

These are really the only lines in kern.log I see that point to an issue:

May 12 12:23:46 freenas kernel: hid-generic 0003:0557:2221.0009: input,hidraw1: USB HID v1.00 Keyboard [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:1a.0-1.6/input1
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: Machine check events logged
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: Machine check events logged
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 11: 8800004800800092
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: TSC a363892ea944 MISC 490845df85df908c
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1715540701 SOCKET 1 APIC 20 microcode 42e
May 12 15:56:27 freenas kernel: mce: [Hardware Error]: Machine check events logged

It could just be a truly bad CPU, but I find it odd that it didn't start until I moved to Scale, literally on the first reboot. Maybe FreeBSD just handled it better. I did also try turning off hyperthreading, thinking maybe that would help somehow. It didn't.

Not sure it helps in any way, but it's a Supermicro X9 board with dual 6-core E5 v2s and 64 GB of ECC RAM. It's been rock stable until the move. I even joked with a friend that maybe it was bad timing and the solar storm killed it.

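That error looks like a memory problem rather than a CPU one. Note that "Bank 11" in the log is the machine-check bank that reported the error, not necessarily a DIMM slot. Might have nothing to do with the CPU at all.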
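
For anyone who wants to check this by hand, here is a minimal Python sketch that decodes the status value from the log above (`8800004800800092`). It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel SDM. The function name `decode_mci_status` and the dictionary keys are made up for illustration, and the corrected-error count in bits 52:38 is only meaningful on CPUs that report it.

```python
# Hypothetical helper: decode an Intel IA32_MCi_STATUS value (layout assumed
# from the Intel SDM machine-check architecture chapter).

# MMM field of a memory controller error code (000F 0000 1MMM CCCC).
MEM_TXN = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    mcacod = status & 0xFFFF  # architectural MCA error code, bits 15:0
    info = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        # Assumption: the CPU reports a corrected error count in bits 52:38.
        "corrected_count": (status >> 38) & 0x7FFF,
        "mscod": (status >> 16) & 0xFFFF,  # model-specific code, bits 31:16
        "mcacod": mcacod,
    }
    # Memory controller errors match 000F 0000 1MMM CCCC (F is a filter bit).
    if (mcacod & 0xEF80) == 0x0080:
        mmm = (mcacod >> 4) & 0x7
        cccc = mcacod & 0xF
        info["memory_error"] = MEM_TXN.get(mmm, "reserved")
        info["channel"] = None if cccc == 0xF else cccc  # 0xF = unspecified
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0x8800004800800092).items():
        print(f"{key}: {value}")
```

For the value in the log this reports a valid, corrected (not uncorrected) memory read error on channel 2, which fits a bad or marginal DIMM better than a bad core. The channel number is relative to the memory controller that reported it, so mapping it to a physical slot still needs the board manual.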

I am not seeing any errors on the RAM on the MB, but I'll try moving the modules around and see if the error moves. I would think it wouldn't always be the same CPU core number and could be any core on that CPU package, but it doesn't hurt to try, I guess. All I had done so far was reseat the RAM.
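
One way to see whether the error follows a stick when swapping modules is to read the per-DIMM error counters Linux keeps, sketched here in Python. This assumes the kernel's EDAC driver for the platform is loaded and exposes per-DIMM counters under `/sys/devices/system/edac/mc`. The function name `read_dimm_errors` is made up for illustration, and the labels depend on the board and driver, so treat it as a starting point rather than a definitive tool.

```python
from pathlib import Path


def read_dimm_errors(root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected error counts per DIMM from EDAC sysfs.

    Assumption: the layout is <root>/mcN/dimmM/{dimm_label,dimm_ce_count,
    dimm_ue_count}. Missing files are treated as empty / zero.
    """
    rows = []
    for dimm in sorted(Path(root).glob("mc*/dimm*")):
        if not dimm.is_dir():
            continue

        def read(name):
            f = dimm / name
            return f.read_text().strip() if f.is_file() else ""

        rows.append({
            "controller": dimm.parent.name,
            "dimm": dimm.name,
            "label": read("dimm_label"),
            "ce": int(read("dimm_ce_count") or 0),  # corrected errors
            "ue": int(read("dimm_ue_count") or 0),  # uncorrected errors
        })
    return rows


if __name__ == "__main__":
    for row in read_dimm_errors():
        print(row)
```

Noting the counts before and after moving a stick to another slot shows whether the errors follow the module or stay with the slot. An empty result most likely means no EDAC driver is loaded for the memory controller.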

You may be on to something as far as RAM. I moved the sticks around and got no errors, but it was still freezing up. I took half the sticks out and it's been up for 3 hours so far, so fingers crossed. Still weird that Core was fine with them, but I could still blame sunspots, maybe. Thanks

Edit: up to 9+ hours now. I am guessing one of the 4 I took out was the culprit. DDR3 ECC is so cheap these days that I just ordered 128 GB to upgrade that box, and I will throw the other sticks in a pile just in case there is a need for friends/clients, etc.