I moved from Core to Scale, and things seemed fine for a few hours, but since then it has been constant freezes and hangs. It could be a coincidence that it started with the move from Core, but now I am seeing CPU errors (usually core 6) that I never saw under FreeBSD.
Any hints on what may have caused the CPU issue when it was completely stable under Core? Like 6 months without rebooting. I already updated the pools, but I may try going back to Core. I can wipe and restore everything if I need to; it's just going to take a long while.
I also ordered a couple of new CPUs since it's a bit older and a pair of E5 v2s were only 20 dollars, but it seems strange that it was completely fine under Core until I moved to Scale. Any hints or things to look at? I did try just disabling that core under Linux, based on a post I found, but that didn't help.
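For anyone wanting to try the same thing: the post doesn't say which method was used to disable the core, but one common way on Linux is CPU hotplug through sysfs. A minimal sketch, assuming root access and that "core 6" means logical CPU 6 as the kernel numbers it. The helper name `set_cpu_online` and the `cpu_root` parameter are made up for illustration.

```python
from pathlib import Path


def set_cpu_online(cpu: int, online: bool, cpu_root: str = "/sys/devices/system/cpu") -> None:
    """Bring a logical CPU online or offline via the sysfs hotplug file.

    Needs root on a real system. cpu0 often has no 'online' file at all.
    """
    control = Path(cpu_root) / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")


# Example (as root): take logical CPU 6 offline.
# set_cpu_online(6, online=False)
```

With hyperthreading on, the sibling thread of the same physical core is a separate logical CPU (listed in `cpu6/topology/thread_siblings_list`) and would need to be taken offline too. None of this helps if the fault is really somewhere else, such as memory.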
Nothing in the logs but MCE errors complaining about CPU 6. It often just completely hangs with no messages, and only a hard reboot will get it going.
These are really the only lines I see in kern.log that point to an issue:
May 12 12:23:46 freenas kernel: hid-generic 0003:0557:2221.0009: input,hidraw1: USB HID v1.00 Keyboard [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:1a.0-1.6/input1
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: Machine check events logged
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: Machine check events logged
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 11: 8800004800800092
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: TSC a363892ea944 MISC 490845df85df908c
May 12 15:05:01 freenas kernel: mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1715540701 SOCKET 1 APIC 20 microcode 42e
May 12 15:56:27 freenas kernel: mce: [Hardware Error]: Machine check events logged
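The status word in the Bank 11 line can be picked apart by hand. Below is a rough sketch that decodes only the generic (architectural) fields of an `IA32_MCi_STATUS` value, assuming the layout documented in the Intel SDM. The model-specific bits and the MISC register are not interpreted, and the function and table names are made up for illustration.

```python
from datetime import datetime, timezone

# Flag bit positions in IA32_MCi_STATUS (architectural layout, assumed from the Intel SDM).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60, "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory-controller error code, pattern 000F 0000 1MMM CCCC.
MEM_TXN = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Decode the architectural fields of a machine-check status word."""
    info = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    info["corrected_count"] = (status >> 38) & 0x7FFF      # bits 52:38, if the CPU reports it
    info["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16, not interpreted here
    mca = status & 0xFFFF                                  # bits 15:0, MCA error code
    info["mca_code"] = mca
    if (mca & 0xEF80) == 0x0080:                           # matches 000F 0000 1MMM CCCC
        info["kind"] = MEM_TXN.get((mca >> 4) & 0x7, "reserved")
        channel = mca & 0xF
        info["channel"] = None if channel == 0xF else channel  # 0xF means "not specified"
    else:
        info["kind"] = "not a memory-controller code (not decoded here)"
        info["channel"] = None
    return info


if __name__ == "__main__":
    for key, value in decode_mci_status(0x8800004800800092).items():
        print(f"{key}: {value}")
    # The TIME field in the log is a Unix timestamp.
    print(datetime.fromtimestamp(1715540701, tz=timezone.utc).isoformat())
```

Under those assumptions, the logged value decodes as a valid, corrected (UC and PCC both clear) event with a corrected-error count of 1. Its low 16 bits, 0x0092, fit the memory-controller pattern: a memory read error on channel 2. That would point at the memory side of that socket rather than at the core itself. The TIME field works out to 19:05:01 UTC on 2024-05-12, matching the 15:05:01 local timestamp at UTC-4.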
It could just be a truly bad CPU, but I find it odd that it didn't start until I moved to Scale, literally on the first reboot. Maybe FreeBSD just handled it better. I also tried turning off hyperthreading, thinking that might help somehow. It didn't.
Not sure it helps in any way, but it's a Supermicro X9 board with dual 6-core E5 v2s and 64 GB of ECC RAM. It's been rock stable until the move. I even joked with a friend that maybe it was bad timing and the solar storm killed it.
I am not seeing any RAM errors reported on the motherboard, but I'll try moving the modules around to see if the error moves. If it were RAM, I would think the error wouldn't always land on the same core number and could show up on any core in that CPU package, but it doesn't hurt to try, I guess. All I had done so far was reseat the RAM.
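Besides whatever the board itself reports, Linux can also keep per-memory-controller ECC counters through the EDAC subsystem. A small sketch for reading them, assuming an EDAC driver for the platform is loaded so that `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` exist. The helper name `read_edac_counts` and the `edac_root` parameter are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root: str = "/sys/devices/system/edac/mc") -> dict:
    """Return corrected/uncorrected ECC error counts per memory controller.

    Returns an empty dict when no EDAC driver is loaded.
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = {"corrected": corrected, "uncorrected": uncorrected}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(f"{name}: corrected={c['corrected']} uncorrected={c['uncorrected']}")
```

A corrected count that keeps climbing on one controller while sticks are swapped between slots is a decent hint about which channel or module to pull first.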
May be on to something as far as RAM. I moved the sticks around: no errors, but it was still freezing up. Then I took half the sticks out and it's been up for 3 hours so far, so fingers crossed. Still weird that Core was fine with them, but maybe I can still blame sunspots. Thanks.
Edit: up to 9+ hours now. I am guessing one of the 4 sticks I took out was the culprit. DDR3 ECC is so cheap now that I just ordered 128 GB to upgrade that box, and I will throw the other sticks in a pile in case friends or clients ever need them.