Are these errors okay?

I’ve been running a set of 16x2 ECC RAM for years without issue. Recently I upgraded to 32x2 sticks, and I started receiving some ECC errors every 5 hours. I RMA’d the pair of sticks, and the new pair has the same issue. While I waited for the new sticks to arrive, I ran the old sticks for weeks again with 0 issues.

Are these errors okay, or should I just stick with my 32 GB total set from before? Is there anything else I can use to troubleshoot?

Example errors:

Jan 27 04:31:49 test kernel: mce: [Hardware Error]: Machine check events logged
Jan 27 04:31:49 test kernel: [Hardware Error]: Corrected error, no action required.
Jan 27 04:31:49 test kernel: [Hardware Error]: CPU:0 (17:8:2) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Jan 27 04:31:49 test kernel: [Hardware Error]: Error Addr: 0x00000000b1982dc0
Jan 27 04:31:49 test kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000005200a400103
Jan 27 04:31:49 test kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jan 27 04:31:49 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x183305 offset:0xac0 grain:64 syndrome:0x520)
Jan 27 04:31:49 test kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 28 07:51:18 test kernel: mce: [Hardware Error]: Machine check events logged
Jan 28 07:51:18 test kernel: [Hardware Error]: Corrected error, no action required.
Jan 28 07:51:18 test kernel: [Hardware Error]: CPU:0 (17:8:2) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Jan 28 07:51:18 test kernel: [Hardware Error]: Error Addr: 0x00000000b6f10dc0
Jan 28 07:51:18 test kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000005200a400103
Jan 28 07:51:18 test kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jan 28 07:51:18 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x18de21 offset:0xac0 grain:64 syndrome:0x520)
Jan 28 07:51:18 test kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Jan 29 04:10:33 test kernel: mce: [Hardware Error]: Machine check events logged
Jan 29 04:10:33 test kernel: [Hardware Error]: Corrected error, no action required.
Jan 29 04:10:33 test kernel: [Hardware Error]: CPU:0 (17:8:2) MC15_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000011b
Jan 29 04:10:33 test kernel: [Hardware Error]: Error Addr: 0x0000000263510dc0
Jan 29 04:10:33 test kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000005200a400103
Jan 29 04:10:33 test kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
Jan 29 04:10:33 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x4e6a21 offset:0xbc0 grain:64 syndrome:0x520)
Jan 29 04:10:33 test kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
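
For anyone reading along, the MC15_STATUS value in the log lines above can be unpacked by hand. The following is a minimal sketch, assuming the bit positions used by the Linux kernel's AMD machine-check decoder (bit 61 = UC, bit 53 = SyndV, bit 46 = CECC, and so on). The function names are made up for illustration, and the bit table should be checked against your kernel and CPU documentation before relying on it.

```python
# Flag bits of an MCA status register. Bits 63..57 are the generic x86 MCA
# flags; 53, 46, 45, 44 and 43 are AMD-specific. These positions are an
# assumption taken from the Linux kernel's decoder, not from this thread.
STATUS_BITS = {
    63: "Val",
    62: "Over",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    53: "SyndV",
    46: "CECC",
    45: "UECC",
    44: "Deferred",
    43: "Poison",
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the flag bits set in `status`, highest bit first."""
    return [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if status & (1 << bit)
    ]


def is_corrected(status: int) -> bool:
    """True if the entry is valid and neither uncorrected nor deferred."""
    valid = bool(status & (1 << 63))
    uncorrected = bool(status & (1 << 61))
    deferred = bool(status & (1 << 44))
    return valid and not uncorrected and not deferred


if __name__ == "__main__":
    status = 0x9C2040000000011B  # value from the log lines above
    print(decode_mca_status(status))
    print("corrected:", is_corrected(status))
```

The decoded flags should line up with the bracketed list the kernel already prints (CE, MiscV, AddrV, SyndV, CECC), so this is mainly a cross-check that every event is a corrected ECC error and not an uncorrected one.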

Check your motherboard manual and the memory QVL on the manufacturer website. You didn’t give details on the system, so we can only guess from the errors you posted.

You could try booting a live Linux image and stress testing your CPU and RAM. MemTest86 with at least 5 passes is the usual recommendation for RAM testing.

Hit up the support website and check the docs against your hardware and setup. They show a few processor lists with supported RAM. BIOS updates may be required too, so check your version and the release notes.
Check your memory timings. No overclocks, etc.

https://www.asrock.com/mb/AMD/X370%20Taichi/#Support


New RAM: NEMIX RAM 64GB (2X32GB) DDR4 3200MHZ PC4-25600 2Rx8 1.2V CL22 288-PIN ECC Unbuffered UDIMM KIT (from Amazon)

The mobo is the ASRock X370 Taichi.

I’d argue that these are not OK - I get maybe one corrected error every 6-12 months. You could clean/reseat the DIMMs, play with voltages & timings, and/or RMA. Maybe it is something as simple as a BIOS update.

Currently your ECC is working in the sense that it is correcting the errors - but that’s a lot of errors for 3 days IMO.
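
To put a number on "a lot", it can help to tally the corrected errors by day and by location. Below is a rough sketch that parses EDAC lines in the format posted earlier in this thread. The regular expression is written against those sample lines only, and the function name is hypothetical, so other kernels or log formats may need adjustments.

```python
import re
from collections import Counter

# Written against lines such as:
#   Jan 27 04:31:49 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0
#   (csrow:3 channel:0 page:0x183305 offset:0xac0 grain:64 syndrome:0x520)
EDAC_CE = re.compile(
    r"^(?P<day>\w{3}\s+\d+)\s+\S+\s+.*EDAC MC(?P<mc>\d+): (?P<count>\d+) CE "
    r".*\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_corrected_errors(lines):
    """Count corrected errors per (mc, csrow, channel) and per calendar day."""
    by_location = Counter()
    by_day = Counter()
    for line in lines:
        match = EDAC_CE.search(line.strip())
        if match is None:
            continue  # not an EDAC corrected-error line
        count = int(match["count"])
        location = (int(match["mc"]), int(match["csrow"]), int(match["channel"]))
        by_location[location] += count
        by_day[match["day"]] += count
    return by_location, by_day


if __name__ == "__main__":
    sample = [
        "Jan 27 04:31:49 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x183305 offset:0xac0 grain:64 syndrome:0x520)",
        "Jan 28 07:51:18 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x18de21 offset:0xac0 grain:64 syndrome:0x520)",
        "Jan 29 04:10:33 test kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x4e6a21 offset:0xbc0 grain:64 syndrome:0x520)",
    ]
    locations, days = tally_corrected_errors(sample)
    print(dict(locations))
    print(dict(days))
```

In the three sample events every error lands on MC0, csrow 3, channel 0, which suggests one rank or slot rather than random hits across all the memory. On a live system the same function could be fed kernel log output saved to a file.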

Have you run the obvious test, MemTest86+? Since you are experiencing errors, I’d recommend you run it for 10 complete passes. If you are lucky, you will have errors generated quickly, which will help you troubleshoot the issue.

Two things to consider here: your motherboard could be bad or your CPU could be bad. I am specifically talking about the address lines to your RAM or temperamental voltage regulators. Your RAM must be in slots A2 and B2. Have you read your User Manual? It is pretty specific about the slots and speeds you can run normally.

My advice:

  1. Run MemTest86+ (free software) and when the RAM fails, write down how long it took to fail and which test failed. Also write down the memory address that failed.
  2. Power Off, wait 30 seconds (arbitrary number), power on and run MemTest86+ again. When it fails, write everything down again.
  3. Now go into your BIOS and change the RAM clock speed to whatever speed is the next lowest value. Let’s be clear, 3200 and 2933 are considered Overclocked for your motherboard, this “could” be giving you problems. You may want to just drop it to 2667.
  4. Repeat the MemTest86+ testing. With luck it will pass and then have it pass 10 complete passes. This will take time but you must ensure your system is stable or why have it at all.
  5. If you find that the errors are still occurring, remove the RAM from slot B2 and test using only slot A2. Retest. If that fails, move the RAM to the B2 slot and retest; even though this is not specifically listed in the User Manual, it generally works.
  6. Report Back what you find out.

I would not ask you to perform ANY voltage adjustments of any kind. I have no idea what your knowledge level is, but even a 0.01VDC increase could damage components, even at such a low voltage.

Take your time troubleshooting.


Awesome post. I downclocked to 2667 and I have no errors. I feel like this speed is fine for TrueNAS / VMs and it’s not worth tuning up until I see errors again.

What do you think?


I think that 2667 is probably the safest bet. I hope you have zero errors from here on.

I’d suggest letting it run for 10 complete passes and seeing what happens. And who knows, the errors may pop up early, which is actually a good thing. The sooner they show themselves, the faster you can pinpoint what’s going wrong and get it fixed.

I ran 9 passes of memtest 8.00 with no errors. After booting back into TrueNAS, I still see the ECC corrections periodically. Not sure what to do now :confused:

I’m also running the RAM at 2600 MHz for context.
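
Since the corrections only show up under the running OS, one way to keep an eye on them between changes is to read the EDAC counters directly. This is a minimal sketch, assuming the usual Linux EDAC sysfs layout under /sys/devices/system/edac/mc with per-controller and per-csrow ce_count files. That layout is an assumption about this system (the csrow entries in particular depend on kernel configuration), and the function name is hypothetical.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {relative sysfs path: corrected-error count} for each counter found.

    The default root is the conventional EDAC location and is an assumption;
    a missing directory yields an empty dict.
    """
    root = Path(edac_root)
    if not root.is_dir():
        return {}
    counts = {}
    # Per-controller totals first, then per-csrow totals where they exist.
    paths = sorted(root.glob("mc*/ce_count")) + sorted(root.glob("mc*/csrow*/ce_count"))
    for path in paths:
        try:
            counts[path.relative_to(root).as_posix()] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or non-numeric entry, skip it
    return counts


if __name__ == "__main__":
    for name, value in read_ce_counts().items():
        print(f"{name}: {value}")
```

Noting these numbers before and after each change (slot swap, speed change, BIOS update) gives a simple before/after comparison without waiting for the next log message.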