ECC Memory error - best thing to do?

I am getting the following error:

Memory DRAM ECC ErrorB1 Asserted Correctable ECC.

I am getting lots of them. Many per hour.
So I just spent two weeks running MemTest86 on my machine.

I tested each stick alone thoroughly (4 x 16GB DDR4-2400 ECC) - zero bit errors.
However, every time I insert multiple sticks (4 slots, all 64GB), and run MemTest it generates thousands of errors. It is usually at 1 or 2 address locations during each test - that is, during each MemTest test type (12 test types) only 1 or 2 memory addresses appear as throwing errors.

I suspect the motherboard BIOS has issues, not the memory sticks.
Upgraded to latest BIOS firmware but it did not fix it.

100% of all ECC errors found by MemTest were fixed, but should I accept that and use it as is, or should I replace the board or memory?

Thoughts or suggestions, anyone?

See signature for specific Mobo/RAM make/model.

You have done well troubleshooting so far, hit all the main points.

Now for the RAM testing I’m going to ask a few things:

  1. In the BIOS, what are the RAM settings (voltage, frequency, all of it)? These only matter if the testing below does not turn up anything.
  2. Time to get creative. You have no problems using each memory stick individually, yet using them all together results in failure.
  3. Did you record the address(es) that failed? You said it was fairly consistent. Keep track of this during your testing as it may point you to the suspect slot.
  4. Keep track of each RAM stick, know which one is going where. Very important.
  5. With all the sticks in, run your RAM test until it fails and write down the address that failed.
  6. Relocate your RAM sticks: swap the two sticks in the blue slots and rerun the test. Does it fail at the same address? No means it is likely one of those sticks or the slot. Let’s say the fault remains the same; now swap the white slot sticks and retest. Does the failed address change? You can see what I’m doing, right?
  7. If swapping the two blue and the two white didn’t change anything, one last thing for completeness: move the sticks A2->A1->B2->B1->A2 (rotate them all by one position) and rerun the test. If it still fails at the same location then odds are it is not the RAM. Unfortunately that means it could be motherboard/CPU.
  8. Last RAM test… Install two sticks into the BLUE slots only and run the test. Does it fail or pass? Then put the same two sticks into the WHITE slots and retest. The purpose of this is to see if you can isolate it to a slot pair. If you can do that, next install 3 sticks (not recommended for operation; if it runs, great, but it may not boot at all).
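
The swap logic in the steps above boils down to one question: does the fault stay with a slot, or does it follow a stick? Here is a minimal sketch of that reasoning. It assumes you have already traced each failing address back to a slot, which is not always possible; the function name and the run format are made up for illustration.

```python
def diagnose(runs):
    """Classify a fault from several test runs.

    Each run is (layout, failing_slot): layout maps slot name -> stick label,
    failing_slot is the slot the failing address was traced to (an assumption:
    this sketch does not do the address-to-slot mapping itself).
    """
    if len(runs) < 2:
        return "need more runs"
    failing_slots = {slot for _, slot in runs}
    failing_sticks = {layout[slot] for layout, slot in runs}
    if len(failing_slots) == 1 and len(failing_sticks) > 1:
        return "fault stays with the slot: suspect slot/motherboard/CPU"
    if len(failing_sticks) == 1 and len(failing_slots) > 1:
        return "fault follows the stick: suspect that stick"
    return "inconclusive"


# Hypothetical example: sticks 1 and 3 were swapped between runs,
# and the error stayed in slot B1 both times.
before = {"A1": 1, "A2": 2, "B1": 3, "B2": 4}
after = {"A1": 3, "A2": 2, "B1": 1, "B2": 4}
print(diagnose([(before, "B1"), (after, "B1")]))
```

If the errors wander between slots and sticks from run to run, the sketch reports "inconclusive", which is itself a hint that something shared (board, CPU, power, settings) is involved.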

That is all I got for now. If none of that works then the BIOS data may be important.

Good luck.

Is it always this exact error? Because that error tells you which DIMM–or DIMM socket–has the error. If it’s always the same one, that’s going to cut down on your troubleshooting.

The most important step is to use canned air to blow out the slots :wink:

BUT yes, you absolutely need to resolve this issue. And this is why you invest in an ECC capable system.

According to the MemTest Reports, it appears to always be the same error: “[ECC Error]”, however, they are always successfully corrected…But it just gives me a bad feeling that there are so many.
The error above is logged in TrueNAS and I can’t recall if it is always “B1”, plus I am not certain that that actually means the physical mobo slot B1 (this has A1, A2, B1, B2), but in MemTest, errors do not appear to be constrained to one slot or module. Sometimes it is Ch/Sl 0-0, sometimes 0-1, sometimes 1-0 and other times 1-1. It can change between tests.
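
To answer "is it always B1?" without relying on memory, the logged messages can be tallied by the label that follows "ECC Error". A rough sketch: only the first sample line is the real message from this thread; the other two are invented to show the counting, and the regex assumes the label is always one letter plus one digit. Whether that label matches the physical slot printed on the board is a separate question.

```python
import re
from collections import Counter

# Assumption: the slot label is a letter A-D followed by one digit,
# sitting between "ECC Error" and "Asserted" as in the message quoted above.
SLOT_RE = re.compile(r"ECC Error\s*([A-D]\d)\s+Asserted")


def tally_slots(lines):
    """Count how often each slot label appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = SLOT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Memory DRAM ECC ErrorB1 Asserted Correctable ECC.",  # real line from the thread
    "Memory DRAM ECC ErrorB1 Asserted Correctable ECC.",  # invented for illustration
    "Memory DRAM ECC ErrorA2 Asserted Correctable ECC.",  # invented for illustration
]
print(tally_slots(sample))
```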

I ran 48 tests.
4 full (thorough) tests - module 1 in slot 1, module 2 in slot 2, module 3 in slot 3, module 4 in slot 4 - over 15 hours per test, to test modules and slots - zero bit errors.
16 summary tests for each module in each slot (4x4), alone, to see if any module/slot combo was a problem - zero bit errors.
24 further summary tests with every permutation of all four modules in all four slots - in every permutation thousands of ECC errors.

I concluded no single module or single slot is a problem - it is only when all modules are inserted.

Yep, did that, used canned air. Even got my microscope out to visually check each slot the entire length. No issues.

So:

  1. this BIOS does not provide the ability to tweak and overclock etc. - no way to change voltages or timings - it is officially a server board. I have mentioned to ASRockRack that I suspect the cause may be voltage and/or timing settings but I cannot change them with the BIOS on this board. Crossing my fingers they respond a second time, though I gave them a huge amount of reading to do.
  2. I tested every permutation of 1 module and every permutation of 4 modules. I worked out that every permutation of 2 modules would be another 72 test regimes :sob: Even then, I only ran tests 1 and 2 from the full MemTest suite because I only wanted to check SPD detection and whether or not I was getting ECC errors (which always happened within seconds of commencing).
  3. MemTest records the last 5-10 errors reported in the HTML test report, and the logfile contains them as well. I don’t know if it would record every single address of the errors because the logfile could become astronomical in size. It does stop recording the same error at the same address after a while and says something like “more of the same” or “too many of the same error” (my paraphrase). The exact address that was erroring would change from test to test, but it would remain the same throughout a single test, though sometimes there were 2 addresses that were erroring throughout a given test instance.
  4. You betcha! Don’t worry - I physically labelled each stick with numbers 1-4 and tracked each serial number as I moved them from slot to slot. I also mapped each physical slot to the logical slots in MemTest. Meticulously recorded.
  5. ECC errors start within seconds, though the test continues until it has tested all available memory addresses.
  6. With the detailed reporting I kept I should be able to answer this by checking my results.
  7. I will analyse my results with this in mind and check the addresses on the reports.
  8. I admit this is a perfectly valid point, so I will do this with some preliminary permutations of 2 sticks.
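
For anyone wondering where the 16, 24 and 72 figures come from, they can be checked by enumerating every way of placing k labelled modules into the four slots. A small sketch; the slot names are the ones from this board, the helper name is made up.

```python
from itertools import combinations, permutations

MODULES = (1, 2, 3, 4)
SLOTS = ("A1", "A2", "B1", "B2")


def placements(k):
    """Every distinct way to put k of the 4 modules into k of the 4 slots."""
    return [
        dict(zip(slots, mods))              # slot name -> module label
        for mods in combinations(MODULES, k)   # which modules are used
        for slots in permutations(SLOTS, k)    # which slot each one goes in
    ]


for k in (1, 2, 3, 4):
    print(k, "module(s):", len(placements(k)), "layouts")
# 1 module(s): 16 layouts
# 2 module(s): 72 layouts
# 3 module(s): 96 layouts
# 4 module(s): 24 layouts
```

So a full two-module sweep really is 72 runs, and a full three-module sweep would be 96, which is why picking a handful of targeted layouts makes more sense than testing them all.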

I’ll see if I can upload my report here if anyone is interested in how I recorded the test results.

As I said, I really believe the issue to be BIOS related. One other point I didn’t mention is that MemTest has a mobo blacklist of boards that fail to run any of the parallel CPU memory tests - this board fails when running MemTest in Parallel CPU mode (which is really frustrating because it takes sooo much longer, and sometimes errors only show up when you’re able to stress the RAM by pushing it to its limits with parallel hammering). So all testing HAD to be done in single CPU mode.

Thanks for all the great questions and pointers.
I’ll report back…

None of that sounds good unfortunately.

You could replace the power supply to see if that has an effect. However, if the same address range is being reported regardless of the RAM locations and you cleaned out the slots, I’d have to conclude there is a problem with the motherboard or the RAM is incorrect. What is the exact part number of the RAM? (Take a few photos and toss them up here, as there could be many numbers listed.) Let’s verify the RAM is not the problem before going much further. Also, where did you get the motherboard from? If you purchased new or from a reputable seller, you might consider a return and tell them it is faulty.

WARNING … WARNING
Do not do these things if you have no idea what you are doing. Even what might seem to be a minor change to you could be a very significant change to the electronics.

I know you said there were no settings to tweak the RAM timing, but if you find a BIOS that does allow it, I’d first drop the speed down one level. If that fails, return the speed to normal and bump the voltage up 0.01 VDC.

Time for work, check back in 11 hours to see what has happened.

Hey there!

I guess the CPU is soldered to the motherboard…

Regarding the RAM tests:
You should try to run the tests with 3 modules only in various setups.
If there are no errors, the unused one is somehow weak…
But, it will be a long and painful test…
Do you have access to another DDR4 ECC RAM supporting system?
If yes, you should do a cross-check with the memory. (If your RAM is faulty in the other system too, then it is the RAM; if the other system’s RAM fails in your current system, it is most likely the MoBo.)
(Honestly, I would already have bought a new set of RAM from Aliexpress and forgotten about this. Mainly to buy 2x32GB instead of 4x16GB.)

That crossed my mind except the PSU output is big enough to power all 8 HDDs plus some, and I have removed all HDDs for all these memory tests (because the memory slots are impossible to get to unless all the drive cables are disconnected and out of the way). Not enough juice should not be a problem here…

Not sure how to determine which address ranges are in which slots. I parsed all MemTest reports with perl to create a CSV file of erroring addresses. Here is what I found:

| Channel-Slot | Address | Syndrome | Test | Log | Count |
|---|---|---|---|---|---|
| 1-0 | 170062AC0 | 0064 | 3 | MemTest86-20240520-001047.log | 211 |
| | | | 4 | MemTest86-20240520-001047.log | 289 |
| | 19428C00 | 0064 | 0 | MemTest86-20240523-131838.log | 8 |
| | | | 1 | MemTest86-20240523-131838.log | 89 |
| | 1B104000 | 0064 | 0 | MemTest86-20240517-220712.log | 4 |
| | | | 1 | MemTest86-20240517-220712.log | 35 |
| | 1BE20240 | 0064 | 0 | MemTest86-20240523-135049.log | 8 |
| | | | 1 | MemTest86-20240523-135049.log | 80 |
| | 1CA44D00 | 0064 | 0 | MemTest86-20240523-174729.log | 12 |
| | | | 1 | MemTest86-20240523-174729.log | 108 |
| | | | 2 | MemTest86-20240523-174729.log | 8 |
| | 1FFA6840 | 0064 | 0 | MemTest86-20240523-151128.log | 4 |
| | | | 1 | MemTest86-20240523-151128.log | 73 |
| | 24E86F640 | 0064 | 0 | MemTest86-20240523-121937.log | 9 |
| | | | 1 | MemTest86-20240523-121937.log | 93 |
| | | | 2 | MemTest86-20240523-121937.log | 35 |
| | | | 3 | MemTest86-20240523-121937.log | 7 |
| | 29FDE8B40 | 0064 | 0 | MemTest86-20240523-185722.log | 4 |
| | | | 1 | MemTest86-20240523-185722.log | 54 |
| | 2ADC6740 | 0064 | 1 | MemTest86-20240518-172317.log | 34 |
| | | | 2 | MemTest86-20240518-172317.log | 1 |
| | 35CDEC600 | 0064 | 0 | MemTest86-20240523-015251.log | 4 |
| | | | 1 | MemTest86-20240523-015251.log | 51 |
| | 37923940 | 0064 | 0 | MemTest86-20240523-024336.log | 8 |
| | | | 1 | MemTest86-20240523-024336.log | 76 |
| | 37C2AFC0 | 0064 | 0 | MemTest86-20240523-100615.log | 8 |
| | | | 1 | MemTest86-20240523-100615.log | 82 |
| | 397035C0 | 0064 | 0 | MemTest86-20240523-201511.log | 8 |
| | | | 1 | MemTest86-20240523-201511.log | 65 |
| | 3A702AC0 | 0064 | 0 | MemTest86-20240523-105200.log | 12 |
| | | | 1 | MemTest86-20240523-105200.log | 81 |
| | 3BAC0900 | 0064 | 0 | MemTest86-20240523-093821.log | 8 |
| | | | 1 | MemTest86-20240523-093821.log | 81 |
| | 3BEE9C00 | 0064 | 0 | MemTest86-20240523-204116.log | 13 |
| | | | 1 | MemTest86-20240523-204116.log | 112 |
| | 3C028A40 | 0064 | 0 | MemTest86-20240523-125501.log | 8 |
| | | | 1 | MemTest86-20240523-125501.log | 80 |
| | 3C269240 | 0064 | 0 | MemTest86-20240515-154600.log | 9 |
| | | | 1 | MemTest86-20240515-154600.log | 46 |
| | 3C80AF40 | 0064 | 0 | MemTest86-20240523-005637.log | 8 |
| | | | 1 | MemTest86-20240523-005637.log | 72 |
| | 3CA2BB00 | 0064 | 0 | MemTest86-20240523-030920.log | 8 |
| | | | 1 | MemTest86-20240523-030920.log | 68 |
| | 3CA60E00 | 0064 | 0 | MemTest86-20240511-225019.log | 9 |
| | | | 1 | MemTest86-20240511-225019.log | 51 |
| | 3D86BD40 | 0064 | 0 | MemTest86-20240517-222533.log | 4 |
| | | | 1 | MemTest86-20240517-222533.log | 18 |
| | 3D90B900 | 0064 | 0 | MemTest86-20240517-224125.log | 4 |
| | | | 1 | MemTest86-20240517-224125.log | 35 |
| | 3DC0CA40 | 0064 | 0 | MemTest86-20240511-152937.log | 18 |
| | | | 1 | MemTest86-20240511-152937.log | 59 |
| | 3EDA3540 | 0064 | 0 | MemTest86-20240511-175134.log | 9 |
| | | | 1 | MemTest86-20240511-175134.log | 50 |
| | 3F162A00 | 0064 | 0 | MemTest86-20240523-183239.log | 8 |
| | | | 1 | MemTest86-20240523-183239.log | 94 |
| | 4B06A9600 | 0064 | 3 | MemTest86-20240520-072710.log | 107 |
| | | | 4 | MemTest86-20240520-072710.log | 375 |
| | 4FA419C0 | 0064 | 1 | MemTest86-20240517-220319.log | 33 |
| | 5AB81D80 | 0064 | 0 | MemTest86-20240523-171915.log | 8 |
| | | | 1 | MemTest86-20240523-171915.log | 80 |
| 1-1 | 18353300 | 0085 | 0 | MemTest86-20240523-111759.log | 8 |
| | | | 1 | MemTest86-20240523-111759.log | 76 |
| | 33A8F0A00 | 0085 | 0 | MemTest86-20240523-153532.log | 8 |
| | | | 1 | MemTest86-20240523-153532.log | 68 |
| | 3B03DA80 | 0064 | 0 | MemTest86-20240523-141321.log | 8 |
| | | | 1 | MemTest86-20240523-141321.log | 72 |
| | 3C158080 | 0064 | 0 | MemTest86-20240519-085944.log | 4 |
| | | | 1 | MemTest86-20240519-085944.log | 37 |
| | 3C5397D80 | 0085 | 1 | MemTest86-20240523-114200.log | 17 |
| | 3C63F500 | 0064 | 0 | MemTest86-20240523-210730.log | 12 |
| | | | 1 | MemTest86-20240523-210730.log | 75 |
| | 3CA9EB00 | 0085 | 0 | MemTest86-20240518-172625.log | 4 |
| | | | 1 | MemTest86-20240518-172625.log | 35 |
| | | | 2 | MemTest86-20240518-172625.log | 51 |
| | | | 3 | MemTest86-20240518-172625.log | 298 |
| | | | 4 | MemTest86-20240518-172625.log | 112 |
| | 3CAD9640 | 0064 | 0 | MemTest86-20240523-163328.log | 12 |
| | | | 1 | MemTest86-20240523-163328.log | 82 |
| | 3D0BE680 | 0085 | 0 | MemTest86-20240515-162614.log | 9 |
| | | | 1 | MemTest86-20240515-162614.log | 44 |
| | | | 2 | MemTest86-20240515-162614.log | 45 |
| | | | 3 | MemTest86-20240515-162614.log | 353 |
| | | | 4 | MemTest86-20240515-162614.log | 49 |
| | 3DFFC200 | 0085 | 0 | MemTest86-20240515-152450.log | 9 |
| | | | 1 | MemTest86-20240515-152450.log | 46 |
| Total Result | | | | | 4570 |

The mapping between the Channel/Slot (above) and the Physical Slot is not clear to me and I haven’t been able to deduce it from the logs.
The above table does not indicate to me that there are specific addresses/locations that are the problem…and so neither do they indicate to me a problem with the memory itself.
Also waiting to find out on the PassMark MemTest86 forum what the Syndrome codes mean.
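
For anyone who wants to slice the table above differently, here is a rough Python sketch (not the perl I actually used). It takes a few rows copied from the table as whitespace-separated fields, where continuation rows leave out the repeated leading columns, fills those columns back in, and then totals the counts per channel-slot and per syndrome. It also checks how many different logs each address appears in. The function names are made up, and the sample is only an excerpt, not the full data.

```python
from collections import Counter, defaultdict

# Excerpt of the table: full rows have 6 fields, rows that repeat the
# channel-slot have 5, rows that also repeat address/syndrome have 3.
SAMPLE = """\
1-0 170062AC0 0064 3 MemTest86-20240520-001047.log 211
4 MemTest86-20240520-001047.log 289
19428C00 0064 0 MemTest86-20240523-131838.log 8
1 MemTest86-20240523-131838.log 89
1-1 18353300 0085 0 MemTest86-20240523-111759.log 8
1 MemTest86-20240523-111759.log 76
3B03DA80 0064 0 MemTest86-20240523-141321.log 8
1 MemTest86-20240523-141321.log 72
"""


def parse_rows(text):
    """Return (chslot, address, syndrome, test, log, count) tuples."""
    rows = []
    chslot = addr = synd = None
    for line in text.splitlines():
        f = line.split()
        # Skip blank lines, the header and the "Total Result" line.
        if len(f) not in (3, 5, 6) or not (f[-3].isdigit() and f[-1].isdigit()):
            continue
        if len(f) == 6:
            chslot, addr, synd = f[0], f[1], f[2]
        elif len(f) == 5:
            addr, synd = f[0], f[1]
        test, log, count = f[-3:]
        rows.append((chslot, addr, synd, int(test), log, int(count)))
    return rows


def totals_by(rows, column):
    """Sum the error counts grouped by the given column index."""
    totals = Counter()
    for row in rows:
        totals[row[column]] += row[5]
    return totals


def logs_per_address(rows):
    """How many different log files each address shows up in."""
    seen = defaultdict(set)
    for _chslot, addr, _synd, _test, log, _count in rows:
        seen[addr].add(log)
    return {addr: len(logs) for addr, logs in seen.items()}


rows = parse_rows(SAMPLE)
print(totals_by(rows, 0))      # per channel-slot
print(totals_by(rows, 2))      # per syndrome
print(logs_per_address(rows))  # 1 everywhere means no address repeats across runs
```

In this excerpt every address shows up in exactly one log, which is the pattern I was describing: the failing address is stable within a run but different from run to run.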

Samsung M393A2K43BB1-CRC

wisp.net.au
But it’s technically out of the 12-month warranty…it took ages to get the RAM and start building this box…my own fault.

Already did that. The board and RAM are DDR4-2400. BIOS lets me choose from 2400, 2133, 1866, 1600. I tried 2133, then 1866. Made no difference - still ECC errors when multiple sticks are in.

I am suspicious this might be the issue - not enough voltage to RAM. But this is not possible. There is no BIOS setting that allows this :frowning_face:

Yep.

No.

This is my suspicion - and this mobo was hard to find, and it will be very hard to replace. 13x SATA ports + 4x ECC DDR4 slots + ITX size seems to be a rare combination of specs. I got one of the last ones.
:grimacing:
I am kicking off another test, this time with two sticks inserted into the WHITE slots.
I’ll see how they go…

WHITE slots under test…been running for 20 mins now with 2 sticks in the WHITE slots only and zero errors so far. This is the first positive sign I’ve had in the last 2 weeks. When all 4 sticks were in, ECC errors were literally popping up within seconds of commencing the test run.

It may turn out that I can only use 2 of the 4 slots :cry:
Damn, half the RAM…but that will be 32GB which should be enough to get on with using this box.

…and so to bed…

Let’s say only the white slots are working perfectly once you are done testing. Next is to place an additional stick (3rd stick) into the blue slot A1. Run your testing again. If it passes then fantastic, you have a little more RAM to work with. If it fails, then move the stick to the other blue slot and try again.

What is the problem doing it this way if you can run 3 sticks? Well, it is not supported. Since it is not supported I would run MemTest86 on it for many days, just to give yourself a good feeling that the system is stable. But wait! Stable, you say? Run a CPU stress test at some point with the RAM in its final configuration, to make sure it really is stable.

I suspect you either have a faulty slot connector or the cpu is not soldered to the board properly. It happens unfortunately. For this reason, if you can get it replaced for free, I’d do it.

Has this board always had this problem?

Another cause can be if you have multiple sets of memory devices on each DIMM (I believe this is referred to as rank), i.e. each stick is sort of two DIMMs in one.

Then you’re effectively trying to use 4 DIMMs per channel in something only rated for two DIMMs per channel.

This comes back to the QVL for the board.

Next day…

MemTest86 Pass 1 done. 12 tests each pass.
Up to Test 8 of 12 in Pass 2.
8 hrs 49 minutes and counting.

Zero errors.

Since I finished building it, yes, I believe so. I haven’t started using it yet because of these issues.

All I know is that the specification states it can take “DDR4 1866/2133/2400 Dual-channel Max. 128 GB UDIMM, Non-ECC, ECC DIMM, RDIMM”, and it has four physical slots. Memory channels are a motherboard attribute, not a memory module attribute (at least, up to and including DDR4). That is to say that the number of channels is not determined by the kind of sticks that go in, it is determined by the number of paths between the CPU and the memory slots on the board.

I cannot see why they would put four slots on the board and make it so only two can be used at a time…that seems ridiculous to me, and it would be overengineering (they would not allow it in the design budget).

Am I missing something obvious?

The white and blue slot pairs are both the same channel.

If you connect two dimms you have two dimms on the same channel.

There are electrical tolerances at play, and it’s possible for a marginally compatible device to not be compatible when there are two on the same channel.

Not all dimms that “meet the specifications” will work in all motherboards with said specifications.

I’m not sure how long you’ve had this board in your possession, but it appears to be incompatible with your memory when used 2 dimms per channel.

Maybe that was why it was returned and was thus open box in the first place?

Australian Consumer Law may be on your side in getting a return here. Assuming you are located in Australia.

Meanwhile, it’s throwing so many errors, it’s only a matter of time until you have two errors in the same location and thus can’t be corrected.
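
To show why two errors in the same word are a different beast from one, here is a toy single-error-correct, double-error-detect (SECDED) code in Python. It is an extended Hamming code over 4 data bits, far smaller than what an ECC DIMM actually uses, and it is only meant to illustrate the principle, not how this board's memory controller works.

```python
DATA_POSITIONS = (3, 5, 6, 7)  # positions 1, 2, 4 hold Hamming parity, 0 holds overall parity


def encode(nibble):
    """Encode 4 data bits (0..15) as an 8-bit extended Hamming codeword."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        # Parity bit p covers every other position whose index has bit p set.
        for pos in range(1, 8):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    code[0] = sum(code[1:]) % 2  # overall parity makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if sum(code) % 2:
        # Odd number of flipped bits: treat it as a single error at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Parity looks even but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1011)
print(decode(word))              # (11, 'ok')
print(decode(flip(word, 5)))     # (11, 'corrected')
print(decode(flip(word, 5, 6)))  # (None, 'uncorrectable')
```

One flipped bit is silently repaired; two flipped bits in the same word can only be detected. That second case is the one to worry about with thousands of correctable errors piling up.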

What I deduced by inserting one module into one physical slot at a time:


SPD # does not map to physical slots, but rather to modules where the SPD EEPROM is successfully read during boot. An SPD EEPROM chip on a module may not be successfully read, but the module is still available as memory.
Note the BLUE and WHITE boxes around each BANK.

The documentation clearly states this is a dual channel motherboard. So, are you saying that the two BLUE slots are one channel, and the two WHITE slots are the other channel? Or do you mean there is an A channel (A1 & A2) and a B channel (B1 & B2)?

True. I have been wondering if this is what is happening.

I am the first owner AFAIK. I didn’t mean to intimate that. What I meant was that it took me over a year to complete my build and start testing…this is my own fault for being indecisive and disorganized.

Yes I’m here, but I don’t think it will help in this case.
:frowning_face:
It’s been under test now for 10 hrs 50 minutes with modules inserted in both WHITE slots (A2 and B2). I expect no errors for the rest of this test, so I think I’ll cut it short and swap to the BLUE slots (A1 and B1). If that appears to be going ok for a while, then I’ll insert a third module into one WHITE slot to see if I get errors. I’ll try all 3 slot population combinations. I’ll also try both A slots and both B slots, then I’ll know I have tried two modules in one channel as well as one module in both channels.
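
The plan above can be written out as a short checklist generator. A sketch in Python: the colour mapping (WHITE = A2/B2, BLUE = A1/B1) is the one from this board, but treating the slot letter as the channel is an assumption that has not been confirmed in this thread.

```python
from itertools import combinations

# Slot colours as described above. Assumption: the letter in the slot
# name (A or B) is the memory channel.
SLOTS = {"A1": "blue", "A2": "white", "B1": "blue", "B2": "white"}


def describe(pair):
    """Say what a two-slot population is testing."""
    a, b = pair
    if a[0] == b[0]:
        return f"both slots of channel {a[0]}"
    if SLOTS[a] == SLOTS[b]:
        return f"both {SLOTS[a]} slots (one per channel)"
    return "mixed colours, one per channel"


def two_stick_plan():
    """All six ways to populate two of the four slots."""
    return [(pair, describe(pair)) for pair in combinations(sorted(SLOTS), 2)]


def three_stick_plan():
    """All four ways to populate three slots, i.e. leave one empty."""
    names = sorted(SLOTS)
    return [tuple(s for s in names if s != empty) for empty in names]


for pair, note in two_stick_plan():
    print(pair, "->", note)
for triple in three_stick_plan():
    print(triple)
```

That gives six two-slot populations and four three-slot populations to tick off, before even considering which module goes where.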

Generally the colors indicate banks, and a combination of colors is a channel.

But not always :wink:

And your motherboard manual may state which slot/colour to use first. It may even state exactly which to use with what combination of dimms.

When you put multiple dimms in the same channel it’s as if they are concatenated together.

Thus, if you have the same amount of memory in both channels, then you will get dual channel speed for all of your memory.

BUT if you have, say, 50% less in one channel then you will get dual channel only when accessing the parts of memory that have DIMMs in both channels.

Thus it’s better to install two DIMMs in two channels than one per bank in the same channel.
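
As a back-of-the-envelope illustration of that, here is a tiny Python sketch. It uses a simplified model where the memory that exists in both channels runs dual channel and the leftover runs single channel; real memory controllers interleave in more complicated ways, so treat the numbers as a rough guide only.

```python
def channel_split(chan_a_gb, chan_b_gb):
    """Return (dual_channel_gb, single_channel_gb) for the given channel sizes.

    Simplified model: the capacity matched across both channels is accessed
    in dual-channel mode, the unmatched remainder in single-channel mode.
    """
    dual = 2 * min(chan_a_gb, chan_b_gb)
    single = abs(chan_a_gb - chan_b_gb)
    return dual, single


print(channel_split(32, 32))  # (64, 0)  four 16GB sticks, two per channel
print(channel_split(16, 16))  # (32, 0)  two sticks, one per channel
print(channel_split(32, 0))   # (0, 32)  two sticks, both in one channel
print(channel_split(32, 16))  # (32, 16) three sticks
```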

Throwing something out left-field that I encountered with my SuperMicro X10 board a month or so back when I was testing things out - if I enabled what in the manual is called “memory scrambler” which says “This feature enables or disables memory scrambler support for memory error correction.” I got ECC errors. As soon as I turned that off, I didn’t anymore. Do you have a setting like that in your BIOS?


“DDR4_A21” is obviously a typo in the above screenshot.

There is a difference between “not supported”, “not compatible” and “not validated”, and in this case “not allowed”.

I think point 1) above should actually be “should”, not “always need to”. Matching pairs is good practice… but in theory a memory controller should pick the lowest common settings to run the memory at.

  1. is correct.
  2. is probably correct
  3. is probably correct, but as implied, “NOT supported” doesn’t mean it won’t work, especially since they merely “suggest” not using 3 memory modules… rather than stating it will not work.