ECC Memory error - best thing to do?

nas · May 28, 2024, 12:01am

I am getting the following error:

Memory DRAM ECC ErrorB1 Asserted Correctable ECC.

I am getting lots of them. Many per hour.
So I just spent two weeks running MemTest86 on my machine.

I tested each stick alone thoroughly (4 x 16GB DDR4-2400 ECC) - zero bit errors.
However, every time I insert multiple sticks (4 slots, all 64GB), and run MemTest it generates thousands of errors. It is usually at 1 or 2 address locations during each test - that is, during each MemTest test type (12 test types) only 1 or 2 memory addresses appear as throwing errors.

I suspect the motherboard BIOS has issues, not the memory sticks.
Upgraded to latest BIOS firmware but it did not fix it.

100% of all ECC errors found by MemTest were fixed, but should I accept that and use it as is, or should I replace the board or memory?

Thoughts or suggestions, anyone?

See signature for specific Mobo/RAM make/model.

joeschmuck · May 28, 2024, 1:33am

You have done well troubleshooting so far, hit all the main points.

Now for the RAM testing I’m going to ask a few things:

In the BIOS, what are the RAM settings (voltage, frequency, all of it)? That only becomes important if the testing below doesn’t turn something up.
Time to become creative. You have no problems using each memory stick individually, yet using them all together results in failure.
Did you record the address(es) that failed? You said it was fairly consistent. Keep track of this during your testing as it may point you to the suspect slot.
Keep track of each RAM stick, know which one is going where. Very important.
With all the sticks in, run your RAM test until it fails, and write down the address that failed.
Relocate your RAM sticks: swap the two sticks in the blue slots and rerun the test. Does it fail at the same address? No means it is likely one of those sticks or the slot. Let’s say the fault remains the same; now swap the white slot sticks and retest. Does the failed address change? You can see what I’m doing, right?
If swapping the two blue and the two white sticks didn’t change anything, one last thing for completeness: move the sticks A2->A1->B2->B1->A2 (rotate them all by one position) and rerun the test. If it still fails at the same location, then odds are it is not the RAM. Unfortunately that means it could be motherboard/CPU.
Last RAM test… Install two sticks into the BLUE slots only and run the test. Does it fail or pass? Then put the same two sticks into the WHITE slots and retest. The purpose of this is to see if you can isolate it to a slot pair. If you can do that, next install 3 sticks (not recommended for operation; if it runs, great, but it may not boot at all).
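
The swap-and-rotate sequence above can be written down as a checklist before pulling any sticks. A minimal Python sketch, assuming the sticks are labelled 1-4, that A1/B1 are the blue slots and A2/B2 the white ones (as described later in the thread), and that each step is applied on top of the previous one, which is only one reading of the steps. The starting layout and function names are made up for illustration.

```python
# Assumed starting layout: slot name -> stick label (hypothetical).
START = {"A1": 1, "A2": 2, "B1": 3, "B2": 4}


def swap(layout, slot_x, slot_y):
    """Return a new layout with the sticks in two slots exchanged."""
    new = dict(layout)
    new[slot_x], new[slot_y] = layout[slot_y], layout[slot_x]
    return new


def rotate(layout, order=("A2", "A1", "B2", "B1")):
    """Move every stick one position along A2->A1->B2->B1->A2."""
    new = dict(layout)
    for i, src in enumerate(order):
        dst = order[(i + 1) % len(order)]
        new[dst] = layout[src]
    return new


def test_plan(start):
    """Yield (description, layout) for each step, applied cumulatively."""
    layout = dict(start)
    yield "baseline, all four sticks", layout
    layout = swap(layout, "A1", "B1")
    yield "swap the two blue-slot sticks", layout
    layout = swap(layout, "A2", "B2")
    yield "swap the two white-slot sticks", layout
    layout = rotate(layout)
    yield "rotate all sticks by one position", layout


for step, layout in test_plan(START):
    print(f"{step}: {layout}")
```

Recording the failing address next to each printed layout makes it easier to see whether the fault follows a stick or stays with a slot.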

That is all I got for now. If none of that works then the BIOS data may be important.

Good luck.

dan · May 28, 2024, 1:41am

Is it always this exact error? Because that error tells you which DIMM–or DIMM socket–has the error. If it’s always the same one, that’s going to cut down on your troubleshooting.

Stux · May 28, 2024, 2:16am

The most important step is to use canned air to blow out the slots.

BUT yes, you absolutely need to resolve this issue. And this is why you invest in an ECC capable system.

nas · May 28, 2024, 6:23am

According to the MemTest Reports, it appears to always be the same error: “[ECC Error]”, however, they are always successfully corrected…But it just gives me a bad feeling that there are so many.
The error above is logged in TrueNAS and I can’t recall if it is always “B1”, plus I am not certain that that actually means the physical mobo slot B1 (this has A1, A2, B1, B2), but in MemTest, errors do not appear to be constrained to one slot or module. Sometimes it is Ch/Sl 0-0, sometimes 0-1, sometimes 1-0 and other times 1-1. It can change between tests.

I ran 48 tests.
4 full (thorough) tests - module 1 in slot 1, module 2 in slot 2, module 3 in slot 3, module 4 in slot 4 - over 15 hours per test, to test modules and slots - zero bit errors.
16 summary tests for each module in each slot (4x4), alone, to see if any module/slot combo was a problem - zero bit errors.
24 further summary tests with every permutation of all four modules in all four slots - in every permutation thousands of ECC errors.

I concluded no single module or single slot is a problem - it is only when all modules are inserted.
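
The summary-test counts follow from simple counting: one module in one of four slots gives 4 x 4 = 16 combinations, and four modules across four slots gives 4! = 24. A minimal Python sketch of that arithmetic, where the slot names are the board's and the module numbers 1-4 are just labels:

```python
from itertools import combinations, permutations

MODULES = [1, 2, 3, 4]
SLOTS = ["A1", "A2", "B1", "B2"]


def arrangements(n_modules):
    """Count the ways to place n_modules distinct modules into the four slots."""
    return sum(
        1
        for _slots in combinations(SLOTS, n_modules)     # which slots are used
        for _mods in permutations(MODULES, n_modules)    # which module goes where
    )


print(arrangements(1))  # 16
print(arrangements(4))  # 24
```

The same function gives 72 arrangements for two modules and 96 for three, which shows how quickly exhaustive testing of partial populations grows.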

nas · May 28, 2024, 6:28am

Yep, did that, used canned air. Even got my microscope out to visually check each slot the entire length. No issues.

nas · May 28, 2024, 6:50am

So:

this BIOS does not provide the ability to tweak and overclock etc. - no way to change voltages or timings - it is officially a server board. I have mentioned to ASRockRack that I suspect the cause may be voltage and/or timing settings but I cannot change them with the BIOS on this board. Crossing my fingers they respond a second time, though I gave them a huge amount of reading to do.
I tested every permutation of 1 module and every permutation of 4 modules. I worked out that every permutation of 2 modules would be another 72 test regimes. Even then, I only ran tests 1 and 2 from the full MemTest suite because I only wanted to check SPD detection and whether or not I was getting ECC errors (which always happened within seconds of commencing).
MemTest records the last 5-10 errors reported in the HTML test report, and the logfile contains them as well. I don’t know if it would record every single address of the errors because the logfile could become astronomical in size. It does stop recording the same error at the same address after a while and says something like “more of the same” or “too many of the same error” (my paraphrase). The exact address that was erroring would change from test to test, but it would remain the same throughout a single test, though sometimes there were 2 addresses that were erroring throughout a given test instance.
You betcha! Don’t worry - I physically labelled each stick with numbers 1-4 and tracked each serial number as I moved them from slot to slot. I also mapped each physical slot to the logical slots in MemTest. Meticulously recorded.
ECC errors start within seconds, though the test continues until it has tested all available memory addresses.
With the detailed reporting I kept I should be able to answer this by checking my results.
I will analyse my results with this in mind and check the addresses on the reports.
I admit this is a perfectly valid point, so I will do this with some preliminary permutations of 2 sticks.

I’ll see if I can upload my report here if anyone is interested in how I recorded the test results.

As I said, I really believe the issue to be BIOS related. One other point I didn’t mention is that MemTest has a mobo blacklist of boards that fail to run any of the parallel CPU memory tests - this board fails when running MemTest in Parallel CPU mode (which is really frustrating because it takes sooo much longer, and sometimes errors only show up when you’re able to stress the RAM by pushing it to its limits with parallel hammering). So all testing HAD to be done in single CPU mode.

Thanks for all the great questions and pointers.
I’ll report back…

joeschmuck · May 28, 2024, 9:38am

None of that sounds good unfortunately.

You could replace the power supply to see if that has an effect; however, if the same address range is being reported regardless of the RAM locations and you cleaned out the slots, I’d have to conclude there is a problem with the motherboard or the RAM is incorrect. What is the exact part number of the RAM? (Take a few photos and toss them up here, as there could be many numbers listed.) Let’s verify the RAM is not the problem before going much further. Also, where did you get the motherboard from? If you purchased new or from a reputable seller, you might consider a return and tell them it is faulty.

WARNING … WARNING
Do not do these things if you have no idea what you are doing. Even what might seem to be a minor change to you could be a very significant change to the electronics.

I know you said there were no settings to tweak the RAM timing, but if you find a BIOS that does allow it, I’d first drop the speed down one level. If that fails, return the speed to normal and bump the voltage up 0.01 VDC.

Time for work, check back in 11 hours to see what has happened.

Gyula_Masa · May 28, 2024, 11:36am

Hey there!

I guess the CPU is soldered to the motherboard…

Regarding the RAM tests:
You should try to run the tests with 3 modules only in various setups.
If there are no errors, the unused one is somehow weak…
But, it will be a long and painful test…
Do you have access to another DDR4 ECC RAM supporting system?
If yes, you should do a cross-check with the memories. (if the RAM is faulty in the other system too, then it is the RAM, if the other RAM fails in your current system, it is most likely the MoBo)
(honestly, I already would have bought a new set of RAM from Aliexpress and forgotten about this. Mainly to buy 2x32GB instead of 4x16GB)

nas · May 28, 2024, 2:44pm

That crossed my mind except the PSU output is big enough to power all 8 HDDs plus some, and I have removed all HDDs for all these memory tests (because the memory slots are impossible to get to unless all the drive cables are disconnected and out of the way). Not enough juice should not be a problem here…

Not sure how to determine which address ranges are in which slots. I parsed all MemTest reports with perl to create a CSV file of erroring addresses. Here is what I found:

Channel-Slot	Address	Syndrome	Test	Log	Count - Log
1-0	170062AC0	0064	3	MemTest86-20240520-001047.log	211
			4	MemTest86-20240520-001047.log	289
	19428C00	0064	0	MemTest86-20240523-131838.log	8
			1	MemTest86-20240523-131838.log	89
	1B104000	0064	0	MemTest86-20240517-220712.log	4
			1	MemTest86-20240517-220712.log	35
	1BE20240	0064	0	MemTest86-20240523-135049.log	8
			1	MemTest86-20240523-135049.log	80
	1CA44D00	0064	0	MemTest86-20240523-174729.log	12
			1	MemTest86-20240523-174729.log	108
			2	MemTest86-20240523-174729.log	8
	1FFA6840	0064	0	MemTest86-20240523-151128.log	4
			1	MemTest86-20240523-151128.log	73
	24E86F640	0064	0	MemTest86-20240523-121937.log	9
			1	MemTest86-20240523-121937.log	93
			2	MemTest86-20240523-121937.log	35
			3	MemTest86-20240523-121937.log	7
	29FDE8B40	0064	0	MemTest86-20240523-185722.log	4
			1	MemTest86-20240523-185722.log	54
	2ADC6740	0064	1	MemTest86-20240518-172317.log	34
			2	MemTest86-20240518-172317.log	1
	35CDEC600	0064	0	MemTest86-20240523-015251.log	4
			1	MemTest86-20240523-015251.log	51
	37923940	0064	0	MemTest86-20240523-024336.log	8
			1	MemTest86-20240523-024336.log	76
	37C2AFC0	0064	0	MemTest86-20240523-100615.log	8
			1	MemTest86-20240523-100615.log	82
	397035C0	0064	0	MemTest86-20240523-201511.log	8
			1	MemTest86-20240523-201511.log	65
	3A702AC0	0064	0	MemTest86-20240523-105200.log	12
			1	MemTest86-20240523-105200.log	81
	3BAC0900	0064	0	MemTest86-20240523-093821.log	8
			1	MemTest86-20240523-093821.log	81
	3BEE9C00	0064	0	MemTest86-20240523-204116.log	13
			1	MemTest86-20240523-204116.log	112
	3C028A40	0064	0	MemTest86-20240523-125501.log	8
			1	MemTest86-20240523-125501.log	80
	3C269240	0064	0	MemTest86-20240515-154600.log	9
			1	MemTest86-20240515-154600.log	46
	3C80AF40	0064	0	MemTest86-20240523-005637.log	8
			1	MemTest86-20240523-005637.log	72
	3CA2BB00	0064	0	MemTest86-20240523-030920.log	8
			1	MemTest86-20240523-030920.log	68
	3CA60E00	0064	0	MemTest86-20240511-225019.log	9
			1	MemTest86-20240511-225019.log	51
	3D86BD40	0064	0	MemTest86-20240517-222533.log	4
			1	MemTest86-20240517-222533.log	18
	3D90B900	0064	0	MemTest86-20240517-224125.log	4
			1	MemTest86-20240517-224125.log	35
	3DC0CA40	0064	0	MemTest86-20240511-152937.log	18
			1	MemTest86-20240511-152937.log	59
	3EDA3540	0064	0	MemTest86-20240511-175134.log	9
			1	MemTest86-20240511-175134.log	50
	3F162A00	0064	0	MemTest86-20240523-183239.log	8
			1	MemTest86-20240523-183239.log	94
	4B06A9600	0064	3	MemTest86-20240520-072710.log	107
			4	MemTest86-20240520-072710.log	375
	4FA419C0	0064	1	MemTest86-20240517-220319.log	33
	5AB81D80	0064	0	MemTest86-20240523-171915.log	8
			1	MemTest86-20240523-171915.log	80
1-1	18353300	0085	0	MemTest86-20240523-111759.log	8
			1	MemTest86-20240523-111759.log	76
	33A8F0A00	0085	0	MemTest86-20240523-153532.log	8
			1	MemTest86-20240523-153532.log	68
	3B03DA80	0064	0	MemTest86-20240523-141321.log	8
			1	MemTest86-20240523-141321.log	72
	3C158080	0064	0	MemTest86-20240519-085944.log	4
			1	MemTest86-20240519-085944.log	37
	3C5397D80	0085	1	MemTest86-20240523-114200.log	17
	3C63F500	0064	0	MemTest86-20240523-210730.log	12
			1	MemTest86-20240523-210730.log	75
	3CA9EB00	0085	0	MemTest86-20240518-172625.log	4
			1	MemTest86-20240518-172625.log	35
			2	MemTest86-20240518-172625.log	51
			3	MemTest86-20240518-172625.log	298
			4	MemTest86-20240518-172625.log	112
	3CAD9640	0064	0	MemTest86-20240523-163328.log	12
			1	MemTest86-20240523-163328.log	82
	3D0BE680	0085	0	MemTest86-20240515-162614.log	9
			1	MemTest86-20240515-162614.log	44
			2	MemTest86-20240515-162614.log	45
			3	MemTest86-20240515-162614.log	353
			4	MemTest86-20240515-162614.log	49
	3DFFC200	0085	0	MemTest86-20240515-152450.log	9
			1	MemTest86-20240515-152450.log	46
Total Result					4570

The mapping between the Channel/Slot (above) and the Physical Slot is not clear to me and I haven’t been able to deduce it from the logs.
The above table does not indicate to me that there are specific addresses/locations that are the problem…and so neither do they indicate to me a problem with the memory itself.
Also waiting to find out on the PassMark MemTest86 forum what the Syndrome codes mean.
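
The table above is in pivot-table form: a blank Channel-Slot, Address or Syndrome cell means "same as the row above". A minimal Python sketch, assuming tab-separated columns in that order, that fills the blanks back in and totals the counts per channel-slot or per syndrome. Only a few rows from the table are embedded as sample data; the function and field names are made up for illustration.

```python
from collections import Counter

# A few rows copied from the table; "\t" separates the columns.
SAMPLE = (
    "1-0\t170062AC0\t0064\t3\tMemTest86-20240520-001047.log\t211\n"
    "\t\t\t4\tMemTest86-20240520-001047.log\t289\n"
    "\t19428C00\t0064\t0\tMemTest86-20240523-131838.log\t8\n"
    "\t\t\t1\tMemTest86-20240523-131838.log\t89\n"
    "1-1\t18353300\t0085\t0\tMemTest86-20240523-111759.log\t8\n"
    "\t\t\t1\tMemTest86-20240523-111759.log\t76\n"
    "\t3B03DA80\t0064\t0\tMemTest86-20240523-141321.log\t8\n"
    "\t\t\t1\tMemTest86-20240523-141321.log\t72\n"
)


def parse_pivot(text):
    """Turn pivot-style rows into flat records, forward-filling blank cells."""
    rows = []
    last = ["", "", ""]  # last seen channel-slot, address, syndrome
    for line in text.splitlines():
        cells = line.split("\t")
        if len(cells) != 6:
            continue  # skip header, total and malformed lines
        filled = [cells[i] or last[i] for i in range(3)]
        last = filled
        rows.append({
            "channel_slot": filled[0],
            "address": filled[1],
            "syndrome": filled[2],
            "test": int(cells[3]),
            "log": cells[4],
            "count": int(cells[5]),
        })
    return rows


def totals(rows, key):
    """Sum the error counts grouped by one field."""
    out = Counter()
    for row in rows:
        out[row[key]] += row["count"]
    return out


rows = parse_pivot(SAMPLE)
print(totals(rows, "channel_slot"))
print(totals(rows, "syndrome"))
```

Grouping by "log" instead shows how many errors each test run produced, which can then be matched against the module layout recorded for that run.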

Samsung M393A2K43BB1-CRC

wisp.net.au
But it’s technically out of the 12-month warranty…it took ages to get the RAM and start building this box…my own fault.

Already did that. The board and RAM are DDR4-2400. BIOS lets me choose from 2400, 2133, 1866, 1600. I tried 2133, then 1866. Made no difference - still ECC errors when multiple sticks are in.

I am suspicious this might be the issue - not enough voltage to the RAM. But changing it is not possible: there is no BIOS setting that allows this.

Yep.

No.

This is my suspicion - and this mobo was hard to find, and it will be very hard to replace. 13x SATA ports + 4x ECC DDR4 slots + ITX size seems to be a rare combination of specs. I got one of the last ones.

I am kicking off another test, this time with two sticks inserted into the WHITE slots.
I’ll see how they go…

nas · May 28, 2024, 3:06pm

WHITE slots under test…been running for 20 mins now with 2 sticks in the WHITE slots only and zero errors so far. This is the first positive sign I’ve had in the last 2 weeks. When all 4 sticks were in, ECC errors were literally popping up within seconds of commencing the test run.

It may turn out that I can only use 2 of the 4 slots.
Damn, half the RAM…but that will be 32GB, which should be enough to get on with using this box.

…and so to bed…

joeschmuck · May 28, 2024, 8:32pm

Let’s say only the white slots are working perfectly once you are done testing. Next is to place an additional stick (3rd stick) into the blue slot A1. Run your testing again. If it passes then fantastic, you have a little more RAM to work with. If it fails, then move the stick to the other blue slot and try again.

What is the problem doing it this way if you can run 3 sticks? Well, it is not supported. Since it is not supported I would run Memtest86 on it for many days, just to give yourself a good feeling that the system is stable. But wait! Stable you say? Run a CPU Stress Test at some point in time with the RAM in its final configuration, make sure it really is stable.

I suspect you either have a faulty slot connector or the cpu is not soldered to the board properly. It happens unfortunately. For this reason, if you can get it replaced for free, I’d do it.

Stux · May 28, 2024, 10:53pm

Has this board always had this problem?

Another cause can be if you have multiple memory devices on your dimm (I believe this is referred to as rank). I.e. it’s sort of two dimms in one.

Then you’re effectively trying to use 4 dimms per channel in something only rated for two dimms per channel

This comes back to the QVL list for the board.

nas · May 28, 2024, 11:35pm

Next day…

MemTest86 Pass 1 done. 12 tests each pass.
Up to Test 8 of 12 in Pass 2.
8 hrs 49 minutes and counting.

Zero errors.

Since I finished building it, yes, I believe so. I haven’t started using it yet because of these issues.

All I know is that the specification states it can take “DDR4 1866/2133/2400 Dual-channel Max. 128 GB UDIMM, Non-ECC, ECC DIMM, RDIMM”, and it has four physical slots. Memory channels are a motherboard attribute, not a memory module attribute (at least, up to and including DDR4). That is to say that the number of channels is not determined by the kind of sticks that go in, it is determined by the number of paths between the CPU and the memory slots on the board.

I cannot see why they would put four slots on the board and make it so only two can be used at a time…that seems ridiculous to me, and it would be overengineering (they would not allow it in the design budget).

Am I missing something obvious?

Stux · May 28, 2024, 11:45pm

The white and blue slot pairs are both the same channel.

If you connect two dimms you have two dimms on the same channel.

There are electrical tolerances at play, and it’s possible for a marginally compatible device to not be compatible when there are two on the same channel.

Not all dimms that “meet the specifications” will work in all motherboards with said specifications.

I’m not sure how long you’ve had this board in your possession, but it appears to be incompatible with your memory when used 2 dimms per channel.

Maybe that was why it was returned and was thus open box in the first place?

Australian Consumer Law may be on your side in getting a return here. Assuming you are located in Australia.

Meanwhile, it’s throwing so many errors, it’s only a matter of time until you have two errors in the same location, which then can’t be corrected.

nas · May 29, 2024, 1:52am

What I deduced by inserting one module into one physical slot at a time:

SPD # does not map to physical slots, but rather to modules where the SPD EEPROM is successfully read during boot. An SPD EEPROM chip on a module may not be successfully read, but the module is still available as memory.
Note the BLUE and WHITE boxes around each BANK.

The documentation clearly states this is a dual channel motherboard. So, are you saying that the two BLUE slots are one channel, and the two WHITE slots are the other channel? Or do you mean there is an A channel (A1 & A2) and a B channel (B1 & B2)?

True. I have been wondering if this is what is happening.

I am the first owner AFAIK. I didn’t mean to intimate that. What I meant was that it took me over a year to complete my build and start testing…this is my own fault for being indecisive and disorganized.

Yes I’m here, but I don’t think it will help in this case.

It’s been under test now for 10 hrs 50 minutes with modules inserted in both WHITE slots (A2 and B2). I expect no errors for the rest of this test, so I think I’ll cut it short and swap to the BLUE slots (A1 and B1). If that appears to be going ok for a while, then I’ll insert a third module into one WHITE slot to see if I get errors. I’ll try all 3 slot population combinations. I’ll also try both A slots and both B slots, then I’ll know I have tried two modules in one channel as well as one module in both channels.
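
The population plan can be enumerated to see which layouts put two modules on one channel. A minimal Python sketch, assuming the slot letter is the channel (A1 and A2 on channel A, B1 and B2 on channel B); that mapping is not confirmed anywhere in the thread and should be checked against the board manual.

```python
from itertools import combinations

SLOTS = ["A1", "A2", "B1", "B2"]


def dimms_per_channel(populated):
    """Count populated slots per channel, taking the slot letter as the channel (assumption)."""
    counts = {"A": 0, "B": 0}
    for slot in populated:
        counts[slot[0]] += 1
    return counts


# List every 2-, 3- and 4-module population with its per-channel load.
for n in (2, 3, 4):
    for populated in combinations(SLOTS, n):
        dpc = dimms_per_channel(populated)
        label = "two on one channel" if max(dpc.values()) == 2 else "one per channel"
        print(populated, dpc, label)
```

Under that assumption the results so far line up with module count per channel: both WHITE slots (A2 and B2) is one module per channel and ran clean, while all four slots is two per channel and produced errors.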

Stux · May 29, 2024, 2:23am

Generally the colors indicate banks, and a combination of colors is a channel.

But not always

And your motherboard manual may state which slot/colour to use first. It may even state exactly which to use with what combination of dimms.

When you put multiple dimms in the same channel it’s as if they are concatenated together.

Thus, if you have the same amount of memory in both channels, then you will get dual channel speed for all of your memory.

BUT if you have say 50% less in one channel then you will get dual channel only when accessing the parts of memory that have dimms in both channels.

Thus it’s better to install two dimms in two channels than one per bank in the same channel.
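
As a rough illustration of the point about unequal channels, here is a minimal Python sketch under a simplified model: memory is interleaved across both channels up to the size of the smaller channel, and whatever is left over on the larger channel runs single-channel. Real memory controllers differ in the details, so treat the numbers as illustrative only; the function name is made up.

```python
def dual_channel_split(channel_a_gb, channel_b_gb):
    """Return (dual_gb, single_gb) under the simplified interleaving model."""
    dual = 2 * min(channel_a_gb, channel_b_gb)     # matched capacity on both channels
    single = abs(channel_a_gb - channel_b_gb)      # leftover on the larger channel
    return dual, single


# 16 GB sticks as in this thread.
print(dual_channel_split(16, 16))  # one stick per channel
print(dual_channel_split(32, 16))  # three sticks: two on A, one on B
print(dual_channel_split(32, 0))   # two sticks on the same channel
```

In this model, two sticks on the same channel get no dual-channel memory at all, while one stick in each channel makes all of it dual-channel.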

Vollans · May 29, 2024, 2:59am

Throwing something out left-field that I encountered with my SuperMicro X10 board a month or so back when I was testing things out - if I enabled what in the manual is called “memory scrambler” which says “This feature enables or disables memory scrambler support for memory error correction.” I got ECC errors. As soon as I turned that off, I didn’t anymore. Do you have a setting like that in your BIOS?

nas · May 29, 2024, 3:10am

“DDR4_A21” is obviously a typo in the above screenshot.

Stux · May 29, 2024, 3:17am

There is a difference between “not supported”, “not compatible” and “not validated”, and in this case “not allowed”

I think point 1) above should actually be “should” not “always need to”, matching pairs is good practise… but in theory a memory controller should pick the lowest common settings to run the memory at

2) is correct.
3) is probably correct.
4) is probably correct, but as implied, “NOT supported” doesn’t mean it won’t work, especially since they merely “suggest” not using 3 memory modules… rather than stating it will not work.