When ECC goes bad

porina · July 27, 2019

A while back I got a dual Xeon server for cheap. It came with 16x 4GB DDR3 ECC registered modules. It worked fine for a while, but after some time it started crashing to a blank screen without explanation. By trial and error, I found if I took out half the ram it would work again so I left it like that to keep it going.

More recently I got a Chinese X79 mobo. I put in 4 of those sticks and it was fine. Today I got around to swapping the other 4 sticks in. Running memtest, it crashed to a black screen, just like the other system! POST code 69. Nice. OK, one or more of these 4 sticks must be bad. I took out two of them: same crash. Moved them to other slots, just in case: different POST code, 01, but still a blank screen and no boot. Put in the other two sticks of that 4, and it booted and ran memtest with no problem.

If I understand correctly, ECC is supposed to be able to detect and correct single bit errors, and may be able to detect but not correct more bit errors than that. What is the expected behaviour in the case ECC can't correct a detected error? Does it kill the system to prevent any further data corruption? I presume a server that doesn't run is preferable to one that is unstable and could produce corrupt data.
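To make the "correct one bit, detect two" behaviour concrete, here is a minimal sketch of a SECDED (single error correct, double error detect) code. It is a toy: it protects 4 data bits with an extended Hamming (8,4) code, and the `encode` and `decode` names are made up for illustration. ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code used by any given memory controller is not something this thread establishes.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid Hamming word,
    # and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(word)
    two_flips[2] ^= 1
    two_flips[6] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

The point of the sketch is the third case: with two flipped bits the decoder knows the word is wrong but cannot tell which bits to fix, so all it can do is report the error and leave the decision about what happens next to whatever sits above it.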

BTW, the ECC registered ram runs HOT with memtest. I didn't measure the temperature, but it was painful to touch. Guess that won't help either... I also put in 2x 2GB DDR3 1333 non-ECC sticks I found in the electronic trash at work, and they passed memtest too. Those barely got warm. Does ECC registered use that much more power, and thus produce that much more heat?

SupaKomputa · July 27, 2019

The detection is in the ECC module (usually in the center of the module). It can correct a slight binary error (software) but cannot correct a physical error. If the module is defective, then it is broken. I can assume that some of your modules are indeed bad. Try the modules in memtest ONE AT A TIME.

dfsdfgfkjsefoiqzemnd · July 27, 2019

48 minutes ago, porina said:

BTW, the ECC registered ram runs HOT with memtest. I didn't measure the temperature, but it was painful to touch.

Humans have a very low tolerance for heat. 35°C only feels lukewarm but anything above 40°C is already too painful to hold for more than a couple of seconds. So properly measuring the temperature is the only way to know if it indeed runs hot.

As @SupaKomputa says, test the RAM one stick at a time to find the bad one. Also check if the manufacturer gives a lifetime warranty (many do) and get in touch with them if they do.

SupaKomputa · July 27, 2019

3 minutes ago, Captain Chaos said:

Also check if the manufacturer gives a lifetime warranty (many do) and get in touch with them if they do.

I don't think this server RAM has a lifetime warranty. Isn't DDR3 supposed to be EOL by now?

Just adding: all of your RAM is ex-server parts that have been through strenuous 24/7 operation, so some of it should go bad sooner than new RAM would.

porina · July 27, 2019

20 minutes ago, SupaKomputa said:

The detection is in the ECC module (usually in the center of the module). It can correct a slight binary error (software) but cannot correct a physical error. If the module is defective, then it is broken. I can assume that some of your modules are indeed bad. Try the modules in memtest ONE AT A TIME.

Thanks for the reply but that doesn't give any new info I hadn't already put in my original post. I narrowed it down to two sticks and I don't feel a need to work out which one it is, since I use ram in pairs or multiples thereof.

The question remains, is it expected behaviour for the system to stop functioning with uncorrectable errors in ECC ram?

12 minutes ago, Captain Chaos said:

Humans have a very low tolerance for heat. 35°C only feels lukewarm but anything above 40°C is already too painful to hold for more than a couple of seconds. So properly measuring the temperature is the only way to know if it indeed runs hot.

There is a rule of thumb that if it is too hot to touch, it is probably over 50°C. Of course, this will vary from person to person and isn't exact. I did get my IR camera out but the battery was flat, and the ram had cooled down to a warm 35°C by the time I could use it.

I don't care so much about the absolute temperature. I think these modules have a sensor so with the right software I could probably read it from that anyway. I don't think that function was in memtest. Oh, I don't know where the problem lies, but the SPD isn't reported in Windows although memtest was able to extract it. So maybe the software tools may not be able to read the ram temperature either... I haven't tried it yet.

SupaKomputa · July 28, 2019

7 hours ago, porina said:

The question remains, is it expected behaviour for the system to stop functioning with uncorrectable errors in ECC ram?

No. If an uncorrectable error would give a total system meltdown, that's counterproductive, ain't it? Might as well have ordinary RAM.

Quote

Thanks for the reply but that doesn't give any new info I hadn't already put in my original post. I narrowed it down to two sticks and I don't feel a need to work out which one it is, since I use ram in pairs or multiples thereof.

Why? If only one module is broken, you don't need to replace the pair. Just find a similar module and it will work in whatever channel you have.

porina · July 28, 2019

3 hours ago, SupaKomputa said:

No. If an uncorrectable error would give a total system meltdown, that's counterproductive, ain't it? Might as well have ordinary RAM.

There will be correctable errors, that's not the important thing here. If you have a workload that is important enough that errors are not tolerated, it makes sense that detected but uncorrected errors should prevent the system from continuing its work. How that is done is my question. Does it essentially just shut down?
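One way to see how a running system accounts for the two kinds of error is to look at its counters. The sketch below assumes a Linux host with an EDAC driver loaded for the memory controller, which is not the setup described in this thread (memtest and Windows), so treat it as illustrative only. It reads the `ce_count` (corrected) and `ue_count` (uncorrected) files that EDAC exposes per memory controller; the `read_edac_counts` helper name and its `root` parameter are made up for the example. As far as I know, what actually happens on an uncorrected error (halt, reboot, or kill the affected process) is a firmware and OS policy decision, not something the DIMM decides.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs.

    Returns an empty dict when the directory is missing, for example when
    no EDAC driver is loaded or the platform is not Linux.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A rising corrected count with the system still running would be ECC doing its job; a machine that dies before it can log anything, as described above, will not show up here at all.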

3 hours ago, SupaKomputa said:

Why? If only one module is broken, you don't need to replace the pair. Just find a similar module and it will work in whatever channel you have.

I have a surplus of ECC ram and don't intend to replace.

asand1 · July 28, 2019

Bad sticks of ram are causing your issues, not ram neglecting to correct errors. Those are two separate issues. Bad sticks are just bad sticks; you can't expect them to fix themselves.
