
A while back I got a dual Xeon server for cheap. It came with 16x 4GB DDR3 ECC registered modules. It worked fine for a while, but after some time it started crashing to a blank screen without explanation. By trial and error, I found that if I took out half the RAM it would work again, so I left it like that to keep it going.

 

More recently I got a Chinese X79 mobo. I put in 4 of those sticks and it was fine. Today I got around to swapping the other 4 sticks in. Running memtest, it crashed to a black screen, just like the other system! POST code 69. Nice. OK, one or more of these 4 sticks must be bad. I took out two of them: same crash. I moved the remaining two to other slots, just in case. Different POST code, 01, but still a blank screen and no boot. I put in the other two sticks of that 4, and it booted and ran memtest with no problem.

 

If I understand correctly, ECC is supposed to be able to detect and correct single bit errors, and may be able to detect but not correct more bit errors than that. What is the expected behaviour in the case ECC can't correct a detected error? Does it kill the system to prevent any further data corruption? I presume a server that doesn't run is preferable to one that is unstable and could produce corrupt data.
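
To make the "correct one bit, detect two" idea concrete, here is a minimal sketch in Python using a toy Hamming(8,4) SECDED code. It is only an illustration under stated assumptions: real ECC DIMMs typically protect 64 data bits with 8 check bits and do this in hardware, and the function names `encode` and `decode` are hypothetical, not taken from any tool mentioned in this thread. What a platform does once it sees the "uncorrectable" case is not modelled here.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1 bits).

    Index 0 holds the overall parity bit; indexes 1..7 hold a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits.
    # It is 0 for a valid codeword and equals the flipped position
    # when exactly one of positions 1..7 has been flipped.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: assume a single flipped bit and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped.
        # The error is detected, but its location cannot be determined.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

For example, `decode(encode(0b1011))` returns `("ok", 11)`. Flipping any one bit of the codeword before decoding gives `("corrected", 11)`, and flipping any two bits gives `("uncorrectable", None)`.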

 

BTW, the ECC registered ram runs HOT with memtest. I didn't measure the temperature, but it was painful to touch. Guess that won't help either... I also put in 2x 2GB DDR3 1333 non-ECC sticks I found in the electronic trash at work, and they passed memtest too. Those barely got warm. Does ECC registered ram use that much more power, and thus produce that much more heat?

Gaming system: R7 7800X3D, Asus ROG Strix B650E-F Gaming Wifi, Thermalright Phantom Spirit 120 SE ARGB, Corsair Vengeance 2x 32GB 6000C30, MSI Ventus 3x OC RTX 5070 Ti, MSI MPG A850G, Fractal Design North, Samsung 990 Pro 2TB, Alienware AW3225QF (32" 240 Hz OLED)
Productivity system: i9-7980XE, Asus X299 TUF mark 2, Noctua D15, 64GB ram (mixed), RTX 4070 FE, NZXT E850, GameMax Abyss, Samsung 980 Pro 2TB, iiyama ProLite XU2793QSU-B6 (27" 1440p 100 Hz)
Gaming laptop: Lenovo Legion 5, 5800H, RTX 3070, Kingston DDR4 3200C22 2x16GB 2Rx8, Kingston Fury Renegade 1TB + Crucial P1 1TB SSD, 165 Hz IPS 1080p G-Sync Compatible

https://linustechtips.com/topic/1087231-when-ecc-goes-bad/

The detection is in the ECC module (usually in the center of the module). It can correct a slight binary (soft) error but cannot correct a physical fault. If the module is defective, then it is broken. I assume that some of your modules are indeed bad. Try the modules in memtest ONE AT A TIME.

 

Ryzen 5700g @ 4.4ghz all cores | Asrock B550M Steel Legend | 3060 | 2x 16gb Micron E 2666 @ 4200mhz cl16 | 500gb WD SN750 | 12 TB HDD | Deepcool Gammax 400 w/ 2 delta 4000rpm push pull | Antec Neo Eco Zen 500w


48 minutes ago, porina said:

BTW, the ECC registered ram runs HOT with memtest. I didn't measure the temperature, but it was painful to touch.

Humans have a very low tolerance for heat.  35°C only feels lukewarm but anything above 40°C is already too painful to hold for more than a couple of seconds.  So properly measuring the temperature is the only way to know if it indeed runs hot. 

 

As @SupaKomputa says, test the RAM one stick at a time to find the bad one.  Also check if the manufacturer gives a lifetime warranty (many do) and get in touch with them if they do.


3 minutes ago, Captain Chaos said:

Also check if the manufacturer gives a lifetime warranty (many do) and get in touch with them if they do.

I don't think this server RAM has a lifetime warranty. Isn't DDR3 supposed to be EOL by now?

Just adding: all of your RAM is ex-server parts that have been through strenuous 24/7 operation, so some sticks will go bad sooner than new RAM would.


20 minutes ago, SupaKomputa said:

The detection is in the ECC module (usually in the center of the module). It can correct a slight binary (soft) error but cannot correct a physical fault. If the module is defective, then it is broken. I assume that some of your modules are indeed bad. Try the modules in memtest ONE AT A TIME.

Thanks for the reply but that doesn't give any new info I hadn't already put in my original post. I narrowed it down to two sticks and I don't feel a need to work out which one it is, since I use ram in pairs or multiples thereof.

 

The question remains, is it expected behaviour for the system to stop functioning with uncorrectable errors in ECC ram?

12 minutes ago, Captain Chaos said:

Humans have a very low tolerance for heat.  35°C only feels lukewarm but anything above 40°C is already too painful to hold for more than a couple of seconds.  So properly measuring the temperature is the only way to know if it indeed runs hot. 

There is a rule of thumb that if it is too hot to touch, it is probably over 50C. Of course, this will vary from person to person and isn't exact. I did get my IR camera out but the battery was flat and the ram had cooled down to a warm 35C by the time I could use it.

 

I don't care so much about the absolute temperature. I think these modules have a sensor, so with the right software I could probably read it from that anyway. I don't think that function was in memtest. Also, I don't know where the problem lies, but the SPD isn't reported in Windows, although memtest was able to extract it. So the software tools may not be able to read the ram temperature either... I haven't tried it yet.


7 hours ago, porina said:

The question remains, is it expected behaviour for the system to stop functioning with uncorrectable errors in ECC ram?

No. If an uncorrectable error caused a total system meltdown, that would be counterproductive, ain't it? Might as well have ordinary RAM.

 

Quote

Thanks for the reply but that doesn't give any new info I hadn't already put in my original post. I narrowed it down to two sticks and I don't feel a need to work out which one it is, since I use ram in pairs or multiples thereof.

Why? If only one module is broken, you don't need to replace the pair. Just find a similar module and it will work in whatever channel you have.


3 hours ago, SupaKomputa said:

No. If an uncorrectable error caused a total system meltdown, that would be counterproductive, ain't it? Might as well have ordinary RAM.

There will be correctable errors, but that's not the important thing here. If you have a workload that is important enough that errors are not tolerated, it makes sense that detected but uncorrected errors should prevent the system from continuing its work. How that is done is my question. Does it essentially just shut down?

 

3 hours ago, SupaKomputa said:

Why? If only one module is broken, you don't need to replace the pair. Just find a similar module and it will work in whatever channel you have.

I have a surplus of ECC ram and don't intend to replace the bad sticks.


Bad sticks of ram are causing your issues, not ram neglecting to correct errors. Those are two separate issues. Bad sticks are just bad sticks; you can't expect them to fix themselves.

Black Knight-

Ryzen 5 5600, GIGABYTE B550M DS3H, 16Gb Corsair Vengeance LPX 3000mhz, Asrock RX 6800 XT Phantom Gaming,

Seasonic Focus GM 750, Samsung EVO 860 EVO SSD M.2, Intel 660p Series M.2 2280 1TB PCIe NVMe, Linux Mint 20.2 Cinnamon

 

Daughter's Rig;

MSI B450 A Pro, Ryzen 5 3600x, 16GB Corsair Vengeance LPX 3000mhz, Silicon Power A55 512GB SSD, Gigabyte RX 5700 Gaming OC, Corsair CX430

