The dreaded : Is it the memory, the motherboard, or the CPU memory controller?

Blakestr

TLDR: 

-Likely memory problems corrupted file system

-Memory passed YCruncher/TM5/Memtest without errors

-Newly formatted system still encountering memory problems like "unhandled exception" crashes 

 

Why are the problems in a new windows install SO OBVIOUS but there aren't errors happening in testing??? 

 

So apparently everything changed in the last few years when it comes to RAM stability.  I built a new system and used it as a development rig without problems for the past 2 months, but in the last few weeks it has started crashing more.  A lot of "unhandled exception at address 0xxxxE183580" type errors.  In the last two days these became excessive, resulting in crashes in things like Discord and various games, both old and new.  All signs point to file system corruption, almost certainly due to memory problems.  After updating the BIOS with no change, I wiped the system and kept testing.


Specs:

i9 13900k
Asrock Taichi z790

Gskill 128 gb Ripjaw S5 DDR5, XMP 5200 * (ran at 4400 XMP)  (yes this is the crappy samsung die too) 
Asus 3090 TI with AIO 

Custom loop is cooling the i9, normal temps range from 40-60, a few cores hit 100 during stress tests 

I knew going into the build that getting 128gb to run at even stock speeds would be a challenge.  For my use case, I was okay with it running at a lower speed. Unfortunately I didn't get the memo about doing more testing than just memtest.  So when I built the system, after doing leak/thermal tests, I ran memtest overnight, saw zero errors and figured, since I wasn't trying to do any memory overclocking, it was going to be stable.  Clearly I was ignorant about what these newer generation memory corruptors, I mean memory controllers, are like.

After getting some advice, I found a github page for dd4 which helped with some testing info.  I ran Ycruncher on N64&VST for about 15 minutes per stick.  Initially I had one stick fail right off the bat.  But I didn't know enough about how the Windows file corruption might be affecting the testing, so I formatted completely and used a Windows USB stick that was made on a different computer.  Then I spent the time testing all the sticks.  I couldn't get any of them to fail at rated speeds: at XMP 5200 and XMP 4400 there were zero errors.  I put all of the RAM back in and tested, and found the errors went away at 4200.  But just to be on the safe side, I set it to XMP 3600 to see if I could get some clear stability.

I ran TM5 on the extreme777 config (confirmed btw) and it ran for 3 cycles overnight.  Zero errors.  (TM5 itself did crash, but I could still see the program counting and running, and when I asked, someone said that was just the program itself, not a sign of instability.)  I thought I was good, and that my previous settings were too high or had bad timings and that over the months of use this resulted in the corruption of the file system.
But shortly after reinstalling Windows, a fresh install, a program like DISCORD wouldn't load.  Maybe that's a fluke, but then the Epic Games Store launcher popped up.

Okay, it could STILL be the memory.  At this time I'm running another Ycruncher test.  I've heard things about using Prime95 with a slow something FLT or whatever it's called, but is there ANY other thing I can do to ensure it's not just a bad stick?  Currently I'm using just half the sticks, 32 x 2, in the A2/B2 slots because my motherboard manual recommends that.  I do note that the sticks were hot to the touch, almost uncomfortably.

And can someone recommend a video to learn more about how memory corruption affects a file system?  For example, I don't know if I were to toss some brand new RAM in there, would discord/epic still crash?  Again, these files were re-downloaded onto a fresh hard drive.  

I don't have a spare rig/test bench to swap out the components, so I need to make the best educated guess about what to replace.  I can order a new cpu/motherboard/ram, but then how do I know what to keep and what to get rid of?  I don't think I can just RMA all the components with "it's probably bad."  But thank you for reading.  Not that it matters but I am using this rig to build something that can help humanity, so there's that.

 

Intel offers CPU checker software; test the CPU with it.

Run 1 RAM stick (without XMP on) for a while, and see what happens

 

Report back

NOTE: I no longer frequent this site. If you really need help, PM/DM me and my e.mail will alert me. 

Passed.  Is this test good enough to detect problems with the CPU's memory controller? 

--- IPDT64 - Revision: 4.1.7.39
--- IPDT64 - Start Time: 4/26/2023 3:05:20 PM

----------------------------------------------
-- Testing
---------------------------------------------- 
CPU 1 - Genuine Intel - Pass.
CPU 1 - BrandString - Pass.
CPU 1 - Cache - Pass.
CPU 1 - MMXSSE - Pass.
CPU 1 - IMC - Pass.
CPU 1 - Prime Number - Pass.
CPU 1 - Floating Point - Pass.
CPU 1 - Math - Pass.
CPU 1 - GPUStressW - Pass.
CPU 1 - CPULoad - Pass.
CPU 1 - CPUFreq - Pass.
IPDT64 Passed
--- IPDT64 - Revision: 4.1.7.39
--- IPDT64 - End Time: 4/26/2023 3:09:08 PM

----------------------------------------------
PASS

 

8 minutes ago, Blakestr said:

And can someone recommend a video to learn more about how memory corruption affects a file system?  For example, I don't know if I were to toss some brand new RAM in there, would discord/epic still crash?  Again, these files were re-downloaded onto a fresh hard drive.

Not aware of any videos that cover it, but what can happen is that if your RAM is unstable, it can corrupt Windows files that those programs rely on while Windows is moving them around in the background. It's not an issue that would survive a fresh install with working RAM settings unless the drive itself was bad, at least in my experience.
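
To make the mechanism concrete, here is a minimal Python sketch (not a diagnostic tool, and the helper names `flip_bit` and `sha256_hex` are made up for illustration). It shows that a single flipped bit, the kind of damage unstable RAM can introduce while a file passes through memory, leaves the file the same size but no longer byte-identical, which a checksum comparison catches:

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    original = bytes(range(256)) * 16    # stand-in for a 4 KiB block of a file
    damaged = flip_bit(original, 12345)  # one flipped bit, simulated here
    print("same length:", len(original) == len(damaged))
    print("hashes match:", sha256_hex(original) == sha256_hex(damaged))
```

If that damaged copy is what gets written back to disk, the file stays broken even after the RAM settings are fixed, which is why a reinstall is needed afterwards.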

 

10 minutes ago, Blakestr said:

I've heard things about using Prime95 and use a slow something FLT or whatever it's called  but is there ANY other thing I can do to ensure it's not just a bad stick

It's Prime95 Large FFTs that people were referring to, though if each individual stick passes TM5 Extreme777 at stock speeds, I'd say that the sticks probably aren't the problem and it's more likely a motherboard or CPU issue. 

 

12 minutes ago, Blakestr said:

I do note that the sticks were hot to the touch, almost uncomfortably.  

That's not unusual with DDR5: because the PMIC is on board, the sticks run incredibly hot and can easily get over 50-60C depending on case airflow conditions. Most DDR5 is temperature sensitive, but it's rated to work at temps up to (IIRC) 80C at JEDEC settings, so since you're running below JEDEC and without custom subtimings, the temps aren't likely an issue.

 

 

You've done so much testing of the memory that it being a memory issue is rather unlikely at this point. The three things that I'd suspect being the issue are the CPU itself, the motherboard, and the SSD. If you have a spare SSD to load up Windows on and see if it starts working, that would be ideal. If you don't, they are really cheap so I'd still consider doing that (who doesn't need more storage for whatever reason?), but otherwise move on to the other steps. If it's the motherboard, there are occasionally stability issues with particular BIOS revisions, so trying a different one might help, though if it is the motherboard you usually have to RMA. If it's the CPU, overvolting it slightly should usually stop the weird issues. Try running some CPU stress tests and see if it crashes, and if it does, try adding a 50mV positive voltage offset to see if that gets it to work (you might want to test that offset no matter what, as it could be that it only has issues in light loads where a single thread is boosting to 5.8GHz).

I didn't think about the hard drive. Yeah I have two other NVME's, I can load a fresh install of windows on one and see.  (It was nice having an 850SN 4 TB as an OS drive, I'll tell you) - I'll install when I get home tonight and will report back. 

 

I'd be worried about applying voltage, not because of overheating but because the damn chip should be stable without doing anything to it.  It's watercooled and the temps are fine.

 

What do I report to RMA this board?  That's what I am leaning towards.  It's been a long time since I RMA'ed anything, and every time it was simple because "It's dead" was what I would write.  I just don't want to get caught in the "How do you know it is OUR component failing, and not the CPU/memory???" (which is a fair question)

- I'm going to see if a local PC shop will let me pay them 50-100 bucks to borrow a test bench to confirm the RAM/CPU are good.  

 

Just now, Blakestr said:

I'd be worried about applying voltage, not because of overheating but because the damn chip should be stable without doing anything to it.  It's watercooled and the temps are fine.

Oh definitely, it's just that 50mV extra shouldn't kill it and it's just more for troubleshooting purposes than anything. If increasing the voltage causes the chip to stop crashing, then send it back to Intel, say "it was failing stress tests" and get a new one. I wouldn't run that offset for more than about a day, but if your system is as unstable as you say it is a day should be plenty of time to confirm if it's working or not. 

 

6 minutes ago, Blakestr said:

What do I report to RMA this board?  That's what I am leaning towards.  It's been a long time since I RMA'ed anything, and every time it was simple because "It's dead" was what I would write.  I just don't want to get caught in the "How do you know it is OUR component failing, and not the CPU/memory???" (which is a fair question)

Yeah, that is a bit tough. I've been lucky enough to never have to RMA a motherboard, so this is more from what I've seen; take that for what you will. A lot of the time, if you say the board is giving you weird issues with random blue screens, they will take it back and test it themselves. If they find no issues, they send it back to you and say "it's something else," and if they do, they send you a new one. Your mileage may vary though.

 

8 minutes ago, Blakestr said:

- I'm going to see if a local PC shop will let me pay them 50-100 bucks to borrow a test bench to confirm the RAM/CPU are good.  

If that's an option, that's a pretty good option. 

 

 

Also make sure to quote so we get a notification. 

6 hours ago, RONOTHAN## said:

You've done so much testing of the memory that it being a memory issue is rather unlikely at this point. The three things that I'd suspect being the issue are the CPU itself, the motherboard, and the SSD. If you have a spare SSD to load up Windows on and see if it starts working, that would be ideal. 

I only had about 20 minutes to check because I was about to fall asleep - I installed Windows (again, from a clean USB drive made on another computer) to another M.2 drive in the rig.  So far the errors seem to not be there, but I'm still super leery.  I'm going to leave

 

6 hours ago, Radium_Angel said:

Report back

it running all night downloading something, but I'm looking for ideas on how to test this - before, it was Discord crashing, Chrome would crash, the Epic Games Store would crash.  Other than just using those things and noticing that they are not crashing, is there some other type of "stress test" I could run on the machine?  A dead M.2 drive is probably the best scenario in terms of ease of fixing it, IF that is the problem.  Crap, is Crystal Disk still the go-to checker to rule out the drive problem too?

2 minutes ago, Blakestr said:

 Crap, is Crystal Disk still the go to checker to rule out the drive problem too? 

It's one of the better options, though it's not perfect and I've seen plenty of dying drives not give any report of error in there. Feel free to check, but if the issue went away when you swapped the SSD, I'd say it was probably the SSD that was bad. 

 

3 minutes ago, Blakestr said:

Other than just using those things and noticing that they are not crashing, is there some other type of "stress test" I could run on the machine?

Those are probably the first things you should check, and if it was the SSD it's probably the only thing you can check. Do try to do some more system stress tests like Linpack, Prime95, etc., but the best way is to just use the system and see if you get any errors. 
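
If you want something more repeatable than "open Discord and see", one rough option is a write-and-read-back check on the suspect drive. This is only a sketch under assumptions: the function names are made up for illustration, and because the OS may serve the read-back from its page cache, a mismatch says something in the RAM/controller/drive path is corrupting data without telling you which part, while a clean run proves very little.

```python
import hashlib
import os
import tempfile


def write_random_file(path, size_bytes, chunk=1 << 20):
    """Write size_bytes of random data to path; return the SHA-256 of what was written."""
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            block = os.urandom(min(chunk, remaining))
            digest.update(block)
            fh.write(block)
            remaining -= len(block)
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the data out to the drive
    return digest.hexdigest()


def hash_file(path, chunk=1 << 20):
    """Read path back in chunks and return its SHA-256."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def count_mismatches(directory, size_bytes, passes):
    """Write and read back a temp file `passes` times; count hash mismatches."""
    mismatches = 0
    for _ in range(passes):
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            expected = write_random_file(path, size_bytes)
            if hash_file(path) != expected:
                mismatches += 1
        finally:
            os.remove(path)
    return mismatches
```

For example, `count_mismatches("D:\\", 1 << 30, 20)` would push 20 files of 1 GiB each through a drive mounted at `D:\` (the path is just a placeholder). Any nonzero result on a fresh install is worth reporting.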

Just now, RONOTHAN## said:

It's one of the better options, though it's not perfect and I've seen plenty of dying drives not give any report of error in there. Feel free to check, but if the issue went away when you swapped the SSD, I'd say it was probably the SSD that was bad. 

 

Those are probably the first things you should check, and if it was the SSD it's probably the only thing you can check. Do try to do some more system stress tests like Linpack, Prime95, etc., but the best way is to just use the system and see if you get any errors. 

Going to bed now - but I did keep the other windows installation on the suspect drive.  In the morning I will try booting into it and running different programs and then running them in the "safe" drive and see if there's an obvious difference. 

11 hours ago, RONOTHAN## said:

It's one of the better options, though it's not perfect and I've seen plenty of dying drives not give any report of error in there. Feel free to check, but if the issue went away when you swapped the SSD, I'd say it was probably the SSD that was bad. 

 

Those are probably the first things you should check, and if it was the SSD it's probably the only thing you can check. Do try to do some more system stress tests like Linpack, Prime95, etc., but the best way is to just use the system and see if you get any errors. 

Well it doesn't seem to be the SSD - I'm getting the same errors.  And unless the Epic Games Store launcher and Discord both have some magical bug that is only affecting my machine with a new fresh install, there is something else going on.  

I'm about 90% sure it's the motherboard, but I'd like to be 100% - in January when I built this system I ran the CPU through Cinebench and some of the 3D mark tests, just to have a baseline - but it wasn't like I did it for 24 hours straight (because, frankly, I shouldn't have to to find problems with a CPU) 

Any other ideas? 

5 minutes ago, Blakestr said:

Any other ideas? 

Not really, it just sounds like the motherboard is bad. I'd be contacting ASRock and letting them know that your board is having weird issues and you'd like a replacement. 

4 minutes ago, RONOTHAN## said:

Not really, it just sounds like the motherboard is bad. I'd be contacting ASRock and letting them know that your board is having weird issues and you'd like a replacement. 

Yeah I'm going to do that.  But I wanted to ask, someone mentioned I could consider getting an ECC board, like the w680 (which, I guess is the only platform that can do ECC on DDR5?).  Maybe this troubleshooting experience has made me gun-shy and I'm looking for a way to prevent this from happening.  I'm not sure what difference that would make, since a w680 motherboard can fail too, although I'm not sure in what way it would fail.

 

I'd feel better if I had some understanding about what caused the failure, which I doubt is possible.  Mainly though, I want to confirm that even though 128gb DDR5 is harder to run and people are having problems with the 13th gen memory controllers, running it would have ZERO effect on the board itself.  I could burn up the RAM or make it crash city, but just having unstable timings shouldn't damage the board right?  Because if that is the case, I definitely will switch to a new motherboard and just sell the RMA board when I get a new one back from Asrock.

 

 

 

 

2 minutes ago, Blakestr said:

but just having unstable timings shouldn't damage the board right?

No, it doesn't damage the board. If it did I'd have a lot more dead motherboards on my hands (I'm pretty into memory overclocking, where you usually do have to get a lot of the timings to horribly unstable settings before getting them stable). It can break the OS, and it can corrupt the BIOS in a worst case scenario (I have had that happen), but that's nothing that can't be fixed with a Windows reinstall and a BIOS flash. 

 

3 minutes ago, Blakestr said:

someone mentioned I could consider getting an ECC board, like the w680 (which, I guess is the only platform that can do ECC on DDR5?)

AFAIK AM5 also has some support for ECC DDR5, though from what I've seen a kit is pretty hard to come by. Getting ECC isn't a bad idea if you need 128GB of RAM, though unbuffered ECC RAM is usually so expensive that you might end up spending double what you currently are on RAM. If you can afford it and can come up with some sort of VRM cooling for a W680 board (all the ones I'm aware of have horrible VRMs that will power throttle a 13900K), it's a decent idea to look into.

3 minutes ago, RONOTHAN## said:

If you can afford it and can come up with some sort of VRM cooling for a W680 board (all the ones I'm aware of have horrible VRMs that will power throttle a 13900K), it's a decent idea to look into. 

Ugh, so I'd have to look into getting a VRM waterblock for the motherboard? I couldn't rely on the custom loop cooling the CPU enough? 

1 minute ago, Blakestr said:

Ugh, so I'd have to look into getting a VRM waterblock for the motherboard? I couldn't rely on the custom loop cooling the CPU enough? 

I forgot this board existed, this should be OK with a 13900k as long as there's a bit of airflow over the VRM. 

https://www.asus.com/motherboards-components/motherboards/workstation/pro-ws-w680-ace-ipmi/

 

Most of the W680 boards look closer to this monstrosity and do actually need VRM water cooling to have a shot of powering a 13900K without power throttling:

https://www.newegg.com/supermicro-mbd-x13sae-f-o-supports-12th-generation-intel-core-i3-i5-i7-i9-processors/p/N82E16813183808?Description=w680 motherboard&cm_re=w680_motherboard-_-13-183-808-_-Product

 

Also, counter-intuitively, a custom loop actually makes the VRM run hotter because it's less likely to get direct airflow like it would from a CPU air cooler. Not that it matters for the vast majority of high end boards these days: the Taichi you're currently running could be run without a heatsink and with zero airflow and should still be OK because the VRM is so overkill, but it will technically run hotter with a custom loop than with an air cooler.

On 4/27/2023 at 1:06 PM, RONOTHAN## said:

I forgot this board existed, this should be OK with a 13900k as long as there's a bit of airflow over the VRM. 

https://www.asus.com/motherboards-components/motherboards/workstation/pro-ws-w680-ace-ipmi/

 

On 4/26/2023 at 6:03 PM, Radium_Angel said:

 

 

Report back

 

Just an update - 

I've received the motherboard and new ECC ram.  I installed them and reinstalled windows.  

 

I am STILL getting the same errors.  This is fucking nuts.  

Ran the CPU through Prime Blend for 30 minutes with no errors - also ran it through the OCCT tool for the large/small AVX2 and a large SSE or whatever they are called - no obvious errors - Event Viewer reported a minor PCI error that shows an ID of 17 with a corrupted.

I have no idea what to do now - if there is a problem with the memory controller in this CPU, why can't I get it to manifest during a stress test - I can't even open up my unreal engine project file 

1 hour ago, Blakestr said:

Just an update - 

I've received the motherboard and new ECC ram.  I installed them and reinstalled windows.  

 

I am STILL getting the same errors.  This is fucking nuts.  

Ran the CPU through Prime Blend for 30 minutes with no errors - also ran it through the OCCT tool for the large/small AVX2 and a large SSE or whatever they are called - no obvious errors - Event Viewer reported a minor PCI error that shows an ID of 17 with a corrupted.

I have no idea what to do now - if there is a problem with the memory controller in this CPU, why can't I get it to manifest during a stress test - I can't even open up my unreal engine project file 

You ran this, right?

https://www.intel.com/content/www/us/en/download/15951/intel-processor-diagnostic-tool.html

 

Your BIOS options are all at stock, correct?

 

2 minutes ago, Radium_Angel said:

You ran this, right?

https://www.intel.com/content/www/us/en/download/15951/intel-processor-diagnostic-tool.html

 

Your BIOS options are all at stock, correct?

 

Before I swapped the motherboard, I cleared CMOS and made sure the BIOS was reset.  The only option I changed was power restore on AC loss.

 

I ran the Intel testing tool on that machine.

 

I changed out the motherboard and I'm still having the same errors.  The BIOS obviously came stock, and even though it's only been a day I can't remember everything I changed, but it wasn't much.

 

I've not run the Intel tester on this one; I'll do that now.  The Event Viewer is not showing anything that I can find that is related to these crashes; at least, when the crashes happen I cannot see a new entry appear in the Event Viewer.  I added a tertiary video card to power just a single display; I will pull it when I get home just to make sure it's not somehow causing some system-wide memory access thing.

 

Again this error is so freaking weird.

 

I downloaded a new copy of The Unreal editor and even it is crashing. I have to double-check to make sure that it's not the project files that have somehow become corrupted - It seemed to be a core engine file which would have been new since I downloaded it fresh on this new PC directly from the Unreal server

Also it's worth noting that all these memory addresses are coming back null.  Whenever there's a crash it appears to be related to some pointer not found

ok do this:

Download to a USB stick a LiveUSB install of something like Linux Mint (it's very windows like) and run that for a while off the USB stick.

Linux is very intolerant of bad hardware. If it behaves, then you have a software corruption.

If it crashes, it's def. hardware.

51 minutes ago, Radium_Angel said:

ok do this:

Download to a USB stick a LiveUSB install of something like Linux Mint (it's very windows like) and run that for a while off the USB stick.

Linux is very intolerant of bad hardware. If it behaves, then you have a software corruption.

If it crashes, it's def. hardware.

Would this be even more apparent if I ran it on a VM (or less)?

1 hour ago, Blakestr said:

Would this be even more apparent if I ran it on a VM (or less)?

Possibly, but running on the USB stick, *removing windows entirely from the equation*, is the best way to go

2 hours ago, Radium_Angel said:

Possibly, but running on the USB stick, *removing windows entirely from the equation*, is the best way to go

Understood, I've booted up Linux and now realize I haven't used Linux in like 10 years - for example there's not an easy way to get the Epic Unreal Editor onto Linux without messing with GitHub - is there another stress test I could just run on Linux instead?

1 minute ago, Blakestr said:

Understood, I've booted up Linux and now realize I haven't used Linux in like 10 years - for example there's not an easy way to get the Epic Unreal Editor onto Linux without messing with GitHub - is there another stress test I could just run on Linux instead?

Steam is native...
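
If you'd rather not install anything on the live session, a quick sanity check can be done with a few lines of Python. This is a sketch only: it assumes `python3` is available on the live image, and the function names are invented for illustration. It fills a buffer with classic test patterns and counts bytes that read back differently. A userspace script can't choose which physical memory it gets, so it is far weaker than TM5 or memtest; a nonzero count is a red flag, but zero doesn't clear the hardware.

```python
def pattern_errors(size_bytes, pattern):
    """Fill a buffer with one byte value; return how many bytes read back differently."""
    buf = bytearray([pattern]) * size_bytes
    return size_bytes - buf.count(pattern)


def run_patterns(size_bytes, passes=1, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Run every pattern `passes` times and return the total mismatched bytes."""
    total = 0
    for _ in range(passes):
        for p in patterns:
            total += pattern_errors(size_bytes, p)
    return total


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # 64 MiB per buffer; raise this to cover more RAM
    print("mismatched bytes:", run_patterns(size, passes=2))
```

Save it as something like `memcheck.py` (the filename is arbitrary) and run it with `python3 memcheck.py` from the terminal.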

2 minutes ago, Radium_Angel said:

Steam is native...

But check this out - I was trying to download something in the terminal (I think it was sysbench) and I got a pop-up that said 'cinnamon just crashed' & "You are in fall back mode."

 

Then it asked me if I want to restart Cinnamon.  

 

I guess that helps prove that it's hardware then? Certainly can't be windows at least that makes me feel better
