
Bizarre behaviour, multiple BSODs, CTDs in multiple games, YT, can't seem to find a culprit.

DeLawrence

Hey all, for the past week or so I have been driving myself nuts with a problem on my main machine that I haven't yet 100% managed to diagnose.

Specs:

CPU - AMD Ryzen 5 1600 @3.7GHz (OC)(*1) cooled by a Corsair H80i

Motherboard - Gigabyte AB350 Gaming 3, F51b BIOS version

RAM - 16GB (4x4GB) Corsair Vengeance LPX (*2)

GPU - Sapphire RX580 Nitro+ 8GB stock clocks (bought new in 2018)

PSU - Seasonic Focus GX 650W, 80+gold, Fully modular

Storage (SSD) - 1TB Intel 660p NVMe, 250GB Samsung 860 SATA, 120GB AMD R3 SATA

Storage (HDD) - 1TB WD Blue, 500GB WD Green

Peripherals(*3), on USB :

   - Logitech G Pro Mouse

   - Redragon Yama Keyboard

   - ASUS Xonar U7 MK II sound card

   - Dell P2314H USB hub ( ASUS USB BT-400 Bluetooth Adapter, only thing in this monitor hub)

Peripherals on GPU:

   - Dell P2314H 1080p 60Hz monitor - Display Port (main)

   - LG 32LF561V 1080p 60Hz TV - HDMI (secondary)

 

OS: Windows 10 Pro x64

 

Notes:

*1) The CPU has been overclocked since August of 2017 after I purchased it. It took around 7 hours to fiddle with the clocks and properly test it, followed by an overnight Prime95 of around 10hrs. Every time I updated the BIOS I ran at least 1hr of AIDA64 stress test on the whole system to make sure it is still stable. It always has been.

*2) The memory modules are mismatched. Initially I had one 8GB kit of 3000MHz CL15 with SK Hynix chips. XMP has never worked on the motherboard, but manually setting the timings did. Later I upgraded but unfortunately couldn't find a matching kit so I settled for a 3000MHz CL16 kit with Micron chips(didn't know in advance). Anyway, bottom line is, since XMP was still not working, even after a few BIOS revisions, I used Ryzen DRAM calculator and settled for a 2933, 16-16-16-28(and the subsequent settings provided by the calculator ofc) configuration. It has been rock solid, did 8 full passes of MEMTEST  back then.

*3) The reason I include the peripherals is because I had experienced some funky USB disconnects or downright disappearance of devices when updating a graphics driver. Recently, upon running Driver Verifier it crashed before even making it to the PIN screen of Windows and the dmp file suggested a bcbtums.sys culprit (which is the Bluetooth driver). Removing it did nothing...it just switched the behaviour to another error code.

 

Now, the problem.

For the past two weeks almost, the computer has been randomly crashing with BSODs of varying error codes notably when any form of GPU accelerated content was running (YT, Movies etc). Games have also been crashing to desktop with no error messages or sometimes lead to a BSOD.

The game doesn't matter. I've had most of them crash: Warframe, Destiny 2, Black Desert, League of Legends, Darksiders Genesis, Satisfactory, Rust etc. Same behaviour.

 

This of course prompted me to start troubleshooting. 

Seeing as the crashes were graphics related, I did the sensible thing and updated/reinstalled the graphics driver, using DDU to clean up beforehand. The behaviour continued, but on a somewhat less frequent basis. Even so...I continued with the troubleshooting. I was inclined to believe MSI Afterburner and RivaTuner were at fault (it has been the case from time to time). I tried different versions of them, both stable and beta, to no avail.

I was using them for just the Fan Curve of the GPU and some metrics via the Overlay. I decided to uninstall them completely to eliminate any possible issue and made the fan curve from Radeon Software and used their metrics tab for FPS and whatnot(which btw is actually really sweet now, I was surprised). The problems persisted however. I don't have a spare GPU on me to test with yet, but I believe the GPU may be faulty in some way. Regardless though, I decided to test everything else so I can rule out other possible causes.

 

I proceeded to do a memory check. I ran MemTest86 for an hour and a few minutes (2 passes). No errors came up. I was expecting that since, as I said in the notes on hardware, the memory had been tested and proven to be stable.

 

I turned my attention to the storage drives. 

I checked S.M.A.R.T. and ran error checks using HDTune on the boot drive (the NVMe) and the Samsung drive, since this is where the games are located. They all passed with flying colors. Lifespans are both at 95%, but that's normal; I've had these for about 2 and 3-4 years respectively, I think, and cycled a lot of data through them. A quick System Stability Test from AIDA64 yielded no crashes.

I also ran sfc /scannow and a Check Disk. sfc yielded corrupted files...but the weird thing is...sfc /scannow always finds corrupted files...like every time I run it (yes, even after reboots). Check Disk, however, came back clean.
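Since sfc keeps reporting corruption on every run, the details it writes to its log may show which files it complains about. Below is a minimal sketch, assuming the log sits in the usual place (`%windir%\Logs\CBS\CBS.log`) and that the entries written by sfc carry the `[SR]` tag. The function name `sfc_entries` is made up for illustration, and reading the log may need an elevated prompt.

```python
import os
from pathlib import Path


def sfc_entries(log_text: str) -> list[str]:
    """Return only the log lines tagged [SR], i.e. the ones sfc wrote."""
    return [line for line in log_text.splitlines() if "[SR]" in line]


if __name__ == "__main__":
    # Assumed default location of the CBS log on a Windows install.
    windir = os.environ.get("windir", r"C:\Windows")
    log_path = Path(windir) / "Logs" / "CBS" / "CBS.log"
    if log_path.exists():
        text = log_path.read_text(encoding="utf-8", errors="replace")
        for line in sfc_entries(text):
            print(line)
    else:
        print(f"No log found at {log_path}")
```

If the same handful of files shows up after every run, that points at something sfc cannot repair from its local store rather than at fresh corruption each time.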

 

I then suspected a PSU issue, but because I don't have a tester on hand (or another unit), I proceeded to load the MF with the most the PC could draw. As a side note, I also changed the cables for the GPU and used a separate cable for each PCI-E auxiliary power connector. I then brought in the cavalry and set off Prime95 small FFTs in conjunction with FurMark on the GPU, with all fans upped to 100%. Things became hot, but nowhere near dangerously hot. The CPU reached nearly 68°C (whereas it barely passes 50°C in normal use), and the GPU went as high as 76°C. The VRM on the motherboard, however, peaked at 99°C. Cooling's bad on that one, especially since I have an AIO, so there's basically no air reaching the VRM. Still, the system didn't crash...not once. I then also ran an AIDA64 stability test stressing everything, again in the hopes of raising power consumption enough to maybe trip the PSU. Nothing...everything went well...I was bewildered and called it a day, went to play some Destiny. Weirdly...it didn't crash that night, I played for like 3hrs straight...with no issues.

 

Next day, thinking all it may have been was a graphics driver shenanigan (as it has been the case a lot of times with AMD drivers lately) I started doing my stuff...this time I got a BSOD while watching a YT video on the TV. 

 

I was literally furious...but alas...troubleshooting away. I ran Driver Verifier on all drivers (except the MS ones), hoping to find a culprit. With Driver Verifier on, the system blue screened as soon as Windows loaded. I didn't even get to the PIN screen.
The culprit in the generated dump file, upon analyzing it with WinDbg? ene.sys.

ene.sys, upon searching the internet, seemed to be a driver for ASUS Aura Sync...to say I was confused is an understatement.

I have nothing directly related to ASUS RGB stuff. The only RGB I have is on the mobo (but that's controlled via RGB Fusion, which I don't use since I can also do that in the BIOS), the GPU (again, I set it once to red using Sapphire's Trixxxxxxxxxx whatever software and uninstalled everything way way way back, like 2019 back; I believe I even wiped Windows once after doing that) and the Logitech mouse. I believe the mouse had it loaded in the G Hub suite for whatever reason...but to me it also seemed bizarre for that driver to cause such damage.

 

Fed up with seemingly random errors, I wiped Windows clean. Moreover, I swapped the boot drive from the NVMe to the SATA Samsung and kept the NVMe as a game library. I thought that maybe there was something wrong with the NVMe driver that caused random file corruption and led to the crashes, I dunno. I was kinda out of ideas at that point.

 

Anyways, installation went fine. I installed the drivers (chipset, LAN, BT, sound card, GPU) and, you know...left it barebones without too much software. I installed Steam and 2 games, Black Desert and Destiny 2. Over the weekend everything was mighty fine: no crashes, games ran smoothly, no weird hangs, lag, anything. On Monday...BDO crashed twice...then D2 once...no BSODs. Then I ran Driver Verifier again...it first gave me the same on-boot crash, with bcbtums.sys (the Bluetooth driver) at fault. HUH?!

I rolled my eyes and removed the receiver, uninstalled the BT driver...to no avail of course...another crash. I ran Driver Verifier once more, only on the AMD drivers (checked everything linked to AMD). This time the system booted and I started doing stuff...it crashed in Chrome (because of DV); culprit: amdkmdag.sys.

It also crashed again, with an nt module error.

 

At this point I had another idea. I remembered that some crashes occurred as I was playing games with a stream or a video open on the secondary TV.

The thing is, the crashes also started like a week or two after I set up the TV as a monitor. Thinking the driver may not like multiple displays for some reason, and since I have 2 DPs and 2 HDMIs, I also swapped the ports, thinking one may be faulty. That wasn't the case. I ended up unplugging the TV and have been running on a single monitor since yesterday.

Again...everything seemed fine, I played some BDO, some D2 for the better part of 6 hours yesterday, no crashes. 

Today...I entered a raid in D2, BSOD, BAD_POOL_CALLER, dxgmms2.sys. 

 

At this point I only have 3 ideas, but I'm mostly out of them. 

 

#1 -  GPU is physically faulty and that's what's causing the crashes. Although, what keeps me from defaulting to it is the fact that outside of the crashes, I haven't seen artifacts once which would indicate the GPU is dying. Furthermore, when it runs, it runs perfectly, temps are fine, there are no weird clock variations, no FPS dips, nothing.

 

#2 - The motherboard. Remember the *3 note on specs? USBs have always been funky with this board. The CPU VRM has also always run hot (although it's in spec, 100°C in Prime isn't bad per se, especially since in normal usage it hovers around 70°C). Still, given the amalgam of seemingly random error codes and culprits, one would be inclined to think there's something wrong with it.

 

#3 - Another weird behaviour...the crashes lately seem to occur after a cold boot...I usually start the computer in the morning and shut it down at night when I go to sleep. I have noticed that, in the afternoon and towards the evening, there are basically no crashes. The sample size is small ofc, like 3-4 days, but it doesn't seem coincidental to me. 

I had the first Xonar U7 sound card a few years back and after 3 years or so it manifested the infamous Blink of Death, where the card would not start up and have one blue LED blinking constantly. Turns out it was a quartz chip that for whatever reason refused to work cold. Blowing hot air using a hair dryer  for a few seconds magically started up the sound card which would then work like a charm. 

I was thinking it may also be the case here, where I have a component, either on the GPU or the motherboard, that doesn't work properly until it heats up. It's far-fetched, but given the aforementioned scenario, not impossible.
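One way to put the cold boot hunch on firmer ground than a 3-4 day impression is to note each boot time and each crash time and look at how long after boot the crashes land. This is only a sketch: the function name `hours_since_boot` is made up, and the timestamps below are placeholders, not readings from this machine. Real ones could be copied by hand from Event Viewer or from the dump file dates.

```python
from bisect import bisect_right
from datetime import datetime


def hours_since_boot(boots: list[datetime], crashes: list[datetime]) -> list[float]:
    """For each crash, hours elapsed since the most recent boot before it."""
    boots = sorted(boots)
    result = []
    for crash in crashes:
        i = bisect_right(boots, crash)
        if i == 0:
            continue  # no boot recorded before this crash, so skip it
        result.append((crash - boots[i - 1]).total_seconds() / 3600)
    return result


# Placeholder timestamps for illustration only.
boots = [datetime(2021, 1, 1, 8, 0), datetime(2021, 1, 2, 8, 30)]
crashes = [
    datetime(2021, 1, 1, 9, 30),
    datetime(2021, 1, 2, 9, 0),
    datetime(2021, 1, 2, 20, 30),
]
for hours in hours_since_boot(boots, crashes):
    print(f"{hours:.1f} h after boot")
```

Keep in mind that the reboot after a BSOD counts as a boot too, but not a cold one, so only the first power-on of the day should go into the list if the point is to test the "works once warm" theory.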

 

I am attaching 3 dump files that I still have, since my dumbass forgot to save the others before I wiped Windows clean.

I'd be really grateful if you had any ideas and things for me to test. 

Obviously the next step is for me to source a GPU and see if the behaviour continues. I do have an old R9 270 lying around back at my parents' home, but it's a long way and I really don't want to travel these days if I don't need to, given the resurgence of COVID cases.

 

 

 

 

System dxgmms2.dmp
Chrome amdkmdag.sys.dmp
Chrome nt module.dmp

Edited by DeLawrence
Multiple edits for typos and some formatting

Any tests should be done without overclocking. You're OC'ing the cpu.

You're pushing it with ram from different manufacturers.  Also Ryzen 1xxx processors are known for having a not so great memory controller, that is more tuned/tweaked/optimized towards some memory chips ... Hynix wasn't one of them. 

Then you also have the motherboard, which is not really great in the memory department ... you can even see it in the specs page : " Support for DDR4 3200(O.C.)/2933(O.C.)/2667*/2400/2133 MHz memory modules"

 

So you're running at high frequencies and you're also running 4 sticks, which is harder on the controller in the cpu.  

While diagnosing everything, I'd go down to 2666 MHz with the memory, just to be on the safe side.

 

Cold boot issues were mostly caused in the past by degraded electrolytic capacitors. Nowadays, motherboards use solid polymer capacitors everywhere it matters; they keep using some electrolytic capacitors with sound cards, as people think sound is better with the more analogue-like characteristics of electrolytic capacitors.

 

Power supplies also have a lot of polymer capacitors but they'll still include some electrolytic capacitors for bulk capacitance / additional filtering. That Seasonic power supply SHOULD be fine, but it wouldn't hurt to test with a different power supply. 

 

As for AMD driver issues, my advice would be to uninstall the drivers using DDU and install older versions of the AMD video card drivers. I have an RX 570 at home and I think my drivers are from around March of this year, and there's no reason to upgrade to the latest. (edit: see below, it's actually April, close enough)

 

The only reason to upgrade would be for the control panel and more software options, but the drivers themselves are unlikely to have improvements. They've long stopped improving the "Polaris" branch (the family with RX xxx cards); most of their work is on Navi, the RX 5xxx and RX 6xxx ... it's quite possible that the hot fixes and game patches they put in these new releases are not tested thoroughly with Polaris series cards and do worse on those.

 

Yeah, just checked on amd website, here's the "previous drivers" page for  RX 5xx cards (it says rx 570 in name, but driver is same package for all RX 5xx cards including your RX 580) : https://www.amd.com/en/support/previous-drivers/graphics/radeon-500-series/radeon-rx-500-series/radeon-rx-570

 

The driver I currently have on my home computer is Adrenalin 21.4.1 Recommended (WHQL)  from 04/20/2021

 

edit.  NOTE ... you may have to go with a newer driver version if it's a requirement with the Windows 10 / 11 version, at home I still run Windows 7.


3 hours ago, mariushm said:

Any tests should be done without overclocking. You're OC'ing the cpu.

You're pushing it with ram from different manufacturers.  Also Ryzen 1xxx processors are known for having a not so great memory controller, that is more tuned/tweaked/optimized towards some memory chips ... Hynix wasn't one of them. 

Then you also have the motherboard, which is not really great in the memory department ... you can even see it in the specs page : " Support for DDR4 3200(O.C.)/2933(O.C.)/2667*/2400/2133 MHz memory modules"

 

So you're running at high frequencies and you're also running 4 sticks, which is harder on the controller in the cpu.  

While diagnosing everything, I'd go down to 2666 MHz with the memory, just to be on the safe side.

 

I know it's kind of bad practice to test while overclocked, though the CPU has been running with those settings for 4 years, and the RAM for about 3, so I saw no reason why it would suddenly stop working, like completely. It's also not the max of what the hardware was capable of, so it's not like I pushed it to the limit. I had a 4.1GHz config on the CPU and managed to push 3200 CL17 in testing, but it was really taxing on VRM temps and had diminishing returns; that's why I settled for the config that I mentioned.

My reluctance to go back to stock honestly came from an issue with the BIOS profiles that I admit I haven't yet tested on the F51b version. In the previous iterations it would not save the settings even though it appeared to do so. And since every time I updated the BIOS I had to manually punch in the timings again, I didn't even think of returning to stock, so that I wouldn't have to go through the hassle of re-entering every value afterwards.

Although, I suppose at this stage, it wouldn't hurt returning to complete stock for a bit and see how that goes.


 

Cold boot issues were mostly caused in the past by degraded electrolytic capacitors. Nowadays, motherboards use solid polymer capacitors everywhere it matters; they keep using some electrolytic capacitors with sound cards, as people think sound is better with the more analogue-like characteristics of electrolytic capacitors.


So I thought too, that degraded parts were a thing of the past and even then...I still have like 2001-2002 boards that run just fine(even though they're unusable in terms of performance). It seemed odd to me for a board part to suddenly kick the bucket like that and not cause a complete failure, but alas, it was worth considering.

 

Power supplies also have a lot of polymer capacitors but they'll still include some electrolytic capacitors for bulk capacitance / additional filtering. That Seasonic power supply SHOULD be fine, but it wouldn't hurt to test with a different power supply. 

Yeah, the PSU wasn't one of my main concerns either, especially after the load tests. It would make no sense for it to work under sustained max load on CPU+GPU, fans etc., but randomly crash at lower levels of consumption. And given it's a Seasonic, I wouldn't expect it to behave randomly. If something was faulty in it I would expect the PC to not POST at all, unless it's a cable or a loose connector or something (which I checked for, a-ok).

 

As for AMD driver issues, my advice would be to uninstall the drivers using DDU and install older versions of the AMD video card drivers. I have an RX 570 at home and I think my drivers are from around March of this year, and there's no reason to upgrade to the latest. (edit: see below, it's actually April, close enough)

 

The only reason to upgrade would be for the control panel and more software options, but the drivers themselves are unlikely to have improvements. They've long stopped improving the "Polaris" branch (the family with RX xxx cards); most of their work is on Navi, the RX 5xxx and RX 6xxx ... it's quite possible that the hot fixes and game patches they put in these new releases are not tested thoroughly with Polaris series cards and do worse on those.

 

Yeah, just checked on amd website, here's the "previous drivers" page for  RX 5xx cards (it says rx 570 in name, but driver is same package for all RX 5xx cards including your RX 580) : https://www.amd.com/en/support/previous-drivers/graphics/radeon-500-series/radeon-rx-500-series/radeon-rx-570

 

The driver I currently have on my home computer is Adrenalin 21.4.1 Recommended (WHQL)  from 04/20/2021

 

edit.  NOTE ... you may have to go with a newer driver version if it's a requirement with the Windows 10 / 11 version, at home I still run Windows 7.

The reason I updated it in the first place was that I still had the long-standing Alt-Tab black screen bug. It got fixed at some point in one of the releases, I think the July one (although it wasn't noted), but as luck would have it...fix one thing, 2 more problems shall appear...I had the intermittent issue with the fan not spinning at all (stuck in zero RPM mode), which I solved with a third-party tool, MSI Afterburner (hence why I had it in the first place). I also had some stupid issue where the GPU would not ramp up in clock speed, causing lag and low FPS all around.

AMD's known and fixed issues notes for the drivers are downright stupid...things that they said they fixed still happen, and not just for me...while long standing issues have yet to be acknowledged (like the alt tab black screen). 

 
Now that I'm looking at the previous versions, something that I didn't think of doing in all this clown fiesta(I don't know how, maybe I'm just tired and not thinking clearly)...I notice this has been a known issue since July:

Driver timeouts may be experienced while playing a game & streaming a video simultaneously on some AMD Graphics products such as Radeon™ RX 500 Series Graphics.

Could this be what I'm experiencing? I mean I would expect a driver timeout to like crash the game but not the computer...and certainly not pop up 1500 different error codes and possible causes. 

Hmm...I'll go back to 21.6.1 seeing as it's a WHQL release and go from there, see if anything improves.


I've answered each of the points in the quoted post.
Thanks for the help, I'll report back if I find new developments.

 

 
