
RTX 4090 Constant BSOD DPC_WATCHDOG_VIOLATION

ManEatingFridge
Solved by ManEatingFridge:
On 10/15/2023 at 10:29 AM, Mark Kaine said:

Funny enough that error message pretty much guarantees it's *not* the GPU (directly at least)

 

 

replace this with real ram, 2 sticks only, trident z

Funny enough that I got the GPU refunded and after putting in a new one it fixed all of my issues.

I think it is best summed up by the support request I sent to ZOTAC:

Quote

I purchased a ZT-D40900B-10P SKU RTX 4090 on November 21 2022 and it has developed severe intermittent crashing issues. The card causes the graphics driver (nvlddmkm) to freeze and the PC to crash on a DPC Watchdog Violation BSOD after the GPU has been unresponsive for too long. This happens when a 3D application has been opened, or has been running for a while, or more recently simply less than 20 seconds after booting into Windows.

The PC is powered by a Seasonic PRIME TX-1000 1000W 80+ Titanium power supply and there are no issues with the CPU or memory after running Windows Memory Diagnostic and CPU stability tests. When connecting the display to the motherboard output all issues cease and the PC is perfectly stable. The crashing issues have been intermittent. They started with 3D applications crashing due to the DX12 device becoming unresponsive after some period (15min - 3h). This progressed to the mentioned increasingly frequent BSOD issues with sometimes 10+ restarts clearing the issue for a while. More recently it is impossible to use the PC with the main display connected to the graphics card as it crashes shortly after booting into Windows or even at the Windows lock screen. I have tried rolling back drivers, reinstalling Windows twice, updating VBIOS to latest from Zotac, underclocking the card to 80% power target in Firestorm, plugging the PC straight into a wall socket instead of a surge protector, and reseating the card in the PCIe slot.

I attempted an RMA with the seller (www.ebuyer.com) on 31/03/2023 but they reported they could not reproduce the issue:

"Following extensive tests by our Returns Technicians we have been unable to locate the fault you reported, therefore, the goods will be returned to you with no further action."

Due to the intermittent nature of the fault, I was somewhat able to use the card after it was returned but the issue has gotten worse and worse. I can no longer use the card for work in heavy 3D applications (game engines) and would like at least some technical support or a complete replacement from the manufacturer.

I sent this and got an automated response telling me a ZOTAC representative would get in contact within 24-72 hours. It's been 2 weeks so idk what to do at this point. An update on the "when connecting the display to the motherboard output all issues cease and the PC is perfectly stable" - this is also no longer the case. Doing that made the PC stable enough to get into Windows but as soon as I open a game or game engine the display freezes and it will BSOD if I wait for the DPC watchdog to time out.

 

Got the BSOD information by launching CS2 and leaving the main menu running for about a minute. The game froze, I closed it, and after about 90 seconds (or however long the DPC watchdog takes) I got the DPC_WATCHDOG_VIOLATION BSOD (with heavy graphical artifacts). Zip attached, PC spec: https://uk.pcpartpicker.com/b/WpBcCJ, OS is OEM x64 Windows 10 Pro 10.0.19045 Build 19045, NVIDIA driver is 537.42 (latest as of 30/09/2023). Couldn't get perfmon /report to finish collecting for some reason.
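For anyone who wants to check which crash dumps Windows actually wrote before zipping them up, here is a minimal Python sketch. It assumes small memory dumps are enabled and land in the default `C:\Windows\Minidump` folder; the function name `list_minidumps` is made up for this example, and reading that folder may need an elevated prompt.

```python
from datetime import datetime
from pathlib import Path

# Default folder Windows uses for small memory dumps (assumption: the
# dump location has not been changed in the system's recovery settings).
DEFAULT_DUMP_DIR = Path(r"C:\Windows\Minidump")


def list_minidumps(dump_dir=DEFAULT_DUMP_DIR):
    """Return (name, modified time, size in bytes) per .dmp file, newest first."""
    dump_dir = Path(dump_dir)
    if not dump_dir.is_dir():
        return []
    files = [
        p for p in dump_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".dmp"
    ]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [
        (p.name, datetime.fromtimestamp(p.stat().st_mtime), p.stat().st_size)
        for p in files
    ]


if __name__ == "__main__":
    for name, modified, size in list_minidumps():
        print(f"{modified:%Y-%m-%d %H:%M}  {size:>10} bytes  {name}")
```

If the list comes back empty right after a BSOD, the dump setting or the folder permissions are the first things to check.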

 

Anything for me to try or does anyone have recommendations for how to get ZOTAC or the seller to replace this brick?

SysnativeFileCollectionApp.zip

Send help


Hi, this sucks.. @ManEatingFridge

 

But there can be other reasons for DPC_WATCHDOG_VIOLATION than your GPU. 

 

Copied from google search.

"The DPC_WATCHDOG_VIOLATION Windows 10 error can be caused by hardware conflicts. If there are newly installed external hard drives, external solid-state drive, printer, or scanner on your PC, you should remove or disconnect all these external devices. Then restart your computer."

 

You don't say much about your system except that it has a Zotac 4090 and a 1000w psu.. what about everything else in it? 

 


@Robchil The pcpartpicker list https://uk.pcpartpicker.com/b/WpBcCJ is in the post but here you go:

 

CPU: AMD Ryzen 9 7950X

Mobo: Asus ROG STRIX X670E-F GAMING WIFI

RAM: Corsair Vengeance 64 GB (4 x 16 GB) DDR5-5600 CL36

GPU: Zotac GAMING AMP Extreme AIRO GeForce RTX 4090

SSD: Samsung 990 Pro 2 TB

PSU: SeaSonic PRIME TX-1000

 

The reason I don't suspect any other components is because Event Viewer specifically says nvlddmkm crashed, there are graphical artifacts on the BSOD, Memory Diagnostic shows no issues with the memory, CPU torture tests show no issues, temps are all normal even right up to the crash from HWiNFO logging, the crashing is triggered by 3D applications, and most importantly when I remove the GPU from the PC all issues cease. Anecdotally a local PC repair shop guy said he has never seen a Seasonic PSU fail in 21 years and said my issues are 99% likely just a manufacturing defect in the GPU memory.
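The "temps are normal right up to the crash" check can be repeated from a sensor log. Below is a rough Python sketch that reads a CSV log and reports the highest value of one column over the last few rows before the log stops. The column name you pass in is whatever your own log uses (anything shown here is a placeholder), the `latin-1` default encoding is an assumption about how the log was saved, and `max_in_last_rows` is just a name invented for this example.

```python
import csv


def max_in_last_rows(csv_path, column, last_n=60, encoding="latin-1"):
    """Return the highest numeric value of `column` over the last `last_n` rows.

    Rows where the value is missing or not numeric are skipped.
    Returns None if no numeric value is found.
    """
    with open(csv_path, newline="", encoding=encoding) as f:
        rows = list(csv.DictReader(f))
    values = []
    for row in rows[-last_n:]:
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            # Column absent, short row, or a non-numeric cell: ignore it.
            continue
    return max(values) if values else None
```

With one row logged per second, `last_n=60` covers roughly the final minute before the freeze, which is the window that matters for ruling out overheating.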

 

If you have suggestions for what else might cause the issue I'm open to it though.


1 minute ago, ManEatingFridge said:

@Robchil The pcpartpicker list https://uk.pcpartpicker.com/b/WpBcCJ is in the post but here you go:

 

CPU: AMD Ryzen 9 7950X

Mobo: Asus ROG STRIX X670E-F GAMING WIFI

RAM: Corsair Vengeance 64 GB (4 x 16 GB) DDR5-5600 CL36

GPU: Zotac GAMING AMP Extreme AIRO GeForce RTX 4090

SSD: Samsung 990 Pro 2 TB

PSU: SeaSonic PRIME TX-1000

 

The reason I don't suspect any other components is because Event Viewer specifically says nvlddmkm crashed, there are graphical artifacts on the BSOD, Memory Diagnostic shows no issues with the memory, CPU torture tests show no issues, temps are all normal even right up to the crash from HWiNFO logging, the crashing is triggered by 3D applications, and most importantly when I remove the GPU from the PC all issues cease.

 

If you have suggestions for what else might cause the issue I'm open to it though.

it just means your card is conflicting with something else in your system.. that the vendor didn't have in their system and couldn't reproduce it. 

 

but it's nvlddmkm that crashes due to the conflict.. 1000w sounds like plenty of power.. but have you tried other components to test if it stops then? 

 

i would start with just the mainboard, cpu, cooler and ram installed, plus a disk with windows on it and the psu, all outside the case, to test if it still happens when nothing else is connected. 

 

have you tried disabling the nvidia audio devices? 


does it happen when you have the GPU in the second PCIE port? 

 

Have you tried with nvme in a different slot? 

 


All of the audio devices other than Focusrite USB Audio are disabled. I haven't tried with the GPU in any other PCIE slots because it doesn't really fit anywhere other than the first one: [attached image]

 

I could try swapping the SSD to a different M.2 slot I guess but I don't have that much hope that it would help. As for powering the system on with a minimal configuration I already did this after my second Windows reinstall and the stability issues were still there (back then it was just games crashing because the DirectX device became unresponsive).


44 minutes ago, ManEatingFridge said:

All of the audio devices other than Focusrite USB Audio are disabled. I haven't tried with the GPU in any other PCIE slots because it doesn't really fit anywhere other than the first one: [attached image]

 

I could try swapping the SSD to a different M.2 slot I guess but I don't have that much hope that it would help. As for powering the system on with a minimal configuration I already did this after my second Windows reinstall and the stability issues were still there (back then it was just games crashing because the DirectX device became unresponsive).

that's good info that you tried everything outside. it should fit in the lower pcie slot but you would have to take the board out i assume due to spacing. 

if the m.2 drive only is pcie 4.0 anyway i would use one of those. 

 

but test swapping nvme slot first. 

 


31 minutes ago, Robchil said:

that's good info that you tried everything outside. it should fit in the lower pcie slot but you would have to take the board out i assume due to spacing. 

if the m.2 drive only is pcie 4.0 anyway i would use one of those. 

 

but test swapping nvme slot first. 

 

Tested it and no difference


33 minutes ago, ManEatingFridge said:

Tested it and no difference

tested different pcie slot too already? 

 

look into pcie 4x band on the nvme config in bios too.. set it down to 2x and see if that makes a difference.  

 

check that cpu cooler is not too tight on the cpu, it can have an effect. but mostly with ram.  but if it can have with ram, it can have issues with pcie lanes too. as they are directly connected to the cpu.. 

the cpu itself can be a problem too.  check pins that they are all ok.. 

 

example.. if it works with the secondary pcie slot, it can be cpu or mainboard that is the problem. 

 

or generally just that you lost the lottery on what is compatible with each other too. 

 


17 minutes ago, Robchil said:

tested different pcie slot too already? 

 

look into pcie 4x band on the nvme config in bios too.. set it down to 2x and see if that makes a difference.  

 

check that cpu cooler is not too tight on the cpu, it can have an effect. but mostly with ram.  but if it can have with ram, it can have issues with pcie lanes too. as they are directly connected to the cpu.. 

the cpu itself can be a problem too.  check pins that they are all ok.. 

 

example.. if it works with the secondary pcie slot, it can be cpu or mainboard that is the problem. 

 

or generally just that you lost the lottery on what is compatible with each other too. 

 

If it was just a compatibility issue I don't understand how the GPU could have worked perfectly fine for the first 3 months or so. Worth mentioning that I tested all the components on a different motherboard when my friend was building a PC at one point and saw the same issues. Already checked the CPU pins before and there is nothing wrong with them. I have tested the system on a 980 evo 1TB ssd I had lying around and saw the same issues.


4 minutes ago, ManEatingFridge said:

If it was just a compatibility issue I don't understand how the GPU could have worked perfectly fine for the first 3 months or so. Worth mentioning that I tested all the components on a different motherboard when my friend was building a PC at one point and saw the same issues. Already checked the CPU pins before and there is nothing wrong with them. I have tested the system on a 980 evo 1TB ssd I had lying around and saw the same issues.

have you tested with another psu? a 1200w one? i would.. and return it if it doesn't make a difference. 

 

 


1 hour ago, Robchil said:

have you tested with another psu? . 1200w one?  i would..  and return it if it's not making a difference. 

 

 

This is basically the only component I haven't been able to rule out because all of the PSU troubleshooting basically starts and ends with "compare the PSU against a known working unit." I don't really want to buy a new PSU just to see if my 1000W 80plus titanium PSU could be faulty - hence why I went with trying to return the card. That brings me back to the original post:

 

ZOTAC so far just hasn't responded to my support ticket for 2 weeks and the one time I tried to RMA through the seller they just sent the card back. The PC repair shop I talked to said it's likely they just plugged the card in and didn't really test it, and recommended basically what I'm doing now: trying to RMA directly through the manufacturer or trying to RMA with the seller again.

 

I'm left very frustrated that my expensive workstation PC is left useless until ZOTAC decides to grace me with a response or I have to cross my fingers and hope the seller somehow this time does replace the card. I guess I could try calling ZOTAC support next week?


1 minute ago, ManEatingFridge said:

This is basically the only component I haven't been able to rule out because all of the PSU troubleshooting basically starts and ends with "compare the PSU against a known working unit." I don't really want to buy a new PSU just to see if my 1000W 80plus titanium PSU could be faulty - hence why I went with trying to return the card. That brings me back to the original post:

 

ZOTAC so far just hasn't responded to my support ticket for 2 weeks and the one time I tried to RMA through the seller they just sent the card back. The PC repair shop I talked to said it's likely they just plugged the card in and didn't really test it and recommended basically what I'm doing now. Trying to RMA directly through the manufacturer or trying to RMA with the seller again. I'm left very frustrated that my expensive workstation PC is left useless until ZOTAC decides to grace me with a response or I have to cross my fingers and hope the seller somehow this time does replace the card.

well.. although i have an older system.. with an i9 9900k and 2 2080ti's in sli.. i have a 1600w psu.. with a 4090 i would never use anything less than a 1200w one..  you get the most expensive gpu chip.. and save on the power.. 

 

both 3090 and 4090 i would pair with 1200+ watt..  they are shown to have spikes.. and if it worked for 3 months, that might be about how long your PSU could take the spikes..  and it basically just can't deliver through spikes anymore. 

 


8 minutes ago, Robchil said:

well.. although i have an older system.. with an i9 9900k and 2 2080ti's in sli.. i have a 1600w psu.. with a 4090 i would never use anything less than a 1200w one..  you get the most expensive gpu chip.. and save on the power.. 

 

both 3090 and 4090 i would pair with 1200+ watt..  they are shown to have spikes.. and if it worked for 3 months, that might be about how long your PSU could take the spikes..  and it basically just can't deliver through spikes anymore. 

 

I think it was gamersnexus that showed the card only spiking 40% for 1ms which is within range of a 1000W PSU even if it would spike that high. More importantly, I've run the card at an 80% power limit (and the CPU in 140W Eco Mode) for most of its life so it wouldn't even get that high.
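To put rough numbers on that, here is a back-of-the-envelope Python sketch of the worst-case draw. The 40% spike, the 80% power limit, the 140 W CPU figure and the 1000 W PSU come from this thread; the 450 W stock board power for the 4090 and the 100 W allowance for the rest of the system are my own assumptions, and the spike is assumed to scale with the limited power rather than the stock power.

```python
def worst_case_draw_w(gpu_board_power_w, gpu_power_limit, spike_fraction,
                      cpu_w, rest_of_system_w):
    """Estimate the momentary system draw in watts during a GPU power spike."""
    gpu_sustained = gpu_board_power_w * gpu_power_limit
    gpu_peak = gpu_sustained * (1 + spike_fraction)
    return gpu_peak + cpu_w + rest_of_system_w


PSU_W = 1000  # rated capacity of the PSU in this build

# 450 W board power and 100 W for everything else are assumed values.
limited = worst_case_draw_w(450, 0.80, 0.40, cpu_w=140, rest_of_system_w=100)
stock = worst_case_draw_w(450, 1.00, 0.40, cpu_w=140, rest_of_system_w=100)

print(f"80% power limit: {limited:.0f} W of {PSU_W} W")
print(f"stock power limit: {stock:.0f} W of {PSU_W} W")
```

Under those assumptions the spike peaks at roughly 744 W with the 80% limit and roughly 870 W at stock, both under the 1000 W rating. This says nothing about whether the individual unit is faulty, only that the rated capacity is not obviously the bottleneck.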


All the dump files point to the Nvidia driver. With this crash it can be hard for the OS to know which driver is at fault so I looked through the other pending DPCs listed in the dump file, but I didn't find any good suspects. The PPM (Processor Power Management) did show up often, but that's quite common. You could try updating the Chipset driver, but I don't have high hopes.

 

There is a new BIOS out that you could try flashing to. You can also try using DDU to clean out the old driver before installing a fresh one if you haven't tried this already. 


Just now, ManEatingFridge said:

I think it was gamersnexus that showed the card only spiking 40% for 1ms which is within range of a 1000W PSU even if it would spike that high. More importantly, I've run the card at an 80% power limit for most of its life so it wouldn't even get that high.

ah.. to avoid the melted connector? 

well.. it can be that your psu just is bad and the failure manifested after 3 months..  have you tested other GPUs with the same power supply? not many cards pull the same loads a 4090 does. 

RMAing the PSU would be the next step if they don't accept an RMA on your GPU. 

 


32 minutes ago, Bjoolz said:

All the dump files point to the Nvidia driver. With this crash it can be hard for the OS to know which driver is at fault so I looked through the other pending DPCs listed in the dump file, but I didn't find any good suspects. The PPM (Processor Power Management) did show up often, but that's quite common. You could try updating the Chipset driver, but I don't have high hopes.

 

There is a new BIOS out that you could try flashing to. You can also try using DDU to clean out the old driver before installing a fresh one if you haven't tried this already. 

Thanks a lot for taking a look at the dump files. I don't have much hope for updating drivers/BIOS/DDU since I went through an extensive round of that when these issues started developing, going as far as getting the latest VBIOS for the GPU and format+reinstalling Windows twice and constantly updating BIOS over the last 6 months. That being said can never hurt to try one more time so I'll give it a shot.


42 minutes ago, Robchil said:

ah.. to avoid the melted connector? 

well.. it can be that your psu just is bad and failure manifested after 3 months..  have you tested other GPU's with the same powersupply? not that many managed to pull the same loads a 4090 does. 

RMA the PSU would be next step if they don't accept RMA on your GPU. 

 

It was mostly because I just didn't need the card to use that much power when I was getting more than enough performance most of the time. I wish I had more components lying around to swap out the GPU/PSU but I moved to the UK and sold my old PC after putting this one together. Don't really want to buy replacement components this expensive in a blind try to fix the PC. At that point, I might as well take the PC to the local repair shop for them to try but they just told me to RMA the card again since that is almost certainly the issue.


16 minutes ago, ManEatingFridge said:

It was mostly because I just didn't need the card to use that much power when I was getting more than enough performance most of the time. I wish I had more components lying around to swap out the GPU/PSU but I moved to the UK and sold my old PC after putting this one together. Don't really want to buy replacement components this expensive in a blind try to fix the PC. At that point, I might as well take the PC to the local repair shop for them to try but they just told me to RMA the card again since that is almost certainly the issue.

yeah i would think GPU too the way you have described it.. but there are a lot of components, like the PSU, that don't really cause problems unless they're under heavy load from a strong GPU. 

it's almost like tossing pasta at the wall and seeing what sticks. 

 

but if they can't replicate it in their rig... we know nothing about what their setup is tho.. 

so PSU is the only thing i can think of that would be limiting, even if it's a 1000w one 

 

but... it's also weird you get artifacting when it crashes..  it usually just goes black and turns off if it's the psu. 

 

so you kind of have to try anything in the troubleshooting book..  also try turning off EXPO/XMP and see if that does anything.. and everything else between A and Z.. 

 


1 hour ago, Bjoolz said:

All the dump files point to the Nvidia driver. With this crash it can be hard for the OS to know which driver is at fault so I looked through the other pending DPCs listed in the dump file, but I didn't find any good suspects. The PPM (Processor Power Management) did show up often, but that's quite common. You could try updating the Chipset driver, but I don't have high hopes.

 

There is a new BIOS out that you could try flashing to. You can also try using DDU to clean out the old driver before installing a fresh one if you haven't tried this already. 

Updated chipset driver, BIOS, DDU + reinstalled latest NVIDIA driver. Got the same BSOD, no change.


28 minutes ago, ManEatingFridge said:

Updated chipset driver, BIOS, DDU + reinstalled latest NVIDIA driver. Got the same BSOD, no change.

Could you physically remove the GPU and use the motherboard video output and see if you crash. And if you do, please post the dump file. 


1 hour ago, Bjoolz said:

Could you physically remove the GPU and use the motherboard video output and see if you crash. And if you do, please post the dump file. 

If you look at the pcpartpicker link you'll see why I have to repaste my CPU to remove/reinstall the GPU. Figured I could try this anyway since I would need to remove the GPU if the RMA request goes through. [attached image]

An incredible 2.21 FPS. Anyway, I already spent close to a month with no GPU in the PC when I did my first RMA request so I already know it just doesn't crash with it removed. I can run a CPU stress test or leave a 3D application running for 12h or something but I doubt it will result in a BSOD.


4 hours ago, ManEatingFridge said:

If you look at the pcpartpicker link you'll see why I have to repaste my CPU to remove/reinstall the GPU. Figured I could try this anyway since I would need to remove the GPU if the RMA request goes through. [attached image]

An incredible 2.21 FPS. Anyway, I already spent close to a month with no GPU in the PC when I did my first RMA request so I already know it just doesn't crash with it removed. I can run a CPU stress test or leave a 3D application running for 12h or something but I doubt it will result in a BSOD.

Everything just points to the GPU then in my mind. 


  • 2 weeks later...

Hi! Did you figure out a solution? I am in a similar situation. I just bought all top-of-the-line parts: 64GB DDR5 ram, ASUS RTX 4090 with liquid cooling, Noctua NH-D15 cpu cooler, ASUS ROG STRIX Z790-E motherboard, ASUS ROG HELIOS case, Crucial 4TB PCIe 5.0 SSD, and ASUS ROG THOR 1200W PSU.

 

All fresh out of the box, I put it all together and did a fresh install of Windows 11. Updated BIOS and all drivers. After 2 days, it started crashing with the same error that you had. Whenever any form of 3D graphics would be put up, it would crash. Now it crashes as soon as I log onto my account 😕

 

I tried everything that OP tried as well, and i have a 1200w PSU with a display that shows the total wattage that it is outputting, and it has never shown above 400W. Temperature is stable and cables are connected. Tried multiple versions of GPU drivers, used DDU to uninstall previous versions.

 

This is very frustrating and disappointing, any advice would be appreciated.


did you guys check the PSU cable? maybe it short-circuited or maybe it is malfunctioning.

Try replacing the cable with another one or reseat the cable if it is a modular psu.

 

Pls Mark a solution as a solution, would be really helpful.

BTW pls correct me, iam really stoobid at times.


Funny enough that error message pretty much guarantees it's *not* the GPU (directly at least)

 

 

On 9/30/2023 at 2:27 PM, ManEatingFridge said:

RAM: Corsair Vengeance 64 GB (4 x 16 GB)

replace this with real ram, 2 sticks only, trident z

