Unstable 7900XTX Hellbound

Solved by Neutraliz

Hi,
Ever since I took this card apart for a simple repaste and thermal pad swap, it's been acting really strange.

I'm using 1.5 mm thick Thermal Grizzly thermal pads all around, on both VRM and VRAM. (I tested 1 mm, but it wasn't thick enough to make actual contact.)

I've done this on all my cards, and there have been a lot of them, without any issues. Until this 7900XTX.

First I had a big scare, because the moment the card switched to 3D mode, it crashed straight to a black screen. I did my homework and checked all the obvious things it could have been (PCIe, PCIe power, etc.). Nothing.
Tested on my test bench: same thing occurred.
In a panic, I removed the thermal pads on the VRAM: same result.
Then the thermal pads on the VRM: it was working again.
Thing is, the 1.0 mm thermal pads weren't making any contact, and I had thrown away the original thermal pads. So I was committed.
I put 1.5 mm pads back on the VRAM: still working fine.
Did the same on the VRM to double-check whether that was it: ... and now it works?!
Okay...

So I played like this with no issues for a few days.
My original intent was to drop the VRAM temperatures, which were quite high (which worked!), and eventually undervolt the card with the best thermal interfaces it could get.

Seeing that the card looked fine and had been working for days without any issues, I started undervolting. At first it crashed applications or games, but I expected that until I found a good compromise between heat, noise, frequency and vcore. Problem is, this card appears to be a very poor undervolter, or something is wrong.
It kept crashing even at stock, and eventually went back to black screens... again.

Today, it can't hold a load for more than a few seconds. It seems fine on the desktop or boosting applications, but that's it.
Tried reverting to an older driver: same thing.
Tried switching the VBIOS: same thing.
I guess I'll try messing with the VRM thermal pads, but I'm getting worried that I really did something wrong to it.

Specs:

  • AMD Ryzen 9 5950X, 16C/32T.

  • ASUS ROG STRIX X570-I GAMING.

  • DDR4 G.SKILL Trident Z Neo 3600 MHz CL16, 2x 16 GB (32 GB) (Samsung B-die, dual rank).

  • Fractal Design Lumen S24 v2, 240mm, w/ 2x Noctua NF-A12x15 PWM.

  • Powercolor Hellbound AMD Radeon RX 7900XTX. (~= 4080)

  • HDD: Seagate Barracuda 2.5" 5 TB, 5400 rpm.

  • SSD (1/boot): Corsair MP600, 1 TB, NVMe, PCIe gen 4.

  • SSD (2/games): Samsung SSD 860 EVO 4 TB, SATA III.

  • FormD T1 V2, black.

  • Cooler Master V850, 850W, 80+ Gold, SFX.



Any suggestions would be appreciated.

https://linustechtips.com/topic/1583281-unstable-7900xtx-hellbound/

What are the temps on your GPU die and hot spot? When you use the wrong thermal pads, the GPU die may not make proper contact with the heatsink, which can lead to overheating and other issues.

Community Standards

Please make sure to Quote me or @ me to see your reply!

Just because I am a Moderator does not mean I am always right. Please fact check me and verify my answer. 

 

"Beast Mode"

Ryzen 7 9800x3d | Arctic Liquid Freeze 3 Pro 360 | MSI X870 Tomahawk Wi-Fi | MSI RTX 5080 Gaming Trio OC | Gskill Flare X5 6000MT/s CL30

1tb WD Black SN850x NVMe | 4tb WD SN850x NVMe | Antec Flux Pro | Be Quiet Pure Power 13 M 1000w | OWC 10gb NIC

 

Dedicated Streaming Rig

 Ryzen 7 3700x | Asus B450-F Strix | 32gb Gskill Flare X 3200mhz | Corsair RM550x PSU | MSI Ventus 3060 12gb | 250gb 860 Evo m.2

Phanteks P300A |  Elgato HD60 Pro | Avermedia Live Gamer Duo | Avermedia 4k GC573 Capture Card

 


25 minutes ago, Skiiwee29 said:

What are the temps on your GPU die and hot spot? When you use the wrong thermal pads, the GPU die may not make proper contact with the heatsink, which can lead to overheating and other issues.

They are under control, and even if I only reduce the power limit, the card still does the same thing.

There is a big delta between die and hot spot, but the card was already like that when I got it.
With a -5% power limit, the die sits at ~60°C while the hotspot stabilizes at ~87°C.

I don't think heat is the issue; the card doesn't even have time to heat up. Plus the system doesn't shut down, it keeps running after a black screen.


A 27°C delta is huge and a strong indication of bad contact on the die. I bet that when you load up the card, the hotspot immediately jumps and causes your crash. That would be my guess. Generally the delta should be a lot smaller, usually in the 10-15°C range if the die has proper contact.
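
For anyone who wants to put numbers on that rule of thumb, here is a minimal Python sketch. It only uses the figures quoted in this thread (~60°C die, ~87°C hotspot, and a 10-15°C "expected" delta). The function names and the 15°C default threshold are assumptions made up for illustration, not a vendor spec, and a large delta is only a hint to go inspect the mount, not proof of a fault.

```python
def hotspot_delta(die_c: float, hotspot_c: float) -> float:
    """Return the hotspot-minus-die temperature difference in degrees C."""
    return hotspot_c - die_c


def delta_is_suspicious(die_c: float, hotspot_c: float,
                        max_expected_delta_c: float = 15.0) -> bool:
    """Flag a delta above the rule-of-thumb ceiling (assumed 15 C here)."""
    return hotspot_delta(die_c, hotspot_c) > max_expected_delta_c


# Readings reported earlier in the thread at a -5% power limit.
die_c, hotspot_c = 60.0, 87.0
delta = hotspot_delta(die_c, hotspot_c)
print(f"delta = {delta:.0f} C, suspicious = {delta_is_suspicious(die_c, hotspot_c)}")
# prints: delta = 27 C, suspicious = True
```

In practice you would feed it readings from whatever monitoring tool you already use, taken under a steady load.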


Update:

 

I have good news and bad news.

 

The good news is that the card seems to work again. Earlier today it was incapable of sustaining a load for more than a few seconds, and now it just did the full 20 rounds of the Time Spy Extreme Stress Test without issues. I even tried to mess with it by moving the window around (which previously made it crash even faster).

 

The bad news is that the VRM no longer has any contact with the cooler, because I removed its thermal pads. (I DON'T RECOMMEND ANYONE DO THAT.)

Only the VRAM has that privilege.

 

I paid extra-extra attention to screwing the card back together properly to ensure good contact with the GPU.

The delta is still the same.

But the card seems stable for now.

 

Is it the extra attention, or really the VRM thermal pads again? Or is it completely unrelated? No idea for now.

 

I already took action and asked PowerColor for the specs of the spring-loaded screws they use for the backplate and just behind the core, which they provided without issues.

I plan to replace those, because the first teardown gave me trouble and damaged the screw heads, and now the screwdriver struggles to grip them.

 

I also plan to ask them for the exact thickness of the stock thermal pads. It's probably something odd like 1.20, 1.25 or 1.30 mm?

 

If anyone has suggestions, feel free to share them.

 


Update 2:
 

It's getting even weirder.

 

At this point, I don't know what to do.

 

It seems to be working on the test rig? Or maybe I didn't test it long enough?

 

I reset Windows completely on the main rig. It seems to be fine under load.
I also thought about a PSU issue, but it just black-screened at idle... So... probably not?

 

Could it be the PCI-e riser? 

Is it a software issue?
But all of this started when I took the card apart, so... I really don't know what happened.


23 minutes ago, Neutraliz said:

Could it be the PCI-e riser?

Take it off the riser and test. Wouldn't be the first time one has been the root cause due to signal integrity.


1 hour ago, Skiiwee29 said:

Take it off the riser and test. Wouldn't be the first time one has been the root cause due to signal integrity.

I can't really do that with this case, though. I'd have to replace the PCIe riser completely. Otherwise, I'd have to disassemble the PC completely... But I mean, at this point...

I'll probably try that, but the timing of that failure would be pretty odd.

Do you have any recommendation for a solid PCIe 4.0 riser?


Update 4:

Might have found a suspect: it could indeed be the PCIe riser, or at least something related.
On the main PC, the situation degraded rapidly; there's no display at all now. I thought the card was done.

But I put the card in the test rig, and it seems to run fine. (For now, anyway.)

I also removed the PCIe riser and found that some screws on the back of the motherboard were a bit loose and pressed against the riser, which did a bit of damage. Now, is that damage/contact related to anything? I still don't know. I connected that same riser on the test rig, and it's still fine. (I've never wished so badly for a crash, ffs, so I could pinpoint the problem, and it won't! It's so frustrating...)


  • 3 months later...

Update 5 (and hopefully the last) (a few months later):

 


My issue seems to be solved, and the PC has been back together, fully assembled, since December.

Looking back at it, the problem was a combination of 2, if not 3, issues:

1/ Both of my PCIe power cables were badly routed and were compressed against the case. They eventually partly melted, which led to instability.

2/ The PCIe riser was also damaged by a loose screw on the back of the motherboard, which was pressed against it by the graphics card. No idea how that happened, because I never removed or even touched that part of the motherboard.

3/ That very same loose screw could also have caused short circuits with the PCIe riser, which could have been a factor too.

So:

- I replaced the PCIe riser with an updated "version" from FormD. It seems to have more direct and immediate issues with PCIe 4.0 than my old riser did, but simply downgrading to PCIe 3.0 in my motherboard's BIOS wiped out all issues. It's been solid since then.

- I replaced my damaged PCIe power cables with custom-made ones from "DreambigbyRayMOD" on Etsy.

- I (obviously) retightened the loose screw and then secured it with electrical tape. So not only will it be harder for it to come loose again, but the tape will also prevent direct contact.

I have to thank @Skiiwee29, who, despite my doubts, did point to the PCIe riser.
And also, ffs, a dedicated test bench and a PSU tester can really save you a lot of time. I don't know what I would have done without them.

 


The lessons I learned from it:

- Never just THINK a component is good. Make damn sure of it. Inspect it. Test it. Only then can you eliminate it as a source of the issue.

- Cables do have their limits. Don't overstress them, or they'll get hot and eventually melt.

- Don't hesitate too much to take things apart. Yes, it is annoying to tear everything out of the case, but it is so much easier to troubleshoot a PC that way when it's getting impossible to figure out what's going on. Especially with a mini-ITX build.

I knew all these little rules, but I still ignored them... And it burned me. Don't be like me. I'm an idiot.
