PCI Special Interest Group blames Nvidia for RTX 40 series 12VHPWR melting cables

AlTech
30 minutes ago, Mark Kaine said:

That's really convenient though... and who's to judge this anyway? 

Every single case shows the same sign of not being plugged in: there is a witness line visible on the connector that proves it wasn't seated all the way.

30 minutes ago, Mark Kaine said:

Let's just say it was "user error" , doesn't that imply that the design Nvidia decided on is really not suitable for average users and should have therefore been designed differently? 

?

A 0.05% failure rate from this. At 0.05%, these issues are not coming from the average user.

30 minutes ago, Mark Kaine said:

 

i can think of several ways to design such a connector in an unsafe way that leads to a lot of "user errors"... wouldn't that then not still be a fail design ? 

The issue is that users are not plugging it in all the way, and it is exacerbated by the connector being able to pull to the side, with foreign debris causing a high-resistance parallel current path that should not exist. There are not several ways this is unsafe; it's literally one way. A better clip would not fix this issue, as the failure happens before that point.

30 minutes ago, Mark Kaine said:

 

tbth, yes i think so ... safety critical features should be designed with ease of use and clear feedback to the user in mind. 

Which one could definitely argue Nvidia failed to do.

A 0.05% failure rate, once again. And it's not Nvidia's standard, so it's not on them.

30 minutes ago, Mark Kaine said:

 

 

tl;dr: as you can read in this very thread, Nvidia made an unsafe design with a too-short pin that doesn't give clear feedback that it's actually attached properly, and even then it might come off all by itself. This is exactly the opposite of "user error" and 100% on whoever designed this failed plug...

 

 

Nvidia did not make this design. The pins are not too short; in fact, you could argue the sense pins are too long. Feedback that it's attached properly is literally a non-issue, as the failure happens before you get to that point. At no point did a fully connected, clipped connector "come off". No amount of feedback would fix what happened here, because even if the feedback were a loud siren shouting "I'M CONNECTED", the user would not hear or feel it; again, they never reached that step. The clip never engaged for there to be feedback.

This is exactly user error.


8 hours ago, starsmine said:

A 0.05% failure rate from this. At 0.05%, these issues are not coming from the average user.

8 hours ago, Mark Kaine said:

Now what's the failure rate among people who actually perform significant cable management? Because the vast, vast majority of system builders don't. If you don't get the clip to engage, but then don't really fuck with the cable to get it looking just right, it's never going to fry itself. To know the actual failure rate of the clip system frying itself, you would have to know the percentage of system builders who perform significant cable management, and then take the rate at which the cards are frying from that.


9 hours ago, starsmine said:

At no point did a fully connected, clipped connector "come off"

That is exactly what has been claimed in several instances, though.

 

9 hours ago, starsmine said:

a better clip would not fix this issue

Yes it would, that's the whole point: the current one is difficult to clip in fully, and it might also come off again even when it was plugged in correctly, *because it is too short*.

100% design failure. 

 

9 hours ago, starsmine said:

Nvidia did not make this design

Who did?

 

 

The direction tells you... the direction

-Scott Manley, 2021

 

Software used:

Corsair Link (Anime Edition) 

MSI Afterburner 

OpenRGB

Lively Wallpaper 

OBS Studio

Shutter Encoder

Avidemux

FSResizer

Audacity 

VLC

WMP

GIMP

HWiNFO64

Paint

3D Paint

GitHub Desktop 

Superposition 

Prime95

Aida64

GPUZ

CPUZ

Generic Logviewer

 

 

 


7 minutes ago, Mark Kaine said:

that is exactly what has been claimed in several instances though. 

The thing is, though, you can see the witness line in every instance, showing how far it was plugged in. No one was close to engaging the clip.

7 minutes ago, Mark Kaine said:

Yes it would, that's the whole point: the current one is difficult to clip in fully, and it might also come off again even when it was plugged in correctly, *because it is too short*.

100% design failure. 

No, it would not; none of these users had plugged it in correctly. The "because it's too short" claim does not make sense here; that is literally not the issue. Nothing here is too short. There is an argument to be had that the sense pins are too long, but nothing is too short.

7 minutes ago, Mark Kaine said:

 

Who did?

 

 

PCI-SIG


1 hour ago, starsmine said:

The thing is, though, you can see the witness line in every instance, showing how far it was plugged in. No one was close to engaging the clip.

Yes, because it's a failed design.

 

The reason it failed is not the argument here; the argument is why it wasn't seated properly, and that's where the design matters. The old plugs were easier to plug in *and* had better (audible) feedback.

 

So no one really knows if they plugged it in "correctly" to begin with, and that is again by design, so it's hardly "user error".

 

And you didn't really acknowledge the argument that these plugs, due to their poor design, could come loose by themselves in normal usage scenarios, i.e. simply due to their own weight and the clip being too small or poorly designed... which again heavily implies no "user error".

 

Basically, you can't come up with a new, not properly tested design and, when it fails, simply claim "user error". That's just not how it works.

 

 

1 hour ago, starsmine said:

PCI-SIG

Did they, or did they just approve it?

 

Not sure why they're blaming Nvidia then?

 

 

 

 



With these crazy current levels in modern computers, maybe it's time power supply manufacturers start adding more protections to trip the power in situations like this.

As for watching whether more cases appear as more GPUs hit the market: that data is no longer reliable, because it's contaminated by the fact that people now know about this issue and will pay attention to it during builds. The numbers will be significantly lower than they would have been if the issue hadn't been publicized.

I also have to agree that while it's technically user error, since people aren't plugging the cables in all the way, it's also a design issue: the connector shouldn't work at all if not fully plugged in, and some of these cables can be a bear to plug in fully (and even worse to unplug), which encourages the user error. If seatbelts required excessive force and just the right angle to fasten, and it were difficult to tell whether they were fastened, there would be a lot more people flying through windshields in accidents. But would that really be user error? I think most would see it as poor design leading to an inevitable failure on the users' end.


On 12/7/2022 at 10:28 AM, LAwLz said:

Wait... So is PCI SIG saying that the design of the connector is not their responsibility? That can't be right. Surely things like the retention mechanism should be part of the standardized design in order to ensure that it actually works as intended even when mixing and matching products from different vendors. 

 

The testing and quality control I can understand being up to the manufacturer, but part of the issue (a small issue, I might add; there are very few reports of burnt cables) is the design not giving enough feedback that the cable is fully secured. Surely that should be part of the standardized design and not left up to individual manufacturers to figure out.

In principle it's possible to electronically detect that the connector is inserted badly given the sense connections. Ideally you'd give a warning or limit the power in those instances, but I think GPU and PSU manufacturers need to have a chat with each other and determine who implements which safety features with which negotiation strategy and which fallbacks.


1 hour ago, ImorallySourcedElectrons said:

In principle it's possible to electronically detect that the connector is inserted badly given the sense connections. Ideally you'd give a warning or limit the power in those instances, but I think GPU and PSU manufacturers need to have a chat with each other and determine who implements which safety features with which negotiation strategy and which fallbacks.

I think if they could do that on the GPU side, then it could simply give a standardized error message ("power connector not connected properly") to the user, and until it is addressed, limit its own power draw to whatever the PCIe slot itself can deliver. PSU makers would then have to do nothing, making it one less product that needs to be updated.
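A minimal sketch of that fallback, under the assumption of a hypothetical driver API: `connector_seated` and the function names are made up for illustration, not anything Nvidia's software actually exposes; only the 75 W slot budget comes from the PCIe spec.

```python
PCIE_SLOT_LIMIT_W = 75  # what a PCIe x16 slot alone is specified to deliver

def apply_power_policy(connector_seated, board_limit_w):
    """Return (allowed power in W, warning or None).

    Hypothetical driver-side logic: if the 12VHPWR connector does not
    report as fully seated, cap the card at slot-only power and warn.
    """
    if connector_seated:
        return board_limit_w, None
    return PCIE_SLOT_LIMIT_W, "power connector not connected properly"
```

The appeal of keeping it this simple is exactly the point above: the PSU needs no changes at all, because the card polices itself.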

 

 

Grammar and spelling is not indicative of intelligence/knowledge.  Not having the same opinion does not always mean lack of understanding.  


Embedded thermocouples on the card side of the connector, card throttles when connector gets hot and throws an error drawn over whatever is displayed on screen. Super annoying, very hard to ignore, prevents melting, can be implemented with minor hardware and software changes from the card manufacturer, and boy could they ever market the crap out of how much safer it is now.


2 hours ago, mr moose said:

I think if they could do that on the GPU side, then it could simply give a standardized error message ("power connector not connected properly") to the user, and until it is addressed, limit its own power draw to whatever the PCIe slot itself can deliver. PSU makers would then have to do nothing, making it one less product that needs to be updated.

 

 

To a certain degree, yes. But you are limited in what you can do from the GPU side if the PSU doesn't behave in a certain way. Meanwhile, the PSU could fairly easily detect resistance increases, which indicate cable or connector issues. At the GPU side, you might have to use tricks to get around that, and you wouldn't be able to detect every failure case.

1 hour ago, Bitter said:

Embedded thermocouples on the card side of the connector, card throttles when connector gets hot and throws an error drawn over whatever is displayed on screen. Super annoying, very hard to ignore, prevents melting, can be implemented with minor hardware and software changes from the card manufacturer, and boy could they ever market the crap out of how much safer it is now.

This might not help; the time between just running hot and outright failure might be surprisingly short.


1 hour ago, ImorallySourcedElectrons said:

To a certain degree, yes. But you are limited in what you can do from the GPU side if the PSU doesn't behave in a certain way. Meanwhile, the PSU could fairly easily detect resistance increases, which indicate cable or connector issues. At the GPU side, you might have to use tricks to get around that, and you wouldn't be able to detect every failure case.

This might not help; the time between just running hot and outright failure might be surprisingly short.

Give the GPU authority to quickly drop power, or just shut that power off and bring down the system. At the next boot, it displays a warning before the POST screen that there was a thermal event at the connector. Remember when GPUs used to put up a screen when you didn't connect PCIe power? I'm sure they can figure something out; they're smart people, right?


1 minute ago, Bitter said:

Give the GPU authority to quickly drop power, or just shut that power off and bring down the system. At the next boot, it displays a warning before the POST screen that there was a thermal event at the connector. Remember when GPUs used to put up a screen when you didn't connect PCIe power? I'm sure they can figure something out; they're smart people, right?

You're comparing very different failure modes. If you don't insert the external power cables, you still get 75 W through the PCIe connector, which is plenty to run the graphics chip to display such a warning, or even to get to the desktop without significant graphics acceleration on newer cards. The connector not being plugged in is fairly easy to detect; just the +12V supply rails missing would do the job. Using the sense lines for that is kind of ridiculous, to be honest, and a bloody waste of connector pins.

 

Meanwhile, the PSU could use the sense connections to signal the GPU that it has to draw less power in a fairly gradual way:

[Table: sideband sense-pin states mapped to permitted power levels, from Intel's ATX 3.0 design guide]

(Source: https://cdrdv2.intel.com/v1/dl/getcontent/336521 )
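For reference, that table maps the two sideband sense pins to an initial power-up limit and a maximum sustained limit. Here's a sketch of the lookup, with the wattages as I recall them from the ATX 3.0 material; treat them as assumptions and verify against the spec before relying on them:

```python
# (SENSE1, SENSE0) -> (initial power-up limit W, maximum sustained W)
# "gnd" = pin tied to ground in the cable, "open" = pin left floating
SENSE_POWER_LIMITS = {
    ("gnd", "gnd"): (375, 600),
    ("gnd", "open"): (225, 450),
    ("open", "gnd"): (150, 300),
    ("open", "open"): (100, 150),
}

def negotiated_limits(sense1, sense0):
    # An absent or miswired cable reads both pins open, so the card
    # falls back to the lowest tier rather than failing dangerously.
    return SENSE_POWER_LIMITS.get((sense1, sense0), (100, 150))
```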

Meanwhile, all the card can do is tell the PSU "you're going out of spec", which ain't particularly helpful.

 

However, this does not explain how the PSU could detect that. Assuming no manufacturer tries something funny, all the 12V lines arriving at the connector are tied together. What you can in principle do from the PSU's point of view is use one of the six 12V or GND lines as a sense line to detect cable/connector impedance changes. If you do not wish to lose the ability to provide power through said line, you can also slightly lower or raise the voltage on that one pin to get an idea of the impedance without reducing the amount of power available to the card. Doing this from the GPU's side is vastly more complicated and would most likely mean losing a significant chunk of your maximum available power; additionally, you wouldn't be able to detect every failure.

An alternative, from the GPU side, is just checking the supply voltage at the maximum load the PSU says it can supply. For example, if you're drawing 50 A at 12 V, you're probably going to see a significant voltage drop if the connector is not plugged in properly. But if it's mostly plugged in, a short burst might not detect that, and the issue would only appear once the connector has heated sufficiently. Additionally, your power supply isn't necessarily going to be happy with this tactic, plus you'd be using your digital logic as a resistive load of sorts, causing significant heating on the card.
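As a rough illustration of that tactic (all numbers invented for the example): estimate the cable-plus-connector resistance from the voltage droop between two known load currents, and flag the path if it exceeds a threshold. A healthy 12VHPWR path is in the low single-digit milliohm range; a partially seated connector can be an order of magnitude worse.

```python
def estimated_path_resistance(v_light, i_light, v_heavy, i_heavy):
    """Ohm's law on a load step: R = dV / dI, as seen at the card."""
    return (v_light - v_heavy) / (i_heavy - i_light)

def connector_suspect(v_light, i_light, v_heavy, i_heavy, threshold_ohm=0.020):
    # threshold_ohm is an arbitrary example figure, not from any spec
    return estimated_path_resistance(v_light, i_light, v_heavy, i_heavy) > threshold_ohm
```

For instance, 12.00 V at 5 A sagging to 11.40 V at 25 A implies about 30 mΩ in the path, which would trip this check; the caveat above still applies, since a loose pin's resistance may only climb once it has heated up.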


I'm skimming over some of the details, but honestly they should not have cheaped out, and should have implemented actual voltage sensing, the way proper high-current power supplies do. Then again, I wouldn't be surprised if companies like Seasonic already have some degree of voltage sensing at the load in place. (In fact, I assume they do at these ridiculous currents.)


26 minutes ago, Bitter said:

Give the GPU authority to quickly drop power, or just shut that power off and bring down the system. At the next boot, it displays a warning before the POST screen that there was a thermal event at the connector. Remember when GPUs used to put up a screen when you didn't connect PCIe power? I'm sure they can figure something out; they're smart people, right?

That's overlooking the fact that most PSUs are not connected to anything that can give that feedback. We're still dealing with PSUs the same way we dealt with them in 1987. Even ATX12VO has no sideband signal pins, and only "optionally" has one of the 12V pins act as a sense pin.

 

Like, there are some major oversights with the entire ATX 3.0 standard itself. ATX12VO, for instance, also says a mains disconnect switch is OPTIONAL.

Even this...

[Image: excerpt from the ATX specification listing a protection feature as optional]

Why is this not "REQUIRED"? Who gets blamed for a charred or fused PCB conductor?

 

Like, we've gone around in circles in this thread and on this forum about who is to blame and why the failure happened, and a lot of it can be summed up with the statement "manufacturers got too cheap".

 

Make the connector bigger and deeper, and have the clip give audible and physical confirmation that it's connected. Anything we do with the present connector is pretty much going to amount to "blame the GPU manufacturer" and kicking it over to them to either not use this version of the connector going forward, or to employ resistance checking on power-up and load-balance checking during operation. That, of course, is going to complicate the actual circuitry on the GPUs themselves, perhaps even negating any cost savings from the smaller connector.

 

Realistically, there was nothing wrong with the existing PCIe connectors, and "bigger" connectors are always harder to insert anyway (I'm looking at you, USB 3 with no clip), so perhaps the right solution would have been to take the existing 6+2 pin (those 2 pins are grounds used as sense pins) and add 4 pins to the other side in the same way, creating a 12-pin connector out of it.

 


I'm not talking about the PSU communicating with the GPU at all.

 

GPU sees a hot connector; GPU immediately throttles power draw or ceases drawing power from that connector. I'm sure it's doable with some software and the power circuitry already present on the card: tell the power phases drawing off the 12V connector to stop, or to rapidly ramp to zero or near zero, to remove the high current that's causing the heat. The phases can already be throttled if they're getting too hot or the GPU is drawing too much power, so the control is there; they just have to add the code to make a thermal sensor at the connector trip it. The sensor could sit on the pins at the PCB, since metal is a good conductor of heat; one on the power plane and one on the ground plane would likely do it. Up to one per pin at the PCB could be done, but that would need more traces for diminishing benefits.

 

Maybe it won't prevent all damage, but if it can prevent or drastically lessen the chance of a fire, or reduce the likelihood of card or connector damage, it's worth it. The trip point can be a curve based on current draw, up to say 20% under the max allowed temperature, or just a hard-stop number. PSU OCP and short-circuit detection don't prevent all fires or melted cables either, but they do dramatically reduce the odds of a house fire, or of a dead short to ground taking out the PC.
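That trip curve could look something like the sketch below. Every number here is a placeholder, not from any connector datasheet: the allowed connector temperature tightens linearly as current rises, down to a floor 20% under the absolute maximum.

```python
MAX_TEMP_C = 105.0     # assumed connector housing rating
MAX_CURRENT_A = 50.0   # assumed full-load current for the card

def trip_temperature(current_a):
    """Allowed temperature falls with current, floored 20% under max."""
    floor_c = MAX_TEMP_C * 0.8
    scale = max(0.0, min(current_a / MAX_CURRENT_A, 1.0))
    return MAX_TEMP_C - (MAX_TEMP_C - floor_c) * scale

def should_throttle(temp_c, current_a):
    # Trip earlier at high current, since that's when heating is fastest.
    return temp_c >= trip_temperature(current_a)
```

With these placeholder values, the card trips at 105 °C when idle but already at 84 °C at full load.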


6 minutes ago, Bitter said:

I'm not talking about the PSU communicating with the GPU at all.

 

GPU sees a hot connector; GPU immediately throttles power draw or ceases drawing power from that connector. I'm sure it's doable with some software and the power circuitry already present on the card: tell the power phases drawing off the 12V connector to stop, or to rapidly ramp to zero or near zero, to remove the high current that's causing the heat. The phases can already be throttled if they're getting too hot or the GPU is drawing too much power, so the control is there; they just have to add the code to make a thermal sensor at the connector trip it. The sensor could sit on the pins at the PCB, since metal is a good conductor of heat; one on the power plane and one on the ground plane would likely do it. Up to one per pin at the PCB could be done, but that would need more traces for diminishing benefits.

 

Maybe it won't prevent all damage, but if it can prevent or drastically lessen the chance of a fire, or reduce the likelihood of card or connector damage, it's worth it. The trip point can be a curve based on current draw, up to say 20% under the max allowed temperature, or just a hard-stop number. PSU OCP and short-circuit detection don't prevent all fires or melted cables either, but they do dramatically reduce the odds of a house fire, or of a dead short to ground taking out the PC.

And how do you propose the GPU sees the hot connector? It sounds simple in theory, but in practice it's a very difficult problem. 


38 minutes ago, ImorallySourcedElectrons said:

And how do you propose the GPU sees the hot connector? It sounds simple in theory, but in practice it's a very difficult problem. 

That could be done with a temperature sensor, just like what's in/on the CPU and motherboard. But as was mentioned, once it's hot enough to detect a problem, the damage is probably already done. Using temperature for something like this probably wouldn't be effective; it would have to use current and/or load.


2 hours ago, vertigo220 said:

That could be done with a temperature sensor, just like what's in/on the CPU and motherboard. But as was mentioned, once it's hot enough to detect a problem, the damage is probably already done. Using temperature for something like this probably wouldn't be effective, it would have to use current and/or load.

The issue with connector failures is that they can be very sudden, connectors often are bistable mechanical systems. Once a certain threshold is reached, it can suddenly disengage very quickly. Hence why a proper voltage sensing topology is actually desirable. But I suspect it'd be very difficult to get such a standard passed given that it would invalidate a lot of previous designs and would make compatibility with previous topologies significantly more difficult.


1 hour ago, ImorallySourcedElectrons said:

The issue with connector failures is that they can be very sudden, connectors often are bistable mechanical systems. Once a certain threshold is reached, it can suddenly disengage very quickly. Hence why a proper voltage sensing topology is actually desirable. But I suspect it'd be very difficult to get such a standard passed given that it would invalidate a lot of previous designs and would make compatibility with previous topologies significantly more difficult.

Nah, just come out with ATX 3.0B or 3.1 or whatever.

 

Sucks to have invested early in an ATX 3.0/ATX12VO PSU, but that doesn't mean you can't use it; it just means the dedicated 12-pin connector doesn't get used. The PSUs still have PCIe power. Likewise, if you want to stick that 4090 in a PC in three years, you'll need to keep the PCIe adapter.

 


9 hours ago, ImorallySourcedElectrons said:

And how do you propose the GPU sees the hot connector? It sounds simple in theory, but in practice it's a very difficult problem. 

Thermocouple bonded to the power and ground planes of the board where the pins solder into the through-holes. The metal pins are a known length with known thermal conductivity, so you can run a delta to figure the temperature: X temperature at the couple means Y temperature in the area of the pins you're concerned about. The thermocouple feeds the card BIOS, which controls the power phases; the BIOS can already throttle power based on phase temps, among other things, so just add some code for thermocouple-based power throttling or shutdown. Once the current is reduced or gone, the very low thermal mass of the parts in question should halt the heating very quickly.
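The delta described here is just one-dimensional steady-state conduction: heat flowing along a pin of known length, cross-section, and conductivity raises the far end by q·L/(k·A) over the sensed temperature. A sketch with rough, assumed brass-pin numbers (none from a real datasheet):

```python
def pin_hot_end_temp(sensed_temp_c, heat_w, length_m, area_m2, k_w_per_m_k):
    """Steady-state conduction along the pin: delta_T = q * L / (k * A)."""
    return sensed_temp_c + heat_w * length_m / (k_w_per_m_k * area_m2)

# Example: ~0.5 W conducted along a 5 mm brass pin (k ~ 110 W/m*K)
# with a ~1 mm^2 cross-section, board-side sensor reading 60 C:
# the mating interface would sit roughly 23 C hotter.
```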


3 hours ago, Bitter said:

Thermocouple bonded to the power and ground planes of the board where the pins solder into the through holes.

That does nothing for any of the existing stuff already made. The Djinn was already let out of the bottle. Adding this stuff instead of fixing the connector just increases the cost, which then doesn't justify having the connector at all.

 

