Do I need to be worried about premature failure in a laptop with a BGA CPU and a cooler with a low thermal mass?

2K6Ejmt6IV72L8fwHZ5sPhtP6L · January 20, 2021

My previous laptop was retired prematurely because the GPU had failed. The laptop would thermal trip regardless of temperature on any boot attempt not immediately following a CMOS reset, would not display any video at all on about 30% of successful boot attempts, and if it did display video, the image would be covered in artifacts. Additionally, any attempt at using accelerated graphics would crash the whole thing, and text mode characters were garbled to the point of becoming gibberish. Basically, the GPU on that laptop is now only useful for getting into the BIOS (which is very visually distorted but just barely usable); otherwise the computer is essentially headless. (That laptop has discrete graphics only, so "just switching to the integrated graphics" is impossible.)

My research suggests (and feel free to correct me if I'm wrong, because beyond this point I don't really know what I'm talking about) that issues like artifacting tend to occur as a result of data integrity issues between the GPU die and other graphics components, while silicon degradation would cause it to not work at all or only at a reduced frequency and/or increased voltage. Additionally, silicon degradation appears to only really happen at all if running the chip far beyond reasonable voltage, current, or temperature, and I was running it with default settings at reasonable temps. I don't know what the previous owner did (it was a used laptop), but as it was a high end mobile workstation I'd guess he/she/they also ran it at default settings for maximum reliability. However, I have heard that the connections can break when repeated thermal cycling causes the die, substrate, and mobo to expand and contract at different rates, eventually putting too much stress on them, and broken connections would certainly cause signal integrity issues. Additionally, the first symptoms appeared a few weeks after a repaste, which suggests that the GPU was already close to failure and removing or reinstalling the cooler provided just enough force to finally break it. So I think it is most likely that the laptop's GPU died from excessive thermal cycling.

That laptop's CPU and GPU were almost always at very similar temperatures even with significant load differences, which I would guess is because they shared a cooling system, so I would assume that they both experienced similar thermal cycling. However, the CPU remains usable and displays no signs of failure. The CPU also has a much larger die but a similar package size, which would exacerbate any thermal expansion differences between the die and substrate. I would also guess that with a similar package size, they would have a similar number of pins, though I can't find a pin count for the GPU to confirm this. The only major difference I can think of that could make the GPU fail first is that the GPU is BGA while the CPU is PGA. And that kinda makes sense, because BGA holds everything stiffly in place so stress will cause breakages, while PGA has flexible pins in holes with wiggle room so stress will just bend or move the pins a bit. (I would imagine that LGA would have a similar advantage to PGA.)

I have a new laptop, but I am worried that my new laptop (and almost all new laptops for that matter, not just mine) has the same design flaw with the CPU, but much worse because:

The CPU in my new laptop for some reason has a much larger package size than the CPU or GPU in my old laptop, exacerbating thermal expansion between the package and the mobo. (Though I think the die is smaller, so there shouldn't be as many issues between the die and substrate.)
Although the new laptop produces at least 6x less heat than the old one and runs at significantly lower temperatures in general, the cooling system has a much lower thermal mass, leading to temps being able to swing up or down by as much as 50°C in <10 seconds. (My old one took maybe 10 minutes to heat up and another 10 minutes to cool down.) I'm concerned that this will result in much faster and much more frequent thermal expansion and contraction.
While my GPU failure left my laptop usable with artifacts and sleep-wake issues for months before it finally became unusable, having this issue with the CPU will likely render the computer unusable as soon as it first appears due to crashing, buggy behavior, or data corruption.
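
As a rough illustration of why thermal mass sets how fast temperatures swing, here is a minimal sketch of a first-order (lumped) thermal model, where the temperature rise after a step load follows P·R·(1 − exp(−t/(R·C))). The thermal resistance, heat capacity, and power figures below are made-up placeholders, not measurements of either laptop, and real coolers are not a single lump, so treat this as a sketch under stated assumptions only.

```python
import math

def temperature_after(t_s, power_w, r_th, c_th, t_ambient=25.0):
    """Temperature in °C, t_s seconds after a constant load is applied.

    Lumped model: one thermal resistance r_th (K/W) to ambient and one
    heat capacity c_th (J/K). The time constant is tau = r_th * c_th.
    """
    tau = r_th * c_th
    return t_ambient + power_w * r_th * (1.0 - math.exp(-t_s / tau))

def time_to_fraction(fraction, r_th, c_th):
    """Seconds needed to reach `fraction` (0..1) of the steady-state rise."""
    return -r_th * c_th * math.log(1.0 - fraction)

# Hypothetical values chosen only for illustration (assumptions, not data):
# a heavy cooler with a hot chip versus a light cooler with a cool chip.
heavy_cooler = dict(power_w=75.0, r_th=0.8, c_th=150.0)
light_cooler = dict(power_w=12.0, r_th=4.0, c_th=1.0)

for name, p in (("heavy cooler", heavy_cooler), ("light cooler", light_cooler)):
    rise = p["power_w"] * p["r_th"]                      # steady-state rise in K
    t90 = time_to_fraction(0.9, p["r_th"], p["c_th"])    # time to 90% of that rise
    print(f"{name}: steady-state rise {rise:.0f} K, 90% reached in {t90:.0f} s")
```

In this model the steady-state temperature depends only on power and thermal resistance, while the heat capacity only changes how quickly that temperature is reached. That matches the observation that a cooler with little thermal mass can run cooler overall and still swing much faster.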

So do I have to worry about my new laptop failing prematurely in the same way that my old one did, possibly in even less time? Or is there some recent advancement that made this much less likely to happen, simple probability, or a false assumption I made that results in me not having to be worried? Additionally, I want to keep using my old laptop as a server because the CPU still works. It won't be used for anything mission critical (I wouldn't use a laptop for something like that at all, especially a broken one), but does the fact that the GPU failed imply that a similar failure on the CPU is only a few days/weeks/months away, or are the parts different enough that I can assume the CPU is still completely fine?

Grabhanem · January 20, 2021

The CPU manufacturers design their CPUs to be able to handle the stresses they'd experience within their rated temperature range. As long as your temps aren't objectively bad, you don't have too much to worry about.

2K6Ejmt6IV72L8fwHZ5sPhtP6L · January 20, 2021

Just now, Grabhanem said:

The CPU manufacturers design their CPUs to be able to handle the stresses they'd experience within their rated temperature range. As long as your temps aren't objectively bad, you don't have too much to worry about.

Shouldn't Nvidia, the manufacturer of my GPU, have done that too? Or are they just crap?

I think my temps were OK. The old one's CPU and GPU both idled at around 60°C, stayed at around 70-75°C under normal use, and typically went up to 80-85°C under heavy load. My new laptop idles at around 35-40 and stays at around 45-50 under normal use, but still hits 80-85 under heavy load, though heavy load that causes those temps occurs much less frequently on the new one.
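
For a rough sense of how the size of each temperature swing might matter, here is a sketch using the simplified Coffin-Manson relation, in which cycles to failure scale as ΔT raised to a negative exponent. The exponent of 2.0 is an assumption (values in that neighborhood are often quoted for solder joints), the swings are rounded from the idle and load temperatures above, and the model ignores cycle frequency, dwell time, and peak temperature, so it is only an order-of-magnitude illustration.

```python
def relative_cycles_to_failure(delta_t_ref, delta_t_new, exponent=2.0):
    """Ratio N_new / N_ref under the simplified Coffin-Manson relation.

    N is proportional to delta_T ** (-exponent), so a larger swing per
    cycle means fewer cycles before fatigue failure. The exponent is an
    assumed, material-dependent parameter.
    """
    if delta_t_ref <= 0 or delta_t_new <= 0:
        raise ValueError("temperature swings must be positive")
    return (delta_t_ref / delta_t_new) ** exponent

# Swings rounded from the temperatures quoted above (idle to heavy load):
old_swing = 85.0 - 60.0   # old laptop: about 25 K
new_swing = 85.0 - 35.0   # new laptop: about 50 K

ratio = relative_cycles_to_failure(old_swing, new_swing)
print(f"Estimated cycles to failure, new relative to old: {ratio:.2f}")
```

With these inputs the ratio comes out to 0.25, meaning the model predicts roughly a quarter as many idle-to-load cycles before fatigue for the larger swing. It says nothing about how many cycles either machine actually sees or how much margin the manufacturer designed in.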

Electronics Wizardy · January 20, 2021

1 minute ago, 2K6Ejmt6IV72L8fwHZ5sPhtP6L said:

Shouldn't Nvidia, the manufacturer of my GPU, have done that too? Or are they just crap?

I think my temps were OK. The old one's CPU and GPU both idled at around 60°C, stayed at around 70-75°C under normal use, and typically went up to 80-85°C under heavy load. My new laptop idles at around 35-40 and stays at around 45-50 under normal use, but still hits 80-85 under heavy load, though heavy load that causes those temps occurs much less frequently on the new one.

Well, some failures are random, so we can't know for sure, but most likely you just got unlucky. Most laptops become too slow to be useful before they die.

But those temps are all fine, don't worry about it.

Grabhanem · January 20, 2021

Just now, 2K6Ejmt6IV72L8fwHZ5sPhtP6L said:

Shouldn't Nvidia, the manufacturer of my GPU, have done that too? Or are they just crap?

I think my temps were OK. The old one's CPU and GPU both idled at around 60°C, stayed at around 70-75°C under normal use, and typically went up to 80-85°C under heavy load. My new laptop idles at around 35-40 and stays at around 45-50 under normal use, but still hits 80-85 under heavy load, though heavy load that causes those temps occurs much less frequently on the new one.

If those were your core temps, I'd suspect the graphics memory (which is typically not monitored) was running out of spec and gave up. Unless your GPU was defective, those temps shouldn't cause a problem.

2K6Ejmt6IV72L8fwHZ5sPhtP6L · January 20, 2021

1 minute ago, Electronics Wizardy said:

Well, some failures are random, so we can't know for sure, but most likely you just got unlucky. Most laptops become too slow to be useful before they die.

But those temps are all fine, don't worry about it.

OK, that makes sense.

1 minute ago, Grabhanem said:

If those were your core temps, I'd suspect the graphics memory (which is typically not monitored) was running out of spec and gave up. Unless your GPU was defective, those temps shouldn't cause a problem.

I don't know whether the GPU temperature was core or package. The GPU exposed just one temperature sensor, labeled something like "nvidia-smi", which gave no indication of where on the GPU it is actually located. However, the CPU readings were package, not core.
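
For reference, the single reading can also be pulled straight from the `nvidia-smi` command-line tool. This is a minimal sketch that assumes `nvidia-smi` is installed and that the driver supports the `temperature.gpu` query field, which NVIDIA's query help describes as the core GPU temperature in degrees C. The function names are hypothetical, and older GPUs or drivers may report a placeholder such as `[N/A]` instead of a number, which the parser simply skips.

```python
import subprocess
from typing import List

def parse_temperatures(output: str) -> List[int]:
    """Parse nvidia-smi CSV output (one line per GPU) into integers.

    Lines that are not plain numbers, such as "[N/A]", are skipped.
    """
    temps = []
    for line in output.splitlines():
        value = line.strip()
        if value.isdigit():
            temps.append(int(value))
    return temps

def read_gpu_temperatures() -> List[int]:
    """Query nvidia-smi for the GPU temperature of each GPU, in °C."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)

# Usage (requires an NVIDIA GPU and driver):
#     print(read_gpu_temperatures())
```

Memory temperature, where it is exposed at all, is a separate field, so a reading like this would not show whether the VRAM was running hot.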

Roswell · January 21, 2021

Bad VRAM is by far the biggest culprit in GPU failure. Manufacturers push the memory to the absolute limits and it suffers a shorter lifespan because of it.

Forget about your temperatures if they're within spec and just use your notebook. No reason to worry about things out of your control.

2K6Ejmt6IV72L8fwHZ5sPhtP6L · January 21, 2021

3 hours ago, Vitamanic said:

Bad VRAM is by far the biggest culprit in GPU failure. Manufacturers push the memory to the absolute limits and it suffers a shorter lifespan because of it.

Forget about your temperatures if they're within spec and just use your notebook. No reason to worry about things out of your control.

Yay for planned obsolescence ruining an otherwise perfectly good GPU! Is there something like memtest86+ but for VRAM so I can see if that's the case?

Since my new laptop doesn't have a discrete GPU, I guess I don't need to worry about grilled VRAM, so that's good.

Roswell · January 21, 2021

50 minutes ago, 2K6Ejmt6IV72L8fwHZ5sPhtP6L said:

Yay for planned obsolescence ruining an otherwise perfectly good GPU! Is there something like memtest86+ but for VRAM so I can see if that's the case?

Since my new laptop doesn't have a discrete GPU, I guess I don't need to worry about grilled VRAM, so that's good.

Yeah, OCCT has a VRAM error check. There are others but I'd stick with OCCT since others seemingly don't support memory sizes over 3.5GB.

2K6Ejmt6IV72L8fwHZ5sPhtP6L · January 21, 2021

8 hours ago, Vitamanic said:

Yeah, OCCT has a VRAM error check. There are others but I'd stick with OCCT since others seemingly don't support memory sizes over 3.5GB.

OCCT unfortunately looks to be Windows only and I'm not going to install Windows just for this, but it served as a good starting point to look for alternatives. I found this one which appears to be bootable and this one which I can run on the command line over SSH once I have that set up, so I won't have to rely on the laptop's video output to control it. Both are more convenient. Supporting more than 3.5GB isn't a concern, as that laptop has 1GB of VRAM.
