
Gigabyte GTX 1660 Super Falls from the bus (Linux) [Solved by installing new GPU]

Hello there.

Long time reader, first time poster, so first of all, let me thank y'all for being here and being awesome. 

Now, to my issue. 

A couple of days ago I assembled a new PC for my husband:

AMD Ryzen R5 3600

Gigabyte X570 Aorus Elite

Gigabyte GeForce GTX 1660 Super

Kingston HyperX Fury Black DDR4 2400MHz 32Gb

Thermaltake ToughPower Grand 850W

 

Please don't tell me that I messed up with the GPU choice for this system, I know that. But it was one of the very few options available at the moment. And it seems to be cursed. So, here's the issue.
The first boot went well. I ran a live iso of Linux Mint from a portable drive to see if everything was ok, then I left the machine running for a few hours.
When I got back to it, the screen was black, as if the monitor had fallen asleep. I tried to wake it, but nothing happened. So I thought it was just a glitch and rebooted the machine.

Linux Mint installation process went great, we celebrated having a new electronic pet and went to bed, leaving the machine on. 

By the next morning the screen had turned black and wouldn't wake up. All the lights were fine, all the fans were spinning, I was even able to connect to the machine over ssh and run some stuff, but the screen did not turn on.

So I rebooted the machine again and started digging through the logs.

The first thing that caught my attention was that Linux Mint Driver Manager did not detect the GPU model. After some messing with drivers, it finally recognized the model and updated the driver to the latest version. I was hoping that the issue was resolved, so I ran a Phoronix benchmark (Unigine Heaven) to see if everything was ok. It started nicely at 150FPS, but within a few minutes it dropped to 10FPS, then 2, and the monitor turned off again.

The temperature was fine (around 70C), so I went for the logs to see if the system was able to catch any issues. 

 

Here's what I found:

Quote

Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880178] NVRM: GPU at PCI:0000:03:00: GPU-4e344340-f34e-ef33-be92-72891a24cff9
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880182] NVRM: GPU Board Serial Number: 
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880187] NVRM: Xid (PCI:0000:03:00): 79, pid=1089, GPU has fallen off the bus.
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880190] NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880192] NVRM: GPU 0000:03:00.0: GPU is on Board .
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880202] NVRM: A GPU crash dump has been created. If possible, please run
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880202] NVRM: nvidia-bug-report.sh as root to collect this data before
Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880202] NVRM: the NVIDIA kernel module is unloaded.
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.072174] nvidia-gpu 0000:03:00.3: Refused to change power state, currently in D3
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.135458] xhci_hcd 0000:03:00.2: Refused to change power state, currently in D3
Apr 29 18:27:49 DEN-BSVCHK upowerd[1354]: unhandled action 'offline' on /sys/devices/pci0000:00/0000:00:01.2/0000:01:00.0/0000:02:02.0/0000:03:00.2/usb1
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.216153] xhci_hcd 0000:03:00.2: Refused to change power state, currently in D3
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.216160] xhci_hcd 0000:03:00.2: Controller not ready at resume -19
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.216162] xhci_hcd 0000:03:00.2: PCI post-resume error -19!
Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.216164] xhci_hcd 0000:03:00.2: HC died; cleaning up
Apr 29 18:27:59 DEN-BSVCHK kernel: [  446.740662] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Apr 29 18:28:09 DEN-BSVCHK kernel: [  456.741055] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c57e:0:0:0x0000000f
Apr 29 18:28:14 DEN-BSVCHK kernel: [  461.741105] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:4048:4032
Apr 29 18:28:19 DEN-BSVCHK kernel: [  466.740999] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:4048:4032
Apr 29 18:28:24 DEN-BSVCHK kernel: [  471.740893] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:4048:4032
Apr 29 18:28:29 DEN-BSVCHK kernel: [  476.740787] nvidia-modeset: ERROR: GPU:0: Error while waiting for GPU progress: 0x0000c57e:0 2:0:4048:4032
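
For anyone chasing the same error, here is a minimal Python sketch for pulling the NVRM Xid events out of kernel log lines like the ones quoted above, so you can see when and how often the GPU drops. The names `XidEvent` and `find_xid_events` are made up for this example, and the pattern is an assumption based only on the line format shown in this post; other driver versions may log differently.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class XidEvent(NamedTuple):
    """One NVRM Xid report found in a kernel log line (hypothetical helper type)."""
    uptime_s: Optional[float]  # seconds since boot, if the line carries a [ ... ] stamp
    pci_addr: str              # PCI address as printed by the driver, e.g. "0000:03:00"
    xid: int                   # Xid code, e.g. 79 in the log above
    message: str               # rest of the line after the Xid code


# Pattern modelled on the quoted line:
#   "... kernel: [  436.880187] NVRM: Xid (PCI:0000:03:00): 79, pid=1089, GPU has fallen off the bus."
# The bracketed uptime is optional because some log outputs omit it.
XID_RE = re.compile(
    r"(?:\[\s*(?P<uptime>\d+\.\d+)\]\s+)?"
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(lines: Iterable[str]) -> List[XidEvent]:
    """Return every Xid report found in the given log lines, in order."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if m is None:
            continue
        uptime = float(m["uptime"]) if m["uptime"] is not None else None
        events.append(XidEvent(uptime, m["pci"], int(m["xid"]), m["rest"].strip()))
    return events


# Small demo on two lines copied from the log in this post.
sample = [
    "Apr 29 18:27:49 DEN-BSVCHK kernel: [  436.880187] NVRM: Xid (PCI:0000:03:00): 79, pid=1089, GPU has fallen off the bus.",
    "Apr 29 18:27:49 DEN-BSVCHK kernel: [  437.216164] xhci_hcd 0000:03:00.2: HC died; cleaning up",
]
for event in find_xid_events(sample):
    print(event.xid, event.pci_addr, event.uptime_s, event.message)
```

To run it against a real log, pass the lines of an open log file to `find_xid_events` instead of `sample`; which file holds the kernel messages depends on the distro. It only counts and timestamps the events; it cannot tell you whether the GPU or the motherboard is at fault.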

After several hours of googling, I found a thread from the developer's mailing list with a very similar error and no solution. There was something about the Linux kernel having issues with power management on NVidia drivers, which they all know of, but haven't fixed yet. 

The thing is, I was not able to find the list of GPUs affected by the bug, or any recent posts by people with the same issue. 

 

What bugs me is that after the GPU falls off the bus, a reboot may or may not help. Sometimes the system fires up but the screen stays off, and then I have to shut the system down completely and start it with the button. It doesn't feel like a Linux driver issue. The "glitch" when the system was running from the live iso was the same thing, but the live iso doesn't use NVidia drivers, it uses nouveau, so, again, I believe it's a hardware problem.

 

The main problem is that I can't figure out how to check whether the GPU is faulty or the mobo is. I tried plugging the GPU into the second slot, and the issue stays the same, and my husband gets grumpier with every reboot. Due to the pandemic situation, I have absolutely no way to borrow a mobo and GPU from someone to test everything, and the return policies do not work now, so I can't send the GPU and/or mobo back. I can only order a new one, but I don't want to buy a GPU AND a mobo, it's going to be too tough for my budget.

 

So any help is appreciated. How can I figure out which one is faulty? What tests can I perform?  Please, help! 

 

UPDATE:

It was a faulty GPU. 

Installed a new AMD GPU, and the problem is gone.

Edited by BittaJam
Problem solved.

Try the proprietary drivers, can't hurt, right? I've had issues with nouveau drivers doing weird stuff on every post-900-series Nvidia GPU I've ever hooked up, although none quite this catastrophic.

¯\_(ツ)_/¯

 

 

Desktop:

Intel Core i7-11700K | Noctua NH-D15S chromax.black | ASUS ROG Strix Z590-E Gaming WiFi  | 32 GB G.SKILL TridentZ 3200 MHz | ASUS TUF Gaming RTX 3080 | 1TB Samsung 980 Pro M.2 PCIe 4.0 SSD | 2TB WD Blue M.2 SATA SSD | Seasonic Focus GX-850 Fractal Design Meshify C Windows 10 Pro

 

Laptop:

HP Omen 15 | AMD Ryzen 7 5800H | 16 GB 3200 MHz | Nvidia RTX 3060 | 1 TB WD Black PCIe 3.0 SSD | 512 GB Micron PCIe 3.0 SSD | Windows 11


Just now, BobVonBob said:

Try the proprietary drivers, can't hurt right? 

Sorry for not being quite clear about it (sleepless night because of all of that), when I wrote about the latest driver, I meant the latest NVidia driver, not nouveau. So this thing happens on both nouveau and proprietary drivers.


5 minutes ago, BittaJam said:

Sorry for not being quite clear about it (sleepless night because of all of that), when I wrote about the latest driver, I meant the latest NVidia driver, not nouveau. So this thing happens on both nouveau and proprietary drivers.

My mistake, I didn't realize you had tested on a full install as well as the live iso. In that case I'm pretty stumped. I found this after a quick search of Stack Exchange, it might be of use. It basically says to try downgraded drivers.

https://askubuntu.com/questions/1190195/nvidia-gtx1650-gpu-has-fallen-off-the-bus


13 minutes ago, BobVonBob said:

Basically says to try downgraded drivers.

Yep, seen this. But it's not the same case. I checked suspend mode, and it works absolutely beautifully, I never caught a glimpse of the issue in about 20 attempts.

The reason I think it's a hardware problem is that when I reboot the machine, the screen may not turn on, and in that case I need to shut down the machine, switch the power off and turn it back on. That's why I am afraid the motherboard might be the reason for all that. But again, it's just a guess and I can't figure out how to test it.
