GPU problems in F@H

The problem:
F@H disabled both GPU slots and the GPUs aren't recognized by several programs.

 

How I caused it:
I tried to enable Coolbits following this Guide: https://www.techticity.com/howto/how-to-control-nvidia-graphics-card-fan-speed-in-linux/
After creating the 20-nvidia.conf file (with the contents from the guide) in /etc/X11/xorg.conf.d, I rebooted, but the OS wouldn't load properly anymore: it got stuck at the wall of text that shows all the services and drivers starting, and eventually crashed. Afterwards, the PC wouldn't power on anymore and showed no signs of life. After resetting the CMOS it powers on again, but the GPUs don't show any output; only the iGPU works. The BIOS still tried outputting through GPU 1, so I had to pull the GPUs and set the iGPU as the primary display to get an output at all.

 

What I tried:
I reverted all the changes I made (deleted the 20-nvidia.conf file), reinstalled F@H completely, and tried the GPUs in every slot configuration imaginable. The GPUs are still disabled, although F@H does recognize them correctly.
NVIDIA X-Server Settings doesn't see the GPUs at all, neither does PSensor.
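
One way to narrow this down is to check each layer separately: whether the cards are visible on the PCI bus, and whether the `nvidia` kernel module is loaded. The sketch below is only an illustration. The helper names `count_nvidia_gpus` and `nvidia_module_loaded` are made up for this example, and it assumes `lspci` (from pciutils) and `lsmod` are installed.

```shell
#!/bin/sh
# Count NVIDIA display devices in lspci-style text read from stdin.
# Audio functions of the same card are deliberately not counted.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -c -i -E 'vga compatible controller|3d controller'
}

# Report whether the proprietary module shows up in lsmod-style text on stdin.
# The trailing space keeps nvidia_drm / nvidia_modeset from matching.
nvidia_module_loaded() {
    if grep -q '^nvidia '; then
        echo "loaded"
    else
        echo "not loaded"
    fi
}

# Only query the real system when the tools are present.
if command -v lspci >/dev/null 2>&1; then
    echo "NVIDIA GPUs on the PCI bus: $(lspci | count_nvidia_gpus)"
fi
if command -v lsmod >/dev/null 2>&1; then
    echo "nvidia kernel module: $(lsmod | nvidia_module_loaded)"
fi
```

If the cards show up on the bus but the module is not loaded, the problem is more likely on the driver side than in the hardware.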

 

Additional info:
I am using the proprietary Nvidia driver, which worked fine until now.
I noticed that when the system powers off, a text saying that Nvidia-Persistenced-Service failed pops up for a short time before the machine shuts off completely.

 

Specs:
CPU: i7 4790K
RAM: 2x4GB DDR3-1600
MB: ASUS H87M-Pro
GPU 1: RTX 2070 Super
GPU 2: GTX 1060 3GB
PSU: EVGA 750BR (750W)
SSD: LITEON 256GB SATA SSD
OS: Debian 12
F@H Client version: 7.6.21

 

Any ideas on how to fix this are greatly appreciated!

 

English is not my first language, so please excuse any confusion or misunderstandings on my end.

I like to edit my posts a lot.

 

F@H-Stats

The Folding rig:

CPU: Core i7 4790K

RAM: 8GB (2x4GB) DDR3-1600

GPU 1: RTX 2070 Super

GPU 2: GTX 1060 3GB

PSU: EVGA 750BR

Cooling: 2x Delta GFB1212VHG w. PWM

OS: Windows 11 Home

 

Linux let me down.

.- -- --- --. ..- ...

Did you try completely removing and reinstalling the Nvidia driver package?

I sold my soul for ProSupport.

Just now, Needfuldoer said:

Did you try completely removing and reinstalling the Nvidia driver package?

Not yet, I'll try that once I'm back home.

The not powering on is a bit odd though.  Even with borked Xorg/drivers it should still have booted you into text mode without crashing.

 

If it's any use, the settings on my desktop right now are:

cat /etc/X11/xorg.conf.d/01-nvidia.conf 
Section "Monitor"
    # HorizSync source: edid, VertRefresh source: edid
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "GBT M28U"
    HorizSync       246.0 - 246.0
    VertRefresh     48.0 - 144.0
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    Option         "NoLogo" "1"
    Option         "Coolbits" "12"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Stereo" "0"
    Option         "nvidiaXineramaInfoOrder" "DFP-3"
    Option         "metamodes" "3840x2160_120 +0+0 {ForceCompositionPipeline=On, AllowGSYNCCompatible=On}"
    Option         "SLI" "Off"
    Option         "MultiGPU" "Off"
    Option         "BaseMosaic" "off"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection
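
For anyone wondering what the `12` in `Option "Coolbits" "12"` means: the value is a bitmask. The sketch below decodes it using the bit meanings as I understand them from the NVIDIA driver README (4 = manual fan control, 8 = clock offsets on the PowerMizer page, 16 = overvoltage). The function name `decode_coolbits` and the short labels are made up for this example, so check the README for your driver version before relying on them.

```shell
#!/bin/sh
# Decode a Coolbits value into the features each set bit is understood to enable.
decode_coolbits() {
    value=$1
    out=""
    if [ $((value & 1)) -ne 0 ]; then out="$out legacy-overclock"; fi
    if [ $((value & 2)) -ne 0 ]; then out="$out sli-mixed-memory"; fi
    if [ $((value & 4)) -ne 0 ]; then out="$out fan-control"; fi
    if [ $((value & 8)) -ne 0 ]; then out="$out clock-offsets"; fi
    if [ $((value & 16)) -ne 0 ]; then out="$out overvoltage"; fi
    # Strip the leading space left by the first appended label.
    echo "${out# }"
}

decode_coolbits 12   # prints: fan-control clock-offsets
```

So `12` is simply 4 + 8, and a value of `4` would be enough for fan control alone.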

I'm curious how we're supposed to enable Coolbits on Wayland, as this will be an issue very soon.

 

Someone claims this works without coolbits: https://github.com/nan0s7/nfancurve

I must say I'm dubious, given I always used the nvidia-settings CLI and am fairly sure Coolbits was required.

Router:  Intel N100 (pfSense) WiFi6: Zyxel NWA210AX (1.7Gbit peak at 160Mhz)
WiFi5: Ubiquiti NanoHD OpenWRT (~500Mbit at 80Mhz) Switches: Netgear MS510TXUP, MS510TXPP, GS110EMX
ISPs: Zen Full Fibre 900 (~930Mbit down, 115Mbit up) + Three 5G (~800Mbit down, 115Mbit up)
Upgrading Laptop/Desktop CNVIo WiFi 5 cards to PCIe WiFi6e/7

12 minutes ago, Alex Atkin UK said:

The not powering on is a bit odd though.  Even with borked Xorg/drivers it should still have booted you into text mode without crashing.

I have no idea what caused that, it just did absolutely nothing when pressing the power button, until I reset the CMOS with the jumper.

After that, it rebooted a few times until it stayed on without giving a display output.

5 minutes ago, Average Nerd said:

I have no idea what caused that, it just did absolutely nothing when pressing the power button, until I reset the CMOS with the jumper.

Could the CMOS battery be failing?

5 minutes ago, Alex Atkin UK said:

Could the CMOS battery be failing?

I replaced it last year because it was flat, and it does retain settings with no problems, even when unplugged for longer periods of time (the computer boots just fine now, and it has never done that before).

I'll check if it's still good once I'm home.

5 hours ago, Needfuldoer said:

Did you try completely removing and reinstalling the Nvidia driver package?

I did now, and it's still dead. It still says "[Failed] Nvidia-Persistenced-Service" during startup; trying to start it after boot doesn't work and just points me towards "syslog", which I can't find.

I used the following command for removal:

apt purge "*nvidia*"

And this one for reinstallation:

apt install nvidia-driver firmware-misc-nonfree
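
On the missing "syslog": as far as I know, Debian 12 no longer installs rsyslog by default, so /var/log/syslog may simply not exist and the messages live in the systemd journal instead. A minimal sketch for pulling out the relevant lines follows. The helper name `filter_gpu_log` is made up for this example, and the unit name `nvidia-persistenced` is assumed to match what Debian's package installs.

```shell
#!/bin/sh
# Keep only log lines that mention the NVIDIA driver stack (or nouveau).
filter_gpu_log() {
    grep -i -E 'nvidia|nvrm|nouveau'
}

# Only query the journal when journalctl is available; reading it may need root.
if command -v journalctl >/dev/null 2>&1; then
    # Messages from the failing service for the current boot.
    journalctl -b -u nvidia-persistenced --no-pager 2>/dev/null | tail -n 20 || true
    # Kernel messages for the current boot, reduced to GPU-related lines.
    journalctl -b -k --no-pager 2>/dev/null | filter_gpu_log | tail -n 20 || true
fi
```

`systemctl status nvidia-persistenced` shows a shorter summary of the same service if the full journal is too noisy.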

 

 

3 hours ago, Alex Atkin UK said:

I'm curious how we're supposed to enable Coolbits on Wayland, as this will be an issue very soon.

According to the info section in the Settings, my system is on Wayland already, so might that have contributed?
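
The session type can also be confirmed from a terminal. The sketch below reads `XDG_SESSION_TYPE`, which desktop sessions normally set. The helper `xorg_conf_applies` is made up for this example and encodes an assumption: files in /etc/X11/xorg.conf.d are read by the Xorg server, so a Wayland session would most likely ignore a Coolbits option placed there.

```shell
#!/bin/sh
# Map a session type to whether xorg.conf.d is expected to be read by it.
xorg_conf_applies() {
    case "${1:-unknown}" in
        x11)     echo "yes" ;;
        wayland) echo "no" ;;
        *)       echo "unknown" ;;
    esac
}

echo "Session type: ${XDG_SESSION_TYPE:-unknown}"
echo "xorg.conf.d read by this session: $(xorg_conf_applies "${XDG_SESSION_TYPE:-}")"
```

This only describes the logged-in session; it does not by itself explain the boot failure.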

3 hours ago, Alex Atkin UK said:

Someone claims this works without coolbits: https://github.com/nan0s7/nfancurve

I don't really care about custom fan control, my original plan was to lower the power limits of the GPUs because they made the whole story of the house I'm in insufferably hot.
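
For lowering power limits specifically, `nvidia-smi -pl` should be enough once the driver works again, and to my knowledge it does not need Coolbits. A minimal sketch follows. The helper `clamp_power_limit` is made up for this example, and the wattages 150, 125 and 215 are placeholders rather than the real limits of either card, so query the actual range first.

```shell
#!/bin/sh
# Clamp a requested power limit (whole watts) into the range the card reports.
clamp_power_limit() {
    req=$1
    min=$2
    max=$3
    if [ "$req" -lt "$min" ]; then
        echo "$min"
    elif [ "$req" -gt "$max" ]; then
        echo "$max"
    else
        echo "$req"
    fi
}

# On a working system (left commented out on purpose):
#   nvidia-smi --query-gpu=index,power.min_limit,power.max_limit,power.default_limit --format=csv
#   sudo nvidia-smi -i 0 -pl "$(clamp_power_limit 150 125 215)"
```

Like the clock settings, the limit is applied per card via `-i` and has to be set again after a reboot.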

 

EDIT: The CMOS battery is fine, it's at 3.29 V.

5 hours ago, Average Nerd said:

According to the info section in the Settings, my system is on Wayland already, so might that have contributed?

I don't really care about custom fan control, my original plan was to lower the power limits of the GPUs because they made the whole story of the house I'm in insufferably hot.

 

EDIT: The CMOS battery is fine, it's at 3.29 V.

You can change the clock speeds from the CLI, which kinda works better: rather than limiting the TDP, you reduce the clock speed until it's at a more efficient point on the voltage curve.

 

# this is likely what nvidia-persistenced is already doing
sudo nvidia-smi -pm 1

# -i is the card number starting from 0, -lgc is min_clock,max_clock in MHz
# 2400 is around optimal efficiency on 4000 series in my testing
sudo nvidia-smi -i 0 -lgc 0,2400

 

You can then check power consumption by just issuing nvidia-smi without any parameters, or custom output like:

nvidia-smi --query-gpu=gpu_bus_id,timestamp,driver_version,pcie.link.gen.current,pcie.link.width.current,temperature.gpu,fan.speed,utilization.gpu,utilization.memory,memory.used,memory.free,clocks.current.graphics,clocks.current.sm,clocks.current.memory,power.draw --format=csv,noheader,nounits

I utilise this on my Folding@Home page.
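
With two cards in the box, the same kind of query can be reduced to a single number. The sketch below sums the `power.draw` column across all GPUs. The helper name `total_power_draw` is made up for this example, and it assumes the `csv,noheader,nounits` format so that each line is just a number of watts.

```shell
#!/bin/sh
# Sum one power reading per line (watts) from stdin and print the total.
total_power_draw() {
    awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

# Only query the driver when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits 2>/dev/null | total_power_draw || true
fi
```

Running it before and after a clock or power change gives a quick check of how much heat the change actually removes.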

 

It's not as good as on Windows, where you can actually change the voltage curve (OC Scanner in Afterburner), but useful nonetheless.

9 minutes ago, Alex Atkin UK said:

You can change the clock speeds from the CLI, which kinda works better: rather than limiting the TDP, you reduce the clock speed until it's at a more efficient point on the voltage curve.

 

 

You can then check power consumption by just issuing nvidia-smi without any parameters, or custom output like:

nvidia-smi --query-gpu=gpu_bus_id,timestamp,driver_version,pcie.link.gen.current,pcie.link.width.current,temperature.gpu,fan.speed,utilization.gpu,utilization.memory,memory.used,memory.free,clocks.current.graphics,clocks.current.sm,clocks.current.memory,power.draw --format=csv,noheader,nounits

I utilise this on my Folding@Home page.

 

It's not as good as on Windows, where you can actually change the voltage curve (OC Scanner in Afterburner), but useful nonetheless.

Okay, that's good to know for when the machine works again, thanks!

(Question: how do I determine a good clock speed? Are there some resources, or is it just trial and error?)

1 hour ago, Average Nerd said:

Okay, that's good to know for when the machine works again, thanks!

(Question: how do I determine a good clock speed? Are there some resources, or is it just trial and error?)

Trial and error.  On Windows I was able to view the voltage curve and see where the voltage starts to really climb; it also shows where on the curve it's currently running.

On Linux I just kept tweaking it based on watts vs PPD until it hit somewhere I was happy with.  Of course it's best to do this with a higher-scoring WU.
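
The watts-vs-PPD comparison can be written down as a small calculation. This is only a sketch of the idea: the helper names `ppd_per_watt` and `best_clock` are made up, and the three sample rows are invented placeholder numbers, not measurements from any card.

```shell
#!/bin/sh
# Points per day earned per watt drawn, for one measurement.
ppd_per_watt() {
    awk -v ppd="$1" -v watts="$2" 'BEGIN { if (watts <= 0) exit 1; printf "%.1f\n", ppd / watts }'
}

# Read lines of "max_clock_mhz ppd watts" and print the clock with the best PPD per watt.
best_clock() {
    awk '$3 > 0 { e = $2 / $3; if (e > best) { best = e; clock = $1 } }
         END { if (clock != "") print clock }'
}

# Placeholder notes from three hypothetical runs at different clock caps.
printf '%s\n' '1200 900000 95' '1500 1100000 120' '1800 1200000 160' | best_clock   # prints: 1200
```

The most efficient point is not necessarily the one to pick; a slightly higher clock may be worth it if the room temperature is still acceptable.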

The setting is lost between reboots and as it needs to run as root, I didn't automate it.  I should probably get round to adding nvidia-smi to the sudoers file so I can do that.
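
An alternative to the sudoers route would be a oneshot systemd unit, since systemd already runs it as root at boot. The sketch below just writes such a unit into a directory of your choice for review. The function name `write_gpu_clock_unit`, the unit name `gpu-clock-limit.service` and the path /usr/bin/nvidia-smi are assumptions for this example. Also note that, as far as I know, `-lgc` is documented for Volta and newer GPUs, so it may not apply to older cards such as a GTX 1060.

```shell
#!/bin/sh
# Write a oneshot unit that enables persistence mode and caps GPU clocks.
# $1 = directory to write into, $2 = maximum graphics clock in MHz.
write_gpu_clock_unit() {
    target_dir=$1
    max_mhz=$2
    cat > "$target_dir/gpu-clock-limit.service" <<EOF
[Unit]
Description=Limit NVIDIA GPU clocks at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -lgc 0,$max_mhz

[Install]
WantedBy=multi-user.target
EOF
}

# Example (manual steps, left commented out on purpose):
#   write_gpu_clock_unit . 1500
#   sudo cp gpu-clock-limit.service /etc/systemd/system/
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now gpu-clock-limit.service
```

Without `-i`, the clock lock is applied to every GPU the driver sees; add `-i 0` or `-i 1` to the second `ExecStart` line to target one card.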

On 4/16/2024 at 1:22 PM, Needfuldoer said:

Did you try completely removing and reinstalling the Nvidia driver package?

I have completely reinstalled the OS, and the issue persists. I don't get it: both of the GPUs used to work in Linux, and they still work just fine in Windows (tested both with FurMark).
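
One thing that would fit "works in Windows, fails in Linux even after a reinstall" is a firmware setting that changed with the CMOS reset. This is a guess, not something confirmed in this thread: if the reset turned Secure Boot on, the kernel may refuse to load an unsigned `nvidia` module built by DKMS. A minimal sketch for checking follows. The helper name `secure_boot_hint` is made up, and it assumes `mokutil` is installed and prints "SecureBoot enabled" or "SecureBoot disabled".

```shell
#!/bin/sh
# Turn mokutil-style text on stdin into a short hint.
secure_boot_hint() {
    if grep -q -i 'SecureBoot enabled'; then
        echo "Secure Boot is on: an unsigned nvidia module may be refused"
    else
        echo "Secure Boot does not appear to be enforcing"
    fi
}

# Only query the firmware state when mokutil is available.
if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state 2>/dev/null | secure_boot_hint
fi
```

If Secure Boot turns out to be off, the kernel log from the current boot is the next place to look for the reason the module does not bind to the cards.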
