UnRAID VM with GPU passthrough crashes system

xl3b4n0nx

I am trying to set up an Ubuntu VM with an NVIDIA GPU passed through. The GPU is not the primary GPU in the system, so passthrough shouldn't be an issue. I am running a Threadripper 1920X on the ASRock X399 Fatal1ty Professional Gaming. The GPUs in the system are a Quadro K4000 (primary), a GT 730, and a Quadro K2000; I am trying to pass through the Quadro K2000. When I start the VM, the VM and Docker managers hang, then the whole system hangs, and a clean shutdown is not possible. The error in the system log is "libvirtd tainted" followed by a stack trace that clearly points at the GPU. For reference, I have other VMs without GPUs running with no issues at all. Unraid Nvidia version 6.8.2. Diagnostics are attached.

 

VM settings:

CPU Mode: Host Passthrough

Machine: Q35-4.2

USB Controller: 2.0

GPU 1: VNC

GPU 2: Quadro K2000

Sound Card: Quadro audio

tower-diagnostics-20200209-1621.zip


Have you enabled the appropriate BIOS settings? IOMMU, Enumerate all IOMMU in IVRS, possibly changing Memory Interleave?


3 minutes ago, Windows7ge said:

Have you enabled the appropriate BIOS settings? IOMMU, Enumerate all IOMMU in IVRS, possibly changing Memory Interleave?

IOMMU is on. Where is Memory Interleave? How would it affect this?


6 minutes ago, xl3b4n0nx said:

IOMMU is on. Where is Memory Interleave? How would it affect this?

Enumerate all IOMMU in IVRS would probably have more impact. On Threadripper with multiple physical dies, AMD set it up so that enabling IOMMU alone only maps device-visible virtual addresses on die 0; enabling Enumerate all IOMMU in IVRS enables the mapping on both dies. From my own experience passing a GPU to a VM on my TR 1950X, this setting was necessary.

 

As for Memory Interleave, it's more of a performance optimization and may not help your issue. Changing Memory Interleave to Channel makes the hardware NUMA-aware, which lets you pin CPU cores to the die that is directly connected to the GPU so requests don't have to cross the Infinity Fabric.
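If you want to sanity-check the NUMA side once it's enabled, sysfs can tell you which node a card hangs off; something like this from the console should do it (43:00.0 is the K2000's address from the diagnostics, and -1 means no NUMA info is exposed):

cat /sys/bus/pci/devices/0000:43:00.0/numa_node
lscpu | grep -i numa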


1 minute ago, Windows7ge said:

Enumerate all IOMMU in IVRS would probably have more impact. [...]

Ohh ok. That makes sense. I was wondering if pinning CPU cores to a NUMA node was possible; I had tried to assign cores based on which die they were on. I will give these a shot. Thanks!


1 hour ago, Windows7ge said:

Enumerate all IOMMU in IVRS would probably have more impact. [...]

Do you know where the IVRS setting might be in an AsRock BIOS? I can't find the setting. 


1 minute ago, xl3b4n0nx said:

Do you know where the IVRS setting might be in an AsRock BIOS? I can't find the setting. 

I know it's under the Advanced menu. In my BIOS there are sub-menus in there that open up a lot of deeper configuration settings, including IVRS & Memory Interleave.

 

One of those should be under the North Bridge settings.


Just now, Windows7ge said:

I know it's under the Advanced menu. In my BIOS there are sub-menus in there that open up a lot of deeper configuration settings, including IVRS & Memory Interleave.

 

One of those should be under the North Bridge settings.

In Advanced I have the North Bridge settings; it has IOMMU and SR-IOV, both of which are enabled. I did find Memory Interleave under AMD CBS.


I have done this setup with only a single NVIDIA GPU. In my case I even passed it through from Unraid to an Ubuntu VM to Docker to Plex.

 

First off, as above, you need to make sure you have IOMMU enabled, which you can confirm in the summary on your Unraid dashboard.
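You can also confirm it from the console; on AMD the kernel logs AMD-Vi lines as the IOMMU initializes (exact wording varies by kernel version):

dmesg | grep -Ei 'AMD-Vi|IOMMU'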

Secondly, I'm not sure if this is required, but I have a single GPU and it didn't work until I got the Unraid Nvidia build, which includes the NVIDIA drivers and other components. You can get it through Community Applications: https://forums.unraid.net/topic/38582-plug-in-community-applications/ (just realised you already have the Nvidia build).

 

Then create your Ubuntu VM and assign the NVIDIA GPU. In the drop-down list, note down the two-digit value of its address. I have mine assigned as a secondary GPU so I can use VNC as the primary. (As I had a single GPU I had to dump my VBIOS, but you can skip that in your case as long as it's not the primary.)

Once it's been created, edit the VM and set it to Advanced XML view. Here you want to change your NVIDIA configuration: search through the XML for a slot ID that matches the address value you noted. Once you find that line, you want to make some changes.

 

First, the video section will look something like this, where the address value in this example is '06':

<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>

 

You want to add the multifunction attribute, so it looks like this:

 

<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>

 

Then further down you will see the audio section of the card; it will be one slot number higher and look something like this:

 

<address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>

You want to change it so that it's the same slot number as your graphics card (in our example 0x06) but function 1, so it will look like this:

<address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
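For context, both of those address lines live inside the card's <hostdev> entries, so the end result should look roughly like this (the source bus 0x43 matches the K2000's host address from the diagnostics; the guest-side slot value will vary):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x43' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x43' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x1'/>
</hostdev>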

 

You should then be able to boot and hopefully see the GPU. You can use either of these commands, depending on preference:

sudo ubuntu-drivers devices
lspci | grep -i VGA

 

If all good, then you just need to install your NVIDIA drivers (nvidia-smi ships with the driver's utils package, so it doesn't need a separate install):

 

sudo ubuntu-drivers autoinstall
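Once that finishes, a reboot and a quick nvidia-smi should list the K2000 if the passthrough is healthy (this is just the standard verification step, nothing Unraid-specific):

sudo reboot
nvidia-smi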

 


Just looking through your logs, your IOMMU groups look fine

 

Quote

42:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK208B [GeForce GT 730] [10de:1287] (rev a1)
    Subsystem: PNY GK208 [GeForce GT 730] [196e:1127]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia
42:00.1 Audio device [0403]: NVIDIA Corporation GK208 HDMI/DP Audio Controller [10de:0e0f] (rev a1)
    Subsystem: PNY GK208 HDMI/DP Audio Controller [196e:1127]
43:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107GL [Quadro K2000] [10de:0ffe] (rev a1)
    Subsystem: NVIDIA Corporation GK107GL [Quadro K2000] [10de:094c]
    Kernel modules: nvidia_drm, nvidia
43:00.1 Audio device [0403]: NVIDIA Corporation GK107 HDMI Audio Controller [10de:0e1b] (rev a1)
    Subsystem: NVIDIA Corporation GK107 HDMI Audio Controller [10de:094c]

Quote

/sys/kernel/iommu_groups/34/devices/0000:43:00.0
/sys/kernel/iommu_groups/34/devices/0000:43:00.1

 

From your syslog I can see this error

 

NVRM: Attempting to remove minor device 2 with non-zero usage count!

Which indicates it's bound somewhere. You haven't bound it to a Docker container or something?
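If fuser is available on the host (it's part of the standard psmisc tools, so it should be), it can show any process still holding the device nodes open; worth a look before unbinding:

fuser -v /dev/nvidia*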

It also seems to be using the GT 730 as the main GPU, which is odd. You could try dumping the VBIOS (or downloading one from TechPowerUp) and specifying the VBIOS in your VM's configuration.

 

Alternatively, you could try manually unbinding it and binding it to VFIO.

 

SSH into your Unraid box and run something like this:

 

echo 0000:43:00.0 > /sys/bus/pci/drivers/nvidia/unbind
echo 0000:43:00.1 > /sys/bus/pci/drivers/nvidia/unbind
echo 0000:43:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
echo 0000:43:00.1 > /sys/bus/pci/drivers/vfio-pci/bind
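You can confirm which driver claims each function before and after with lspci (addresses taken from your diagnostics):

lspci -k -s 43:00.0
lspci -k -s 43:00.1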

 


42 minutes ago, Jarsky said:

Just looking through your logs, your IOMMU groups look fine [...] Alternatively, you could try manually unbinding it and binding it to VFIO.

 

I don't have it bound to a Docker container. It was in use by the Folding@home container, but I removed the UUID from it, so it shouldn't be in use.


2 hours ago, xl3b4n0nx said:

In Advanced I have the North Bridge settings; it has IOMMU and SR-IOV, both of which are enabled. I did find Memory Interleave under AMD CBS.

For me IVRS is under Advanced\PBS


14 hours ago, Windows7ge said:

For me IVRS is under Advanced\PBS

There are no adjustable settings there for me. Everything is under CBS. 


15 hours ago, Jarsky said:

Just looking through your logs, your IOMMU groups look fine [...] Alternatively, you could try manually unbinding it and binding it to VFIO.

 

I tried the unbind commands and they ended up locking up the NVIDIA cards. I don't understand what is grabbing them. Is there a way to determine where they are being used?


3 minutes ago, xl3b4n0nx said:

There are no adjustable settings there for me. Everything is under CBS. 

Perhaps it's not applicable to the 1920X. Jarsky seems to have a better lead on the issue anyhow, and I don't know anything about Unraid.

 

Sorry, but best of luck! :D


2 hours ago, xl3b4n0nx said:

I tried the unbind commands and they ended up locking up the NVIDIA cards. I don't understand what is grabbing them. Is there a way to determine where they are being used?

Not completely sure; I've done limited troubleshooting with Unraid. I'm almost certain it's a binding issue though. You might be best to post the problem on the Unraid support forums, as they're far more experienced with issues like this, and their devs are often there to help as well.


On 2/10/2020 at 1:28 PM, Jarsky said:

Not completely sure; I've done limited troubleshooting with Unraid. I'm almost certain it's a binding issue though. You might be best to post the problem on the Unraid support forums, as they're far more experienced with issues like this, and their devs are often there to help as well.

I agree with you there. I have received much more help here than on the Unraid forums. Do you know what makes Linux bind GPUs? Two of mine shouldn't be in use at all, but they are bound.


On 2/9/2020 at 7:49 PM, Jarsky said:

Just looking through your logs, your IOMMU groups look fine [...] Alternatively, you could try manually unbinding it and binding it to VFIO.

 

So @Jarsky, I have found that all 3 GPUs in my system show the "Kernel driver in use: nvidia" line. If I try to unbind any of them, the terminal hangs and the process can't be killed. Is there a way to prevent this auto-grabbing on boot?


5 hours ago, xl3b4n0nx said:

So @Jarsky, I have found that all 3 GPUs in my system show the "Kernel driver in use: nvidia" line. If I try to unbind any of them, the terminal hangs and the process can't be killed. Is there a way to prevent this auto-grabbing on boot?

I'm not sure how you can do that. My understanding is that dumping the VBIOS and adding it to your VM's configuration takes care of this: when the VM loads, it unbinds the card and uses the dumped VBIOS to make it compatible with the VM. It's a bit beyond me why it's hard-locking the system.
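For reference, the usual sysfs route for dumping a VBIOS looks roughly like this, run against the card's address while nothing is using it (the output path is just an example):

cd /sys/bus/pci/devices/0000:43:00.0
echo 1 > rom
cat rom > /boot/k2000_vbios.rom
echo 0 > rom

Though given your cards are all held by the nvidia driver, that may well fail the same way the unbind did.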

 

Does journalctl reveal any errors during the lock that might indicate why it's happening? Run 'journalctl -xe' and see what errors are there. 


1 month later...

I am having basically the identical issue. Have you had any luck?

 

It looks like the card I want to pass through is also getting set as the main GPU. I will try downloading a BIOS file for it and see what that does. The system completely locking up makes troubleshooting pretty awful.


14 hours ago, Beaniiman said:

I am having basically the identical issue. Have you had any luck?

 

It looks like the card I want to pass through is also getting set as the main GPU. I will try downloading a BIOS file for it and see what that does. The system completely locking up makes troubleshooting pretty awful.

What I ended up having to do was force Unraid to not initialize the GPUs on boot by adding
 

append vfio-pci.ids=10de:13bb,10de:0fbc,10de:1287,10de:0e0f,10de:0ffe,10de:0e1b 

into the syslinux configuration on the boot USB. What that does is force the vfio-pci driver to grab the GPUs with those IDs before the NVIDIA driver can. If you run 'lspci -nn' you will see the [vendor:device] ID pairs for your own system.
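For anyone following along, the syslinux.cfg on the Unraid flash drive ends up looking roughly like this after the edit (your IDs will differ):

label Unraid OS
  menu default
  kernel /bzimage
  append vfio-pci.ids=10de:13bb,10de:0fbc,10de:1287,10de:0e0f,10de:0ffe,10de:0e1b initrd=/bzroot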

