ESXi Nvidia Card Passthrough - No Device Detected - Nvidia-SMI

Lurick

The problem:

Upon first boot of my system the card shows up under nvidia-smi and applications can use it; there is never an issue provided the VM does NOT reboot. After the first reboot of the VM that has the GPU assigned, the issues start, and the only "resolution" is to reboot the ESXi host.

After that VM reboot, the Quadro P4000 doesn't show up in "nvidia-smi" output and is not usable by any application, even though it shows up in lspci as a device. The issue does not happen with a Quadro M2000, although I've got a different issue with that card (it will sometimes not be detected between host reboots, but that's not much of an issue for me).

I have other PCIe cards (NICs) passed through to other VMs, no issues with those ever.

It seems that once the VM is rebooted I cannot use the card again until I reboot the physical host.

 

Host System Specs:

Mobo: Gigabyte C246-WU4

CPU: Intel Xeon E-2278G

Memory: 128GB DDR4 ECC memory

OS: ESXi 8.0a [just upgraded from 7.0u3i]

 

What I've tried/additional info:

ESXi 7.0u3i and 8.0a - Same issue across both, fully patched (no change)

Persists across VMs but the card still shows up in ESXi as Passthru and in lspci on the VMs

I've tried clean/fresh installs of VMs (no change)

I've removed it from the VM, toggled passthru off and back on, readded the card to the VM (no change)

Passing the card through as DirectPath IO and Dynamic DirectPath IO (no change)

I've disabled the svga adapter and made other parameter modifications (no change)

I've tried BIOS and EFI on multiple VMs (no change)

Tried different Nvidia drivers, rolling back, etc. (no change)

 

Debug output:

~$ sudo dmesg | grep NVRM 
[   16.415298] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.78.01  Mon Dec 26 05:58:42 UTC 2022
[   16.899405] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[   16.899778] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[   27.781972] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[   27.782157] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[   27.911073] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[   27.911477] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0

~$ sudo nvidia-smi
No devices were found

~$ lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:07.7 System peripheral: VMware Virtual Machine Communication Interface (rev 10)
00:0f.0 VGA compatible controller: VMware SVGA II Adapter
00:10.0 SCSI storage controller: Broadcom / LSI 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 PCI bridge: VMware PCI bridge (rev 02)
00:15.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:15.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:16.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:17.7 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.0 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.1 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.2 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.3 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.4 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.5 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.6 PCI bridge: VMware PCI Express Root Port (rev 01)
00:18.7 PCI bridge: VMware PCI Express Root Port (rev 01)
02:01.0 SATA controller: VMware SATA AHCI controller
03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000] (rev a1)
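
For comparing boots, the NVRM lines above can be tallied with a short script. This is a minimal Python sketch, not something from the original debugging session: the function name `count_init_failures` and the regular expression are assumptions, and the sample text is copied from the dmesg output above.

```python
import re
from collections import Counter

# Matches lines such as:
# [   16.899405] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
FAILURE_RE = re.compile(
    r"NVRM: GPU (?P<bdf>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def count_init_failures(dmesg_text: str) -> Counter:
    """Count RmInitAdapter failures per (PCI address, error code) pair."""
    failures = Counter()
    for line in dmesg_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures[(match.group("bdf"), match.group("code"))] += 1
    return failures


# Sample lines taken from the dmesg output in this post.
SAMPLE = """\
[   16.415298] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  525.78.01  Mon Dec 26 05:58:42 UTC 2022
[   16.899405] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
[   16.899778] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
[   27.781972] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x23:0xffff:1413)
"""

if __name__ == "__main__":
    for (bdf, code), count in count_init_failures(SAMPLE).items():
        print(f"{bdf}: {count} failed init(s), code {code}")
```

Saving `sudo dmesg` to a file after a fresh host boot and again after a VM reboot, then feeding each file to the function, would show whether the same error code appears every time.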

 

Expectations:

I suspect the card might be bad or dead/dying but I'd really like to know for sure before I spend any money on a replacement. Not 100% sure where to check in ESXi for logs that might help narrow down this problem either.
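
On the log question: ESXi keeps a per-VM `vmware.log` in the VM's folder on the datastore and a host-level `/var/log/vmkernel.log`. A rough Python sketch for skimming a copy of either file is below. The keywords are only guesses at useful search terms, and `matching_lines` is a made-up helper name.

```python
import sys
from typing import Iterable, Iterator, Sequence

# Assumed search terms; adjust them to whatever the logs actually contain.
DEFAULT_KEYWORDS = ("passthru", "nvidia")


def matching_lines(
    lines: Iterable[str], keywords: Sequence[str] = DEFAULT_KEYWORDS
) -> Iterator[str]:
    """Yield lines that contain any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    for line in lines:
        text = line.lower()
        if any(keyword in text for keyword in lowered):
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /path/to/copied/vmware.log
    with open(sys.argv[1], errors="replace") as handle:
        for hit in matching_lines(handle):
            print(hit)
```

Comparing the hits from a working boot with those from a boot after the VM reboot is the idea; the script itself does not interpret anything.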

Current Network Layout:

Current Build Log/PC:

Prior Build Log/PC:

What are the OSes being used?

Not a pro, not even very good.  I’m just old and have time currently.  Assuming I know a lot about computers can be a mistake.

 

Life is like a bowl of chocolates: there are all these little crinkly paper cups everywhere.

7 minutes ago, Bombastinator said:

What are the OSes being used?

Sorry, Ubuntu 20.04 and 22.04 for the VMs

4 minutes ago, Lurick said:

Sorry, Ubuntu 20.04 and 22.04 for the VMs

Nvidia famously doesn’t play super well with Linux, as they refuse to release anything but binaries to anyone. That’s a major reason Apple won’t work with them any more. They used to use mostly Nvidia cards but got burned too badly on one of their laptops.

3 minutes ago, Bombastinator said:

Nvidia famously doesn’t play super well with Linux, as they refuse to release anything but binaries to anyone. That’s a major reason Apple won’t work with them any more. They used to use mostly Nvidia cards but got burned too badly on one of their laptops.

The thing is the Nvidia M2000 works without issue after reboots of the VMs.

1 hour ago, Lurick said:

The thing is the Nvidia M2000 works without issue after reboots of the VMs.

That would be in keeping. The standard move seems to be adding another layer after the hardware so there is something to manipulate for crap like this. One has to guess at what one is manipulating though. Yes, it’s dumb. Everyone seems to think so. It makes people extremely suspicious of Nvidia, as if anyone who actually saw the code would go “but that’s bullshit!” or something. If Nvidia didn’t have a lock on the CUDA market I suspect no one would use them because of that.

Edited by Bombastinator

8 hours ago, Bombastinator said:

Nvidia famously doesn’t play super well with Linux, as they refuse to release anything but binaries to anyone. That’s a major reason Apple won’t work with them any more. They used to use mostly Nvidia cards but got burned too badly on one of their laptops.

Nvidia GPU support works perfectly fine in Linux. This is purely and ONLY a Linux community mindset complaint because they want and demand everything to be open source or user-compilable. Fact is they and everyone else simply have to get over this, as not every company will or should have to do this, and they are within their rights to supply their product however they like under the terms you agreed to when buying it.

 

Point being I've never had issues getting supported Nvidia GPUs working in Linux, bare metal or with VMs. The drivers could be a bit annoying to get working in the past but that really hasn't been a problem for a long time now. Also it is/was an RTFM problem and nothing more. Nvidia GPUs and drivers literally have worked fine since the AGP era.

 

@Lurick

Have you added these to your advanced VM configuration settings?

pciPassthru.64bitMMIOSizeGB 256
pciPassthru.use64bitMMIO TRUE

 

I think from memory the P4000 isn't "supported" for PCIe passthrough so this could be a device firmware lockup issue or something like that. It's more likely the above VM settings though if you don't have them.
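
For reference, in the VM's .vmx file those two options are stored as `key = "value"` lines. A minimal Python sketch that renders them, assuming the values suggested above (the helper name `render_vmx_lines` is hypothetical):

```python
from typing import Mapping

# The two advanced settings suggested above, with the suggested values.
MMIO_SETTINGS = {
    "pciPassthru.use64bitMMIO": "TRUE",
    "pciPassthru.64bitMMIOSizeGB": "256",
}


def render_vmx_lines(settings: Mapping[str, str]) -> str:
    """Render settings in the key = "value" form used by .vmx files."""
    return "\n".join(f'{key} = "{value}"' for key, value in settings.items())


if __name__ == "__main__":
    print(render_vmx_lines(MMIO_SETTINGS))
```

The same pairs can be entered through the advanced configuration parameters in the VM's settings instead of editing the file by hand.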

14 minutes ago, leadeater said:

Nvidia GPU support works perfectly fine in Linux. This is purely and ONLY a Linux community mindset complaint because they want and demand everything to be open source or user-compilable. Fact is they and everyone else simply have to get over this, as not every company will or should have to do this, and they are within their rights to supply their product however they like under the terms you agreed to when buying it.

 

Point being I've never had issues getting supported Nvidia GPUs working in Linux, bare metal or with VMs. The drivers could be a bit annoying to get working in the past but that really hasn't been a problem for a long time now. Also it is/was an RTFM problem and nothing more. Nvidia GPUs and drivers literally have worked fine since the AGP era.

 

@Lurick

Have you added these to your advanced VM configuration settings?

pciPassthru.64bitMMIOSizeGB 256
pciPassthru.use64bitMMIO TRUE

 

I think from memory the P4000 isn't "supported" for PCIe passthrough so this could be a device firmware lockup issue or something like that. It's more likely the above VM settings though if you don't have them.

It pissed off Apple enough that they chucked them as a vendor though, so not just the Linux community. I’m not saying they don’t work. Just that they tend to be avoided unless there is no choice (which not infrequently is the case), and if there is a problem there aren’t options like there are with other cards.

Edited by Bombastinator

42 minutes ago, Bombastinator said:

It pissed off Apple enough that they chucked them as a vendor though, so not just the Linux community.

That's not what annoyed Apple. What annoyed Apple was getting the blame for Nvidia hardware issues and the resulting device failures, plus the massive support cost and problems to rectify it, all while Nvidia refused to acknowledge and take responsibility for it.

 

Again, open source and user compiling is a user community complaint and only that. You really think Apple had problems packaging and deploying binaries? You really think Apple didn't have a license and contract for the driver code?

 

Please ignore the rhetoric of the Linux community when it comes to Nvidia, their complaints are very distorted.

 

42 minutes ago, Bombastinator said:

 I’m not saying they don’t work. Just that they tend to be avoided unless there is no choice (which not infrequently is the case) and if there is a problem there aren’t options like there are with other cards

In the professional and scientific world, no, Nvidia is not "avoided"; it's what everyone wants and needs. They are so wanted and needed that no other option is even remotely considered, and many don't even know there are other possible options, because in reality those are not options: they cannot sufficiently provide what is needed.

 

As much as I like AMD GPUs they simply are not an option. That will change now that some huge government level super computer contracts are going to use AMD GPUs but there is 10+ years of ground to be made up and that's not going to happen in 1 year.

Edited by leadeater

1 minute ago, leadeater said:

That's not what annoyed Apple. What annoyed Apple was getting the blame for Nvidia hardware issues and the resulting device failures, plus the massive support cost and problems to rectify it, all while Nvidia refused to acknowledge and take responsibility for it.

 

In the professional and scientific world, no, Nvidia is not "avoided"; it's what everyone wants and needs. They are so wanted and needed that no other option is even remotely considered, and many don't even know there are other possible options, because in reality those are not options: they cannot sufficiently provide what is needed.

 

As much as I like AMD GPUs they simply are not an option. That will change now that some huge government level super computer contracts are going to use AMD GPUs but there is 10+ years of ground to be made up and that's not going to happen in 1 year.

The one led to the other though.

5 minutes ago, Bombastinator said:

The one led to the other though.

No it didn't. Nvidia hardware faults have nothing to do with drivers and code. Nvidia failing to provide the support Apple wanted as a business during that issue has nothing to do with that at all. That's it. This whole thing is mythical crap by Linux users and I refuse to talk about it more. It's nonsense rubbish and I loathe that it comes up every damn time, because Apple is not a "Linux user" and zero consideration is given to the difference between the likes of Apple, a massive business who can negotiate code access and likely had it, and them.

 

There is one reason and one alone that annoyed Apple: failing hardware that they had to bear the cost and blame for, and that Nvidia refused to help with.

1 hour ago, leadeater said:

No it didn't. Nvidia hardware faults have nothing to do with drivers and code. Nvidia failing to provide the support Apple wanted as a business during that issue has nothing to do with that at all. That's it. This whole thing is mythical crap by Linux users and I refuse to talk about it more. It's nonsense rubbish and I loathe that it comes up every damn time, because Apple is not a "Linux user" and zero consideration is given to the difference between the likes of Apple, a massive business who can negotiate code access and likely had it, and them.

 

There is one reason and one alone that annoyed Apple: failing hardware that they had to bear the cost and blame for, and that Nvidia refused to help with.

I was talking about the Apple bit.

12 hours ago, leadeater said:

Nvidia GPU support works perfectly fine in Linux. This is purely and ONLY a Linux community mindset complaint because they want and demand everything to be open source or user-compilable. Fact is they and everyone else simply have to get over this, as not every company will or should have to do this, and they are within their rights to supply their product however they like under the terms you agreed to when buying it.

 

Point being I've never had issues getting supported Nvidia GPUs working in Linux, bare metal or with VMs. The drivers could be a bit annoying to get working in the past but that really hasn't been a problem for a long time now. Also it is/was an RTFM problem and nothing more. Nvidia GPUs and drivers literally have worked fine since the AGP era.

 

@Lurick

Have you added these to your advanced VM configuration settings?

pciPassthru.64bitMMIOSizeGB 256
pciPassthru.use64bitMMIO TRUE

 

I think from memory the P4000 isn't "supported" for PCIe passthrough so this could be a device firmware lockup issue or something like that. It's more likely the above VM settings though if you don't have them.

I have added the settings (although I did 32 instead of 256) and no change 😞

12 minutes ago, Lurick said:

I have added the settings (although I did 32 instead of 256) and no change 😞

Damn. Annoying thing is I'm sure I've experienced this before, either with a GPU or a RAID card, but I can't remember which. Really the only thing I can think of is the card isn't resetting properly during a VM reboot and locking up.

 

Question: if you shut down the VM while the GPU is working correctly and power it on again, does it still work? Is the problem only with VM reboot or with shutdown also?
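
If the card really is failing to reset, one thing that might be worth a look is the reset method ESXi uses for passthrough devices, which is set per vendor/device ID in `/etc/vmware/passthru.map` on the host. The sketch below only formats an entry for that file. Treat everything in it as an assumption to check against the comments in your own passthru.map: the helper name is made up, `10de` is NVIDIA's PCI vendor ID, `ffff` is the wildcard device ID, and `d3d0` is just one of the reset methods the file lists, not a known fix for this problem.

```python
# Reset methods as listed in the comments of passthru.map (assumed complete).
RESET_METHODS = {"flr", "d3d0", "link", "bridge", "default"}


def passthru_map_entry(
    vendor_id: str, device_id: str, reset_method: str, fpt_shareable: bool
) -> str:
    """Format one line: vendor-id device-id resetMethod fptShareable."""
    if reset_method not in RESET_METHODS:
        raise ValueError(f"unknown reset method: {reset_method!r}")
    shareable = "true" if fpt_shareable else "false"
    return f"{vendor_id}  {device_id}  {reset_method}  {shareable}"


if __name__ == "__main__":
    # Hypothetical entry: all NVIDIA devices, d3d0 reset, not shareable.
    print(passthru_map_entry("10de", "ffff", "d3d0", False))
```

Whether a different reset method changes anything for the P4000 is untested here; it is only a way to probe the "not resetting properly" theory.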

1 minute ago, leadeater said:

Damn. Annoying thing is I'm sure I've experienced this before, either with a GPU or a RAID card, but I can't remember which. Really the only thing I can think of is the card isn't resetting properly during a VM reboot and locking up.

 

Question: if you shut down the VM while the GPU is working correctly and power it on again, does it still work? Is the problem only with VM reboot or with shutdown also?

Good questions, now that the GPU is locked up or whatever I'd have to reboot the host which is rather a pain to do but I'll give it a shot later today. I'm pretty sure it would lock up but I'll test to be 100% sure.

10 hours ago, Bombastinator said:

I was talking about the Apple bit.

I suggest you have a read through this as it explains very well what went on, what actually happened and why. It will become very clear drivers, binaries etc had zero to do with anything Apple v Nvidia.

 

https://blog.greggant.com/posts/2021/10/13/apple-vs-nvidia-what-happened.html

 

It wasn't just Apple impacted by the Nvidia hardware issue, other much larger OEMs at the time like Dell also had problems and threatened to remove all Nvidia products from everything which forced Nvidia in to providing compensation for the problem they knowingly caused. It's not like Nvidia didn't know they were shipping defective product.

 

Nvidia is the reason Apple made the business decision to become a silicon developer/manufacturer and bring everything possible in-house. From getting burned by bad product supply, to inadequate support from the supplier, to the supplier getting vindictive and trying to block their business by filing patent lawsuits against other manufacturers that supplied them, you can sure bet this pissed Apple off a lot.

 

What's really, really clear and obvious to anyone open minded enough to look in to the situation is that drivers had absolutely nothing to do with what got between Apple and Nvidia. The only reason Linux users try and roll that situation out is to try and bring legitimacy to their complaint and that's what annoys me about it, because it gets in the way of someone asking for help or enquiring about an issue and then this gets brought up which offers zero help and has nothing ever to do with whatever problem they are having.

8 hours ago, leadeater said:

I suggest you have a read through this as it explains very well what went on, what actually happened and why. It will become very clear drivers, binaries etc had zero to do with anything Apple v Nvidia.

 

https://blog.greggant.com/posts/2021/10/13/apple-vs-nvidia-what-happened.html

 

It wasn't just Apple impacted by the Nvidia hardware issue, other much larger OEMs at the time like Dell also had problems and threatened to remove all Nvidia products from everything which forced Nvidia in to providing compensation for the problem they knowingly caused. It's not like Nvidia didn't know they were shipping defective product.

 

Nvidia is the reason Apple made the business decision to become a silicon developer/manufacturer and bring everything possible in-house. From getting burned by bad product supply, to inadequate support from the supplier, to the supplier getting vindictive and trying to block their business by filing patent lawsuits against other manufacturers that supplied them, you can sure bet this pissed Apple off a lot.

 

What's really, really clear and obvious to anyone open minded enough to look in to the situation is that drivers had absolutely nothing to do with what got between Apple and Nvidia. The only reason Linux users try and roll that situation out is to try and bring legitimacy to their complaint and that's what annoys me about it, because it gets in the way of someone asking for help or enquiring about an issue and then this gets brought up which offers zero help and has nothing ever to do with whatever problem they are having.

The conclusion is 180° opposite of what I read in another article, though the data is similar. This seems to be a matter of viewpoint. The claim there was similar, in that the GPUs were faulty and Nvidia wouldn’t admit it. Apple was unable to prove the GPUs themselves were faulty, though (in which case they could simply have sued Nvidia regardless of whether Nvidia chose to act), because to do so they would have had to look at the driver and Nvidia wouldn’t let them do that. The reason Apple gave was that Nvidia wouldn’t let them look at the driver. Dell’s near refusal was not mentioned in that article, and it is interesting that Nvidia would placate Dell but not Apple. A situation where a company could deliver faulty product and block the legal protections against that by not allowing the drivers to be viewed would make me, for one, stop using them, especially if I lost massive amounts of money and had to take it on the chin because Nvidia chose to own up to their error in Dell’s case but not in mine.
 

The whole driver thing would seem like a GREAT reason to not deal with them any more, not just for Apple but for Linux and indeed any maker not large enough to threaten Nvidia without the help of the legal system, because they are apparently using the whole binaries-only thing to sidestep it. It doesn’t matter whether Nvidia allows their product to work or not if they can ship trash with impunity. The article you linked makes me, as a whitebox builder, NEVER want to put an Nvidia GPU in my machine, as they are using their binaries-only thing to avoid the law and only react to their own bad products if they can be made to. I’m just one guy, even smaller than Apple, and if they’re not big enough I’m definitely not big enough. Screw them right back. I’ve previously been more or less agnostic about GPU brands, but this pushes me straight into the AMD camp. That’s ugly. And it’s straight up about the binaries. So more nuanced, but the conclusion is the same.

Seems to me that the only reason Nvidia still exists is because Dell was big enough to break through their Chinese wall of legal avoidance. It would seem that if a PC builder is smaller than Dell (which is, well, just about everyone) the Nvidia binary-only system is a very bad thing. It’s not that companies smaller than Apple aren’t annoyed by it, it’s that they can’t afford not to put up with it. As an end consumer I can. It was really stupid of Nvidia to use their binary system as protection from legal action, and stupider still to admit it to companies big enough to hurt them but not make whole the companies that weren’t. Companies smaller than Dell cannot afford to use Nvidia if they have a choice.

Edited by Bombastinator

@leadeater

I hate to end it like this but I kind of gave up on trying to figure this one out. I got a P2000 as well, which shows the exact same issues and leads me to believe something is up with either segmentation of the P vs M cards or there is a motherboard issue of some kind. I tossed my M2000 back in and, sure, I have some minor issues where it won't show up every other reboot, but beyond that it works for now, so I'm just going to keep it like this until I get more time to dig deeper. Having my pfSense router on the same box doesn't make life easier with reboots, so the fewer the better, lol.
