Nvidia GRID issues

Franck

I am trying a long shot and asking here about an issue that has been troubling us for about 10-12 weeks, and we have not had much luck finding anything yet.

Note that I am aware that this is very niche programming, but that's the kind of programming I have been doing for over two decades.

 

So I have a Citrix server with VMware; I think it is called vSphere, but I am not 100% sure.

Anyway, it runs 4 x Nvidia A16, each with 4 nodes, for a total of 16 nodes. Great, so at least 16 assignable GPUs with Nvidia GRID.

 

So I am focusing on a single simple VM with only a fresh Windows install and a homemade 3D CAD app. It can run either DirectX or OpenGL for rendering.

Now, with either one, at some point we get the error "The device is not ready".

The problem is that it happens at the driver level, as I cannot catch that error in any way through code.

 

I am still working with different support teams on the issue. We tested lots of DirectX and OpenGL features and have seen issues only on the OpenGL side.

Note, though, that the message pops up even if we run with the DirectX engine, so it could be that OpenGL sees the issue while DirectX brushes over it and falls back on something else, so its features are "fully working" while it is also not working.

 

On OpenGL I have issues with shaders not initializing.

 

We tried:

- installing new GRID drivers

- recreating the VM

- changing assigned cores

- changing assigned GRID node

- changing resolutions

 

Note that these VMs are displayed through Citrix Workspace, so the "remote desktop" runs inside a web browser tab with what I believe is HLS streaming.

 

Any idea how to troubleshoot this issue, or has anyone had a similar issue?

 

I know that "The device is not ready" is a very, very generic error, and Windows does not log it in Problem Reports or Event Viewer.

 

NOT a programmer, but my instinct is telling me that the VM setup you have for Windows may want a more granular connection to the device in this case. That, or something at the BIOS level for VM support is not aligned to allow for hardware bypass.

 

 

"The device is not ready", to me, means something higher priority is eating the resource and keeping it busy.

Hope that helps.

I have some limited experience with GRID & vSphere.

 

Is this a day 1 problem or did it crop up out of nowhere?

 

I'd try different software for accessing the VM: Windows RDP or Parsec. Something that will use the vGPU video encoder.

 

Otherwise, make sure you're using a Q profile (GRID 240Q, 260Q, etc.). That determines whether NVENC is supported.

 

Do you routinely log out of these VMs or do users leave their sessions open after exiting? Could be a contributor.
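
One way to check the profile from inside the Windows guest is to ask `nvidia-smi` what it sees. The snippet below is only a rough sketch: it assumes `nvidia-smi` is on the PATH in the VM and that the reported GPU name carries the profile as a suffix such as `A16-4Q` (that naming, and the helper names `parse_gpu_lines` and `query_gpus`, are assumptions, not something confirmed in this thread).

```python
import re
import shutil
import subprocess

# Standard nvidia-smi query flags; the fields come back as one CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_lines(text):
    """Turn 'name, driver, memory' CSV rows into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip rows that do not have the three expected fields
        name, driver, memory = parts
        # Assumption: a vGPU name ends in "-<size><series letter>", e.g. "-4Q".
        match = re.search(r"-(\d+)([A-Za-z])$", name)
        gpus.append({
            "name": name,
            "driver": driver,
            "memory": memory,
            "is_q_profile": bool(match) and match.group(2).upper() == "Q",
        })
    return gpus


def query_gpus():
    """Run nvidia-smi; return parsed rows, or None if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi not found or it returned an error")
    else:
        for gpu in gpus:
            print(gpu)
```

If the query returns nothing at all inside the VM, that by itself would suggest the guest driver is not talking to the vGPU, which is worth ruling out before digging into the application.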

2 hours ago, Windows7ge said:

I have some limited experience with GRID & vSphere.

 

Is this a day 1 problem or did it crop up out of nowhere?

 

I'd try different software for accessing the VM: Windows RDP or Parsec. Something that will use the vGPU video encoder.

 

Otherwise, make sure you're using a Q profile (GRID 240Q, 260Q, etc.). That determines whether NVENC is supported.

 

Do you routinely log out of these VMs or do users leave their sessions open after exiting? Could be a contributor.

It is a day 1 issue. Brand new server, and we are trying to set this up for the first time.

I am trying to get access to the VM through direct RDP for tests; still waiting for the bridge to be made since it's in a DMZ.

Will try playing with the Q profile after the remote test.

And yes, the VMs are routinely closed. I can't even count how many times I rebooted them for test purposes.

 

Thanks for the input.

 

 

 

7 hours ago, Franck said:

It is a day 1 issue. Brand new server, and we are trying to set this up for the first time.

I am trying to get access to the VM through direct RDP for tests; still waiting for the bridge to be made since it's in a DMZ.

Will try playing with the Q profile after the remote test.

And yes, the VMs are routinely closed. I can't even count how many times I rebooted them for test purposes.

 

Thanks for the input.

I should have read a little deeper. You said A16. My brain defaulted to GRID K2, a much, much older card.

 

With vSphere, did you have to install vCenter to assign vGPUs to VMs, or did you pass whole GPU dies via hardware pass-through? I'm seeing the A16 is a quad-GPU card, correct?

13 hours ago, Windows7ge said:

With vSphere, did you have to install vCenter to assign vGPUs to VMs, or did you pass whole GPU dies via hardware pass-through? I'm seeing the A16 is a quad-GPU card, correct?

I don't know how they pass the GPU to the VM. I don't have access to it yet, as I was only testing on the software side to see if I could find the issue.

 

Yes, each A16 has 4 "sub cards" inside, with 16 GB per "sub card", so 64 GB total on a single physical card.

Each of the 4 "sub cards" can be split down to 2 GB vGPUs*, but I do not know if they can be split unequally, like 2 GB + 6 GB + 8 GB for 3 vGPUs.

Right now we have each one split equally into 4 x 4 GB, so we get 4 vGPUs per "sub card" and therefore 16 vGPUs on a single card.

 

* Although the NVIDIA datasheet says a single card can serve 64 VMs, which would mean it can be split into 1 GB vGPUs, and that contradicts other spec sheets that say 2 GB minimum.

 

We have 4 of those cards, so 64 vGPUs in total for us, in theory. I know we haven't configured them all yet, and I know I will eventually have 8-10 VMs with 8 GB vGPUs, so the end number will most likely be around 30 assignable vGPUs with all 4 cards.
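
For reference, the equal-split numbers above can be reproduced with a few lines of Python. This is just a sketch of the arithmetic, assuming every "sub card" uses the same profile size; it says nothing about whether unequal splits are allowed, and the constant and function names are made up for illustration.

```python
# Figures taken from the posts above.
GPUS_PER_CARD = 4        # "sub cards" per A16
MEMORY_PER_GPU_GB = 16   # frame buffer per "sub card"
CARDS = 4                # A16 cards in the server


def vgpus_per_card(profile_gb):
    """vGPU count for one card when every sub card is split into equal profiles."""
    if profile_gb <= 0 or MEMORY_PER_GPU_GB % profile_gb != 0:
        raise ValueError("profile size must divide the per-GPU memory evenly")
    return GPUS_PER_CARD * (MEMORY_PER_GPU_GB // profile_gb)


for profile_gb in (1, 2, 4, 8):
    per_card = vgpus_per_card(profile_gb)
    print(f"{profile_gb} GB profile: {per_card} per card, {per_card * CARDS} across {CARDS} cards")
```

With a 4 GB profile this gives 16 per card and 64 across the four cards, and a 1 GB profile gives the 64 per card quoted in the datasheet.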

 

I am still waiting on the actual IT team; I sent them an email asking for access to the admin tools again so I can check whether I find something similar to what you said and play with the settings a little more. I have pretty much exhausted everything I can do in software to find the issue.

 

  • 2 weeks later...

Finally got my hands on Remote Desktop access for the VM, and I see a lot more issues that Citrix Workspace seems to just bypass, which screws up my software. My code now hits the try/catch perfectly. So Citrix is as bad as it used to be 20 years ago.
