Dell R720 ESXi GPU Causing Fatal Server Crashes

Solved by BloodKnight7.

I am having a problem with my PowerEdge R720 and have no idea why. I purchased the riser cable/adapter to power a GPU and installed an AMD R9 280. The server starts perfectly fine and ESXi lists the GPU. Everything works until I try to power up a VM that has the card attached; at that point the server reports the hardware errors shown below. I have tried moving the GPU to a different slot, but the issue is still present. For verification, I put an AMD HD6850 in the server and had zero problems, including booting a VM with that card passed through. I'm not sure if it's a power issue or not. The server has dual 1100 W PSUs, so I wouldn't think power is the problem.

[Attached screenshots of the hardware errors: 1.PNG, 2.PNG]


A couple of questions: did you install ESXi with the Dell image?

Have you updated the firmware, BIOS, and iDRAC of the server?

 

Also, have you seen the following? GPU restrictions for the R720:

-Requires 2 CPUs
-CPU must be 95W or less
-Max of two double-wide GPGPUs (each takes up 2 slots)
-Max of four single-wide GPGPUs
-All GPUs must be the same type / model
-GPU requires redundant 1100W PSUs and the GPU enablement kit
-Two double-wide GPUs require optional riser 3
-Four single-wide GPUs cannot have optional riser 3
-TBU not supported
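
For anyone who wants to sanity-check a build against the list above, here is a rough Python sketch. The function name, the `Gpu` class, and every field name are made up for illustration; this is not a Dell tool, and it only encodes the rules as written in the list (the TBU rule is left out). The example build is a guess based on this thread: it assumes the R9 280 counts as double-wide and treats "power cable only" as not having the full enablement kit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gpu:
    model: str          # e.g. "R9 280"
    double_wide: bool   # True if the card takes up 2 slots


def check_r720_gpu_config(
    cpu_count: int,
    cpu_tdp_watts: int,
    gpus: List[Gpu],
    psu_watts: int,
    psu_redundant: bool,
    has_enablement_kit: bool,
    has_riser3: bool,
) -> List[str]:
    """Return the restrictions from the list that the build violates."""
    problems: List[str] = []
    if not gpus:
        return problems  # no GPU installed, nothing to check

    double = sum(1 for g in gpus if g.double_wide)
    single = len(gpus) - double

    if cpu_count != 2:
        problems.append("requires 2 CPUs")
    if cpu_tdp_watts > 95:
        problems.append("CPU must be 95W or less")
    if double > 2:
        problems.append("max of two double-wide GPUs")
    if single > 4:
        problems.append("max of four single-wide GPUs")
    if len({g.model for g in gpus}) > 1:
        problems.append("all GPUs must be the same type / model")
    if not (psu_redundant and psu_watts >= 1100):
        problems.append("requires redundant 1100W PSUs")
    if not has_enablement_kit:
        problems.append("requires GPU enablement kit")
    if double == 2 and not has_riser3:
        problems.append("two double-wide GPUs require optional riser 3")
    if single == 4 and has_riser3:
        problems.append("four single-wide GPUs cannot use optional riser 3")
    return problems


# Hypothetical build loosely based on this thread (values are assumptions):
# two 95W CPUs, one double-wide card, dual 1100W PSUs, power cable only.
issues = check_r720_gpu_config(
    cpu_count=2, cpu_tdp_watts=95,
    gpus=[Gpu("R9 280", double_wide=True)],
    psu_watts=1100, psu_redundant=True,
    has_enablement_kit=False, has_riser3=False,
)
print(issues)
```

Note that a check like this says nothing about whether a given card model is officially supported, which is a separate question.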

 

I'm not sure I'm being 100% accurate, but from a quick Google check your Radeon seems to consume about 200 W. The slot in the server will give you 75 W, if I remember correctly. I'm not an expert in consumer graphics cards, so: does your graphics card have any connector going into it to make up the difference in power?
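
The arithmetic behind that question can be sketched in a few lines of Python. The 200 W and 75 W figures are the ones quoted above; the auxiliary connector ratings (75 W for a 6-pin, 150 W for an 8-pin) are the usual nominal values and are an assumption added here, as are the function and constant names.

```python
PCIE_SLOT_WATTS = 75  # slot figure quoted in the post above

# Assumed nominal ratings for auxiliary PCIe power connectors
# (not taken from this thread).
AUX_CONNECTOR_WATTS = {"6-pin": 75, "8-pin": 150}


def power_shortfall(card_watts, aux_connectors=()):
    """Watts the card still lacks after the slot and any aux connectors."""
    available = PCIE_SLOT_WATTS + sum(
        AUX_CONNECTOR_WATTS[c] for c in aux_connectors
    )
    return max(0, card_watts - available)


# Slot power only: a ~200 W card is 125 W short.
print(power_shortfall(200))                      # 125
# With one 6-pin and one 8-pin connector attached, nothing is missing.
print(power_shortfall(200, ["6-pin", "8-pin"]))  # 0
```

So a card in that power class cannot run from the slot alone; whatever cable feeds its auxiliary connectors has to cover the remainder.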


@BloodKnight7

No, I used the official ESXi 6.7 image from VMware. The firmware, BIOS, and iDRAC are all up to date. I have 2 E5-2630 (version 1) CPUs installed, which are rated at 95W. I am only trying to install one GPU. Looking at what is provided with a GPU enablement kit (LINK), I only need the power cable for the GPU. If power is the issue, I'm surprised Dell didn't put in an error message along the lines of "bus fatal error power draw limit exceeded" or something like that.

The attached image shows the card with the riser power adapter. Note that I have tried the card in both that slot and the top middle slot, with the same result.

[Attached image: r720-gpu-issue.jpg]


On 4/25/2020 at 6:39 AM, SgtKilgore406 said: (previous post, quoted in full)

You should always try to use the official Dell image; you can find it on support.dell.com under your PowerEdge model. Normally those images contain the most stable drivers for the hardware of that generation of servers.

Regarding the card, my guess is that since those cards are not among the officially supported models, there was no need for such specific messages. But I can't be sure.


@BloodKnight7

 

The only reason I don't use the official image for my model is that the latest version they have is 6.5 U3 and I'm running 6.7. I've started looking at possibly migrating to Proxmox instead so that I can use live migration and high availability. Regarding the graphics card, I think it probably is trying to pull too much power and crashing the server. I'm currently looking at the Nvidia Grid K2 cards on eBay as a better-supported replacement.


On 5/1/2020 at 2:08 PM, SgtKilgore406 said: (previous post, quoted in full)

Just in case you want to try it, I'm 100% sure that this one works in the R720:

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=ynpj0

This is the latest 6.7 U3, but I haven't tested it on the R720:

https://www.dell.com/support/home/us/en/04/drivers/driversdetails?driverid=j75ny


@BloodKnight7

 

I wanted to give a final update regarding the official Dell images. I installed the Dell 6.5 U3 image, and stability with the AMD GPU was dramatically better. I was able to install the GPU, add it to the VM, install drivers in the VM, etc., all without issue. However, I ran into a problem getting it to work with Folding@Home in the VM. I could not get F@H to find the GPU and add it, which was the main purpose of adding the GPU in the first place. I have decided to focus on upgrading only the CPUs and RAM in that server. I have a Rosewill 4U mining case on the way and will work on building out a dedicated F@H box, forgoing virtualization. I really appreciated the help with troubleshooting.

Thanks!

SgtKilgore406

P.S.

I have since switched to Proxmox and am already enjoying the extra free features that would have cost real $$$ from VMware.


33 minutes ago, SgtKilgore406 said: (previous post, quoted in full)

Thanks for the update! My best wishes to you in your new projects :) If you need help with Dell stuff, just let me know.
