
Supermicro H12SSL-i Motherboard 4x3090s LLM training cluster (All 4 cards together can't handle 100% load and 100% power draw)

My training cluster runs Ubuntu 24.04 on an AMD EPYC 7282 and a Supermicro H12SSL-i motherboard, with a 4TB NVMe drive and 256GB of RAM. I have installed 4 GPUs, all of them variants of the NVIDIA RTX 3090 (Aorus, Zotac and Gigabyte).

 

- A Gigabyte XTREME AORUS RTX 3090 - max power draw of 420W. Connected to PCIE Slot 6. CUDA Device 0.

- A Gigabyte black RTX 3090 - max power draw of 370W. Connected to PCIE Slot 1. CUDA Device 2.

- A Gigabyte white RTX 3090 - max power draw of 370W. Connected to PCIE Slot 7. CUDA Device 1.

- A Zotac RTX 3090 - max power draw of 350W. Connected to PCIE Slot 3. CUDA Device 3.

 

I have two 1000W power supplies set up for this system. Power supply 1 feeds GPUs 0 and 1 only, and one 8-pin connector connects it to the motherboard. Power supply 2 feeds the motherboard, the CPU, and GPUs 2 and 3; it connects to the motherboard through one 8-pin connector plus the main ATX power connector. My CPU has a max TDP of 280W.
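To make the power split concrete, here is a rough budget sketch using only the wattages listed above. It assumes the whole CPU load sits on power supply 2 (in practice each supply has an 8-pin lead to the board, so the real split is unknown), and it ignores motherboard, drives, fans, and any power the cards draw through the PCIe slots. The function name `psu_load` is made up for illustration.

```python
# Maximum board power per card as listed in the post (watts), keyed by CUDA device.
GPU_MAX_W = {0: 420, 1: 370, 2: 370, 3: 350}
# Power supply feeding each card's PCIe power cables, per the post.
GPU_PSU = {0: 1, 1: 1, 2: 2, 3: 2}
CPU_MAX_W = 280      # figure quoted in the post
PSU_RATED_W = 1000   # each supply is rated at 1000W


def psu_load(active_gpus, cpu_psu=2):
    """Nominal worst-case load per power supply, in watts.

    Assumption: the entire CPU load is carried by `cpu_psu`.
    Motherboard, drives, fans and PCIe slot power are not counted.
    """
    load = {1: 0, 2: 0}
    for gpu in active_gpus:
        load[GPU_PSU[gpu]] += GPU_MAX_W[gpu]
    load[cpu_psu] += CPU_MAX_W
    return load


if __name__ == "__main__":
    for combo in [(0, 1, 2), (0, 1, 3), (0, 1, 2, 3)]:
        print(combo, psu_load(combo), "rated:", PSU_RATED_W)
```

On these nominal figures, running all four cards puts power supply 2 at its full rating before anything else on the board is counted, while the 0, 1, 3 combination loads it slightly less than 0, 1, 2 does.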

I am trying to run some stress tests (https://github.com/wilicc/gpu-burn). However, curiously, some of these are failing. I am trying to understand why.

 

Experiments Run:

1. I ran a stress test on all GPUs individually. All of these passed, with no errors.

2. I ran a stress test on sets of 2 GPUs at a time (all combinations). These also passed, with no errors.

This means, tests on

CUDA GPUS 0 and 1 work perfectly

CUDA GPUS 1 and 2 work perfectly

CUDA GPUS 2 and 3 work perfectly

CUDA GPUS 0 and 3 work perfectly

etc

Next, I ran stress tests on 3 GPUs together. Curiously, only one combination failed.

- When I ran the stress test on CUDA GPUS 0, 1 and 2, the test worked perfectly.

- When I ran the stress test on CUDA GPUS 1, 2 and 3, the test worked perfectly.

- When I ran the stress test on CUDA GPUS 0, 2 and 3, the test worked perfectly.

 

However:

 

- When I ran the stress test on CUDA GPUS 0, 1 and 3, the test failed, and the system crashed and rebooted.
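For anyone who wants to repeat the test matrix, the combinations above can be generated rather than typed by hand. This is only a sketch: it builds the command strings and does not run them. It assumes the gpu-burn binary is `./gpu_burn` in the current directory, that it takes the run time in seconds as its argument, and that `CUDA_VISIBLE_DEVICES` uses the same device numbering as in this post.

```python
import itertools

GPU_IDS = [0, 1, 2, 3]


def burn_commands(group_size, seconds=60, binary="./gpu_burn"):
    """Return (combo, shell command) pairs for every GPU subset of one size.

    `binary` and `seconds` are placeholders; adjust them to the local setup.
    """
    commands = []
    for combo in itertools.combinations(GPU_IDS, group_size):
        visible = ",".join(str(gpu) for gpu in combo)
        commands.append((combo, f"CUDA_VISIBLE_DEVICES={visible} {binary} {seconds}"))
    return commands


if __name__ == "__main__":
    for size in (1, 2, 3, 4):
        for combo, cmd in burn_commands(size):
            print(combo, "->", cmd)
```

With 4 cards there are 4 singles, 6 pairs, 4 triples and 1 full set, which matches the 3 passing triples plus the 1 failing triple listed above.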

When I limit the power draw to 100W for all GPUs, then it works fine when I run all 4 together. However, when I raise it to 150W, it fails. This doesn't make sense, because when I run the test on the unlimited 0, 1, 2 combination, the test works perfectly.
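A quick sketch of the arithmetic behind that puzzle, plus the commands for setting the caps. The totals use the maximum wattages listed earlier in this post. The `nvidia-smi -i <index> -pl <watts>` form normally needs root, and the `nvidia-smi` index may not match the CUDA device number unless the device order is pinned to the PCI bus order, so treat the indices here as an assumption.

```python
# Maximum board power per card as listed in the post (watts), keyed by CUDA device.
GPU_MAX_W = {0: 420, 1: 370, 2: 370, 3: 350}


def capped_total(gpu_ids, limit_w=None):
    """Sum of per-card power ceilings; limit_w=None means no cap is applied."""
    total = 0
    for gpu in gpu_ids:
        ceiling = GPU_MAX_W[gpu]
        if limit_w is not None:
            ceiling = min(ceiling, limit_w)
        total += ceiling
    return total


def power_limit_commands(limit_w, gpu_ids=(0, 1, 2, 3)):
    """Build (not run) the nvidia-smi commands that set a per-card power limit."""
    return [f"sudo nvidia-smi -i {gpu} -pl {limit_w}" for gpu in gpu_ids]


if __name__ == "__main__":
    print("all 4 at 100W:", capped_total((0, 1, 2, 3), 100))   # passes
    print("all 4 at 150W:", capped_total((0, 1, 2, 3), 150))   # crashes
    print("0, 1, 2 uncapped:", capped_total((0, 1, 2)))        # passes
    print("\n".join(power_limit_commands(150)))
```

The crashing 150W run has a combined GPU ceiling of 600W, roughly half of the 1160W nominal ceiling of the uncapped 0, 1, 2 run that passes, which is why total GPU wattage on its own does not look like the explanation.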

Mainly, I'm curious, because the combination of 0, 1, 2 works, which splits the power supply, but 0, 1, 3 doesn't, even though the power setup is nearly identical. Do I need to change where my GPUs are slotted in? Do I need to get more power supplies? Is there an issue with the power draw from my commercial apartment? 


The motherboard and CPU alone can consume up to 200-250 watts.

 

Get a power meter (a random Amazon search result: https://www.amazon.com/Electricity-Monitor-Voltage-Overload-Protection/dp/B07DPJ3RGB/) and measure the power consumption of each power supply. Assume around 90% efficiency, so a PSU will pull around 1100 watts from the wall when it delivers 1000W to the components.
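The efficiency estimate works in both directions, which helps when reading the meter. A small sketch, assuming a flat 90% efficiency (real supplies vary with load); the function names are made up for illustration.

```python
def wall_draw(output_w, efficiency=0.9):
    """Watts pulled from the wall for a given DC output, at an assumed efficiency."""
    return output_w / efficiency


def dc_output(wall_w, efficiency=0.9):
    """Watts delivered to components for a given meter reading at the wall."""
    return wall_w * efficiency


if __name__ == "__main__":
    print(f"1000W to components -> about {wall_draw(1000):.0f}W at the wall")
    print(f"900W on the meter   -> about {dc_output(900):.0f}W to components")
```

So a meter reading near 1100W on one supply would mean that supply is already at its 1000W rating.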

 

 


I'm gonna hazard a guess that you're running into oddities from having the CPU spread between both PSUs. Without knowing the exact electrical layout of the PSUs and motherboard, it's already quite feasible that there are 200 watts that can sway between the two 1000W power supplies depending on exact conditions.

 

Running two ATX power supplies together is already a bit of a hacky dark art, and the fact that you're doing it with such a powerful system doesn't exactly help your situation. On top of that, you're trying to run both power supplies pretty close to their rated power draw.

 

I'd say get a 1600W power supply to replace one of the 1000W ones, and use it for two GPUs, the CPU, and motherboard power. Then use a 1000W one for the other two GPUs.

 

On that note, if you're running this in an apartment, then depending on whether you're in 110V or 220V land and what your local wiring standards are, you might be pushing these power supplies out of their acceptable AC voltage range.
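To put rough numbers on the mains side: current is power divided by voltage, so the same load pulls about twice the amps on a low-voltage grid. This sketch assumes both supplies are fully loaded at 90% efficiency and share one circuit, and it uses nominal 120V and 230V figures; none of that is confirmed for this setup.

```python
def circuit_current(wall_w, mains_v):
    """Amps drawn from the mains for a given wall power and nominal voltage."""
    return wall_w / mains_v


if __name__ == "__main__":
    # Two 1000W supplies at an assumed 90% efficiency, both on one circuit.
    wall_w = 2 * 1000 / 0.9
    for mains_v in (120, 230):
        amps = circuit_current(wall_w, mains_v)
        print(f"{wall_w:.0f}W at {mains_v}V -> {amps:.1f}A")
```

Compare the result with the breaker rating of the circuit the machine is plugged into. Heavy current on long or thin wiring is also where voltage sag at the outlet would come from.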


Earlier this week someone here had issues with his quad GPU setup too, but he used two 1650W PSUs with a 24-pin link cable in between so that both were controlled together, with 2 GPUs on each, the 24-pin on one and the 8-pin EPS on the other to share the load.
