
2x3090 machine locks up under specific loads

whipperpipper
Go to solution Solved by igormp,

Could you try power limiting those 3090 to 300 or 250W and see what happens?

I have a machine learning rig that's been running just fine for a few months.  But a recent project (training a diffusion model) causes it to reliably lose power after a minute of running.  SSH terminates, power drops to < 10 watts, no display output, GPU RGB turns off, all fans turn off, the power button becomes unresponsive (neither pressing nor holding has any effect), but, oddly enough, the RAM RGB keeps going just fine.  Only power cycling the PSU brings the machine back up.  Of note, the GPU RGB is driven solely by the motherboard.  So it isn't just the GPUs shutting off or something.  Either the motherboard or the PSU seems to drop power to almost everything (except, clearly, the RAM...).

 

Other workloads didn't cause this problem.  I've run loads on it that push the GPUs harder (more Watts) than this one without issue.  This load is <700W total, according to the UPS.  I've tried the latest drivers, different CUDA versions, etc.  The same workload runs fine if I use only a single GPU, but using both GPUs causes the power loss after a minute.

 

I figured either a bad PSU or PSU OCP triggering, even though load is <700W and higher loads ran fine.  Either way, I bought a new Corsair HX1200 but no improvement.

 

  • Ubuntu 22.04
  • EVGA FTW3 3090
  • NVIDIA driver 470.141.03 (tested 515.65.01)
  • Corsair HX1200 (RM1000x previously)
  • 5900X
  • ASUS AM4 TUF Gaming X570-Plus (Wi-Fi)
  • BIOS 4204 (latest working)

 

 

Thank you!


OCP triggering would be a hard shutdown, not the machine locking up and going to an idle state.


Have you tried running the training with one 3090 removed? That will at least confirm it's not a power supply issue.



Have you tried ssh'ing into the machine after it supposedly locks up? Could be that the system is still running and the GPUs crashed.

 

Seems like idk how to read, since you mentioned SSH dropping out right at the beginning lol

Can't you see anything under journalctl right after the crash?
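For anyone wanting to check this, here is a minimal sketch of pulling the tail of the previous boot's log. It assumes the journal is stored persistently (with volatile storage nothing survives the power loss), and even then the last entries before a hard power drop may never have been flushed to disk. The helper name `previous_boot_log_cmd` is made up for illustration.

```python
import shutil
import subprocess


def previous_boot_log_cmd(priority="warning", lines=200):
    """Build a journalctl command for the tail of the previous boot's log."""
    return [
        "journalctl",
        "-b", "-1",        # previous boot (needs a persistent journal)
        "-p", priority,    # this priority and anything more severe
        "-n", str(lines),  # only the last N entries
        "--no-pager",
    ]


if __name__ == "__main__":
    cmd = previous_boot_log_cmd()
    print(" ".join(cmd))
    # Only run it where journalctl is actually installed.
    if shutil.which("journalctl"):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout or result.stderr)
```

NVIDIA driver errors, if any were logged, would usually show up among the kernel messages in that output.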

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


21 minutes ago, DeltaBruggemann said:

Have you tried running the training with one 3090 removed? That will at least confirm it's not a power supply issue.

It runs just fine if I set it to train on only one of the GPUs.
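The post doesn't say how the single-GPU run was selected. One common way, shown here as an assumption rather than what was actually done, is to hide the other card from CUDA with the `CUDA_VISIBLE_DEVICES` environment variable before launching the training script. `train.py` and the helper name `env_for_gpus` are placeholders.

```python
import os


def env_for_gpus(gpu_indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env


if __name__ == "__main__":
    # Print the three launch variants: each card alone, then both together.
    for gpus in ([0], [1], [0, 1]):
        env = env_for_gpus(gpus)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} python train.py")
```

Running the job on each card alone, and then on both, helps separate a single bad card from a problem that only appears under the combined load.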


Could you try power limiting those 3090 to 300 or 250W and see what happens?
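A minimal sketch of how that could be done with `nvidia-smi`, assuming the two cards are indices 0 and 1 and that the chosen wattage falls inside the range the card allows (`nvidia-smi -q -d POWER` reports the min and max limits). The helper name `power_limit_cmds` is made up, and the script only prints the commands rather than running them, since setting a limit needs root.

```python
def power_limit_cmds(gpu_indices, watts):
    """Build nvidia-smi commands that cap each GPU's board power."""
    # Persistence mode keeps the driver loaded so the setting is not
    # dropped when no process is using the GPU.
    cmds = [["nvidia-smi", "-pm", "1"]]
    for idx in gpu_indices:
        # -i selects the GPU, -pl sets its power limit in watts.
        cmds.append(["nvidia-smi", "-i", str(idx), "-pl", str(watts)])
    return cmds


if __name__ == "__main__":
    for cmd in power_limit_cmds([0, 1], 300):
        print("sudo " + " ".join(cmd))
```

The limit typically does not survive a reboot, so it would need to be reapplied at startup (for example from a boot-time script).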


7 minutes ago, igormp said:

Could you try power limiting those 3090 to 300 or 250W and see what happens?

Will do, and I'll see if there are any logs left over after the "crash".


It does sound like it could still be a power supply issue. Nvidia recommends a power supply rated for double the sustained max power of the 30-series GPUs so it can keep up with peak loads. With dual 3090s you're looking at a 1400 W+ PSU. The peak load causing the issue might not show up on the UPS load meter, as it could be happening faster than the display/sample rate of the user-facing power meter.
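A rough sketch of the arithmetic behind the 1400 W figure, and of why a slow meter can miss a spike. The 350 W per card is an assumed reference figure (the FTW3's actual limit may differ), the 2x factor is the rule of thumb from this post, and the spike trace is invented purely for illustration. The estimate covers the GPUs only, not the CPU or the rest of the system.

```python
def psu_for_gpus(gpu_count, sustained_watts, transient_factor=2.0):
    """GPU-only PSU estimate: sustained draw scaled by a transient headroom factor."""
    return gpu_count * sustained_watts * transient_factor


def window_average(samples):
    """What a meter reports if it averages every sample in its display window."""
    return sum(samples) / len(samples)


if __name__ == "__main__":
    print(psu_for_gpus(2, 350))  # 1400.0

    # One second of 1 ms samples: 700 W baseline with one 20 ms spike to 1400 W.
    trace = [700.0] * 1000
    for t in range(500, 520):
        trace[t] = 1400.0

    print(max(trace))             # 1400.0, what the PSU actually has to deliver
    print(window_average(trace))  # 714.0, what a 1 s averaging meter would show
```

In this made-up trace the meter reading barely moves even though the instantaneous draw doubles, which is consistent with the UPS showing under 700 W.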


1 hour ago, igormp said:

Could you try power limiting those 3090 to 300 or 250W and see what happens?

That seems to have done the trick.  Going for 15 minutes now with no lockup. 👍

 

55 minutes ago, DeltaBruggemann said:

It does sound like it could still be a power supply issue. Nvidia recommends a power supply rated for double the sustained max power of the 30-series GPUs so it can keep up with peak loads. With dual 3090s you're looking at a 1400 W+ PSU. The peak load causing the issue might not show up on the UPS load meter, as it could be happening faster than the display/sample rate of the user-facing power meter.

Yeah that seems like what's happening.  Crazy that they can spike that high.

 

I'll grab an even larger PSU and see if that fully resolves the issue with power limits off.

 

Makes me excited for the 4000W PSU I'll need to run the 4000 series cards...

 

I'll mark as solved for now and report back about the new PSU later.  Thanks so much for the help everyone!

 

I wouldn't even bother with a better PSU if power limiting them did the trick; the perf loss should be around 10% anyway:

3090 MaxQ fp16

https://www.pugetsystems.com/labs/hpc/Quad-RTX3090-GPU-Wattage-Limited-MaxQ-TensorFlow-Performance-1974/

 

If you're already training stuff for an entire day, another two hours or so won't hurt and your electricity bill will be happier.
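Rough arithmetic for that trade-off, assuming the 10% figure is a throughput loss and that the uncapped job takes 24 hours (both numbers are illustrative, not measurements from this machine):

```python
def capped_runtime_hours(base_hours, throughput_loss):
    """Runtime once throughput drops by the given fraction (0.10 means 10% slower)."""
    return base_hours / (1.0 - throughput_loss)


if __name__ == "__main__":
    base = 24.0
    capped = capped_runtime_hours(base, 0.10)
    # prints "26.7 h total, 2.7 h extra"
    print(f"{capped:.1f} h total, {capped - base:.1f} h extra")
```

That lands in the same ballpark as the "another two hours or so" estimate.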


On 8/14/2022 at 6:49 PM, igormp said:

I wouldn't even bother with a better PSU if power limiting them did the trick; the perf loss should be around 10% anyway:

 

That's a fantastic point ... which I wish I had read before ordering a new PSU...

 

In light of that fact, at least I can report that a 1600W PSU lets the system run without power limiting now.

 

 

 

Well, you can always return it if you wish to do so.
