
AMD CONSUMER CLUSTER | HELP NEEDED

So I have started doing deep learning work in my free time along with a friend.
I have a $10,000 budget (I love this work, so I want to invest in local hardware).
I also live in India, where the cheapest deals I could find are about $2,300 for a 4090 and $1,200 for a 7900 XTX.
I have a single 4090, which does 80+ TFLOPS FP16, since Nvidia stupidly keeps it at a 1:1 ratio with FP32, while the 7900 XTX does 123+ TFLOPS FP16 thanks to its 2:1 FP16:FP32 ratio, and it literally costs half as much.
I haven't seen anyone run a cluster of AMD consumer cards, except maybe George Hotz's "tinybox", which sells a 6x 7900 XTX machine for $15,000, though 4090 workstations are all over the internet.
So, does anyone have experience with or knowledge of such clusters? This would be my first time, so please help your boy out.
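
If anyone wants to sanity-check those spec-sheet numbers on their own card, a rough PyTorch sketch like this should do it (the matrix size and iteration count are just my guesses, and it assumes a GPU-enabled PyTorch build; note that on Nvidia an FP16 matmul goes through the Tensor Cores, so the measured number will land above the plain FP16 spec):

import time
import torch

# Rough FP16 matmul throughput check (a sketch, not a rigorous benchmark).
# On a ROCm build of PyTorch the same "cuda" device string targets an AMD card.
N = 8192                        # matrix size; big enough to be compute-bound
iters = 50
a = torch.randn(N, N, device="cuda", dtype=torch.float16)
b = torch.randn(N, N, device="cuda", dtype=torch.float16)

for _ in range(5):              # warm-up so clocks and caches settle
    a @ b
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * N**3 * iters        # 2*N^3 FLOPs per NxN matmul
print(f"~{flops / elapsed / 1e12:.1f} TFLOPS FP16")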


The reason you don't see many AMD builds is that most people work with CUDA, or with toolkits and frameworks designed around CUDA. As much as I like AMD GPUs, all you're going to do is give yourself pain and put yourself at a disadvantage by not using Nvidia for AI/ML/HPC, etc.

 

Maybe in a few years that will change as AMD actually gets used more in that market segment, but for now AMD GPUs are an unwise investment for anything other than gaming.

 

My advice: if you want to do it on the cheap, go used/second-hand, even if you have to ship from another country to yours. You could even get an A100 40GB within your stated budget, which is many times faster than the 4090 for quite a few Tensor operations and data types, as well as having a lot more VRAM and memory bandwidth.


8 minutes ago, leadeater said:

The reason you don't see many AMD builds is that most people work with CUDA, or with toolkits and frameworks designed around CUDA. As much as I like AMD GPUs, all you're going to do is give yourself pain and put yourself at a disadvantage by not using Nvidia for AI/ML/HPC, etc.

 

Maybe in a few years that will change as AMD actually gets used more in that market segment, but for now AMD GPUs are an unwise investment for anything other than gaming.

 

My advice: if you want to do it on the cheap, go used/second-hand, even if you have to ship from another country to yours. You could even get an A100 40GB within your stated budget, which is many times faster than the 4090 for quite a few Tensor operations and data types, as well as having a lot more VRAM and memory bandwidth.

The A100 has less FP16 than the 4090 (78 vs 82.5 TFLOPS) and costs 5 times more, so all that buys is more VRAM (and in a cluster the 4090s will have more VRAM in total anyway).


10 minutes ago, leadeater said:

The reason you don't see many AMD is most people work with CUDA or toolkits and frameworks designed around CUDA. As much as I like AMD GPUs all you're going to do is give yourself pain and disadvantage yourself by not using Nvidia for AI/ML/HPC etc.

This appears to be changing rapidly.  

Software like ollama is adding AMD GPU support, and AMD GPUs are getting used more and more.
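
For example, ROCm builds of PyTorch reuse the same "cuda" device alias, so existing scripts can at least detect and use an AMD card unchanged; a quick sketch like this shows which backend you actually got:

import torch

# On a ROCm build of PyTorch, torch.version.hip is set and the AMD GPU
# still appears under the "cuda" device alias; on a CUDA build,
# torch.version.cuda is set instead.
print("CUDA runtime:", torch.version.cuda)   # None on ROCm builds
print("HIP runtime: ", torch.version.hip)    # None on CUDA builds
print("GPU visible: ", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:     ", torch.cuda.get_device_name(0))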


3 minutes ago, pixeltrainer said:

The A100 has less FP16 than the 4090 (78 vs 82.5 TFLOPS) and costs 5 times more, so all that buys is more VRAM (and in a cluster the 4090s will have more VRAM in total anyway).

Do you want FP16 or Tensor FP16, though? Also, the going rate for an A100 40GB is $6,000-$7,000 USD.

 

If you need Tensor, then dollar for dollar a used A100 is actually better than the 4090. So just make sure you are comparing the right things; I can't tell you whether you need FP16, Tensor FP16, BF16, etc.

 

Either way you don't have to buy new.
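
In practice the Tensor numbers are what mixed precision hits; here's a minimal PyTorch sketch of what that looks like (the model and sizes are placeholders, not anything specific to your workload):

import torch

# Minimal mixed-precision training step: matmuls inside autocast run in
# FP16 on the Tensor Cores (on Nvidia), which is where the big "Tensor"
# TFLOPS figures come from. Plain FP16 tensor math outside autocast is
# what the lower vector-FP16 spec covers.
model = torch.nn.Linear(1024, 1024).cuda()    # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # loss scaling for FP16

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()                 # scaled to avoid FP16 underflow
scaler.step(opt)
scaler.update()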


3 minutes ago, ToboRobot said:

This appears to be changing rapidly.  

Software like ollama is adding AMD GPU support, and AMD GPUs are getting used more and more.

Yep, but you're better off buying for now rather than hoping it's better in 2-3 years, when what you have is another generation behind and slower again than the current stuff. Raw performance is often not even half the equation; it doesn't help if you can't leverage the hardware properly due to a bad software ecosystem.


17 minutes ago, leadeater said:

Yep, but you're better off buying for now rather than hoping it's better in 2-3 years, when what you have is another generation behind and slower again than the current stuff. Raw performance is often not even half the equation; it doesn't help if you can't leverage the hardware properly due to a bad software ecosystem.

I haven't had a chance to test it myself, but people seem to be using them now and not waiting years...

 

 


3 minutes ago, ToboRobot said:

I haven't had a chance to test it myself, but people seem to be using them now and not waiting years...

 

Yes, but there is quite a big difference between finding some people using it for some things and it being as widely supported across everything as CUDA is. ROCm is absolutely nowhere near the support and polish of CUDA. I'm honestly not looking to debate whether or not it's being used; of course people are using AMD GPUs even for AI/ML etc, but that doesn't mean it's a good idea for everyone to buy and try.

 

It's just not a good idea to find a single use case and call it as good as the other option, the clear market leader with now more than a decade of usage, support, and community behind it.


2 hours ago, ToboRobot said:

I haven't had a chance to test it myself, but people seem to be using them now and not waiting years...

 

 

You still have to jump through many hoops to get it properly set up, and as you can see in the link you gave, performance is subpar. If all you want to do is run llama or SD with a stack someone managed to get working on AMD, great! Otherwise you'll face tons of hurdles due to unsupported stuff in pytorch/tf.

 

3 hours ago, pixeltrainer said:

So, does anyone have experience with or knowledge of such clusters? This would be my first time, so please help your boy out.

Do you want to build boxes to rent out to people?

If the main use case is ML, AMD is pretty much out of the question; no one is going to pay for a cloud instance that gives headaches instead of an Nvidia box (of which you can find tons on vast.ai).

Those AMD GPUs are not even great for HPC, since their FP64 throughput is heavily cut down.
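
If you want to see that FP64 gap for yourself, a rough matmul timing like this will show it (sizes and iteration counts are arbitrary choices; recent gaming GPUs run FP64 at around 1/32 to 1/64 of the FP32 rate, while an A100 runs it at 1/2):

import time
import torch

def matmul_tflops(dtype, n=4096, iters=20):
    """Time n x n matmuls at the given dtype and return achieved TFLOPS."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(3):                     # warm-up
        a @ b
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - t0) / 1e12

# Expect a huge gap between these two numbers on any consumer card.
print(f"FP32: ~{matmul_tflops(torch.float32):.1f} TFLOPS")
print(f"FP64: ~{matmul_tflops(torch.float64):.2f} TFLOPS")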


