
SLURM and Server Configuration

Hi guys!

I need some recommendations on how to make the server configuration at our office more optimal. We are a startup and things are a bit janky 😄 So basically we have 3 GPU servers (Ubuntu Server) that we use to train deep learning models. To train a model, we connect to a server with SSH, git pull the repository with our code and then copy our data to the server over Ethernet (each data scientist has a session on the server). Then we use VSCode to train and modify our code if necessary. The problem is that this approach is really inefficient, for these reasons:

  1. The data is duplicated on each server (plus personal computers); all the data is also stored in the cloud.
  2. It's very difficult to make sure that we all work with the same version of the data.
  3. There is no resource management system: everyone can train on any GPU, even if it's already in use!
  4. You can end up using an Nvidia A100 80GB GPU when you only need an RTX 4090 or less...

Well, I'm sure there are plenty of other disadvantages to this way of doing things. Obviously it happened gradually as the company grew, so I don't blame anyone (especially since everything was done on Azure before).

Now what I want to implement:

  1. A NAS to store all our datasets for direct access from the GPU servers, so that all the data scientists work on the same data, and if someone changes a dataset, it is changed for everyone.
  2. Implement a SLURM job management system

I would especially need recommendations for the second point, since I only know that SLURM exists and roughly what it does. I have no idea how to implement it, what I need, etc... For the NAS, obviously I will have to go with a full SSD NAS, but we only need something like 8TB of storage, so that's okay. I just wonder how much RAM and CPU power I will need if speed is the key. I was also thinking of using a 10Gb switch to connect the NAS and the 3 GPU servers.
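From the bit of SLURM documentation I've skimmed, I imagine a training run would be submitted roughly like this (the partition name, GPU type and paths are just placeholders, nothing is actually set up yet):

#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=gpu                 # placeholder partition name
#SBATCH --gres=gpu:rtx4090:1            # ask for a "small" GPU instead of tying up the A100
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# environment and dataset paths are made up for the example
source ~/venvs/training/bin/activate
python train.py --data /mnt/datasets/my_dataset

# submitted with "sbatch train_job.sh" and monitored with "squeue"

If that's roughly how it works, it would already solve points 3 and 4 above.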

If someone has any idea how to implement this and is willing to provide details or guidance, I would be really grateful! Thank you in advance.

Cheers


Welcome to the forums!
Looking around, it seems like people are recommending a 10G NIC, 16-32GB of RAM, and a minimum of 8 cores of decent CPU for the head node.
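To give you an idea of where that head node fits, the relevant part of a slurm.conf for three GPU servers could look something like this (hostnames, core counts and GPU counts are made up, just a sketch):

# slurm.conf excerpt - the head node runs slurmctld, the GPU servers run slurmd
SlurmctldHost=head-node

GresTypes=gpu
NodeName=gpu-node-[01-03] CPUs=32 RealMemory=256000 Gres=gpu:4 State=UNKNOWN
PartitionName=gpu Nodes=gpu-node-[01-03] Default=YES MaxTime=INFINITE State=UP

# each GPU server also needs a gres.conf listing its GPUs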

5950X/3080Ti primary rig | 1920X/1070Ti Unraid for dockers | 200TB TrueNAS w/ 1:1 backup


Hi! Thank you very much, that does indeed seem like a great choice. Cheers


Hi, welcome to the forums.

Here are some programs that may help you. 💡

You can have a look into TrueNAS Scale for the NAS; it has a lot of features, and you can use NFS for communication between the GPU servers and the NAS server (direct connection).
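Once the share is exported from TrueNAS, mounting it on each GPU server is just something like this (the IP and paths are placeholders, adjust for your network):

# one-off mount for testing
sudo mount -t nfs 10.0.0.10:/mnt/tank/datasets /mnt/datasets

# or in /etc/fstab so it comes back after a reboot
10.0.0.10:/mnt/tank/datasets  /mnt/datasets  nfs  defaults,_netdev  0  0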

As for the GPUs, you can have a look at a professional product called XCP-ng.

It's made for virtualisation and can run diskless, so I've heard. I wish I had the money 💰 to buy the hardware and actually try those out in my tinker land.

You can also have a look at Docker containers, which support GPU passthrough for applications, and there are open-source web GUIs for Docker everywhere that can also use NFS shares. Portainer, for example, has a team system in it that is likely capable of controlling access to the GPUs, and it seems like a pretty neat web GUI for Docker, both free and paid.
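For the GPU passthrough part, once the NVIDIA Container Toolkit is installed on the host it's basically just the --gpus flag (the image tag is only an example):

# quick check that containers can see the GPUs
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# or pin a container to one specific GPU instead of grabbing all of them
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi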

There are plenty of ways to do it, but I don't know what applies to you best.

Good luck 👍

I'm a jank tinkerer; if it works, then it works.

Regardless of compatibility 🐧


How many scientists do you have?

SLURM is nice, but also pretty annoying. As an example, if your devs are using VSCode Jupyter notebooks to do stuff on the GPU servers, that wouldn't be possible with SLURM anymore since there are no interactive sessions (AFAIK).

If you're OK with having your devs submit a bash job with some workarounds to start a Jupyter server and then connect VSCode to it, then it could work, and you wouldn't have the issue of multiple folks trying to use the same GPU anymore. You could also look into MIG instances for that A100 to properly share it.
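The workaround would be something along these lines (the venv path, port and hostnames are placeholders, adapt them to your setup):

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=08:00:00

# start a Jupyter server on whatever node the job lands on
source ~/venvs/dl/bin/activate
jupyter lab --no-browser --ip=0.0.0.0 --port=8888

Once the job is running, squeue tells you which node it landed on, then from a laptop you open a tunnel like "ssh -N -L 8888:gpu-node-01:8888 you@head-node" and point the VSCode Jupyter extension at http://localhost:8888.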

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


Have a look at Nvidia Base Command Manager; it'll do most of the hard work for you.

https://www.nvidia.com/en-us/data-center/base-command/manager/

But as @igormp has pointed out, it's mostly about your processes, your workflows and how you do things, more than just finding a job management system like SLURM or setting up a cluster.


Hi guys,

@BoomerDutch Indeed, TrueNAS Scale fits our needs exactly. I'll have a look at the options you mentioned, thank you for your help!

@igormp At the moment we are 4 data scientists and we only use the servers for training purposes. Each of us has a gaming laptop with a decent GPU (e.g. an RTX 4070), so inference and Jupyter notebook stuff can be done on our own machines. But indeed, thanks for pointing out that it's not going to be as straightforward to run code on the servers. Thanks for your help!

@leadeater I'll have a look at that, thank you!

