
SLURM and Server Configuration

Hi guys!

I need some recommendations on how to make the server configuration at our office more optimal. We are a startup and things are a bit janky 😄 So basically we have 3 GPU servers (Ubuntu Server) that we use to train deep learning models. To train a model, we connect to a server with SSH, git pull the repository with our code and then copy our data to the server over Ethernet (each data scientist has a session on the server). Then we use VSCode to train and modify our code if necessary. The problem is that this approach is really inefficient, for these reasons:

  1. The data is duplicated on each server (plus personal computers); all the data is also stored in the cloud.
  2. It's very difficult to make sure that we all work with the same version of the data.
  3. There is no resource management system: everyone can train on any GPU, even if it's already in use!
  4. You can end up using an Nvidia A100 80GB GPU when you only need an RTX 4090 or less...

Well, I'm sure there are plenty of other disadvantages to this way of doing things. Obviously it happened gradually as the company grew, so I don't blame anyone (especially since everything was done on Azure before).

Now what I want to implement:

  1. A NAS to store all our datasets for direct access from the GPU servers, so that all the data scientists work on the same data, and if someone changes a dataset, it is changed for everyone.
  2. Implement a SLURM job management system

I would especially need recommendations for the second point, since I only know that SLURM exists and roughly what it does. I have no idea how to implement it, what I need, etc... For the NAS, obviously I will have to go with a full SSD NAS, but we only need something like 8TB of storage, so that's okay. I just wonder how much RAM and CPU power I will need if speed is the key. I was also thinking of using a 10Gb switch to connect the NAS and the 3 GPU servers.
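From the bit of SLURM documentation I've skimmed, I imagine a training run would be submitted roughly like this (the partition name, GPU type and paths are just placeholders, nothing is actually set up yet):

#!/bin/bash
#SBATCH --job-name=train-model
#SBATCH --partition=gpu                 # placeholder partition name
#SBATCH --gres=gpu:rtx4090:1            # ask for a "small" GPU instead of tying up the A100
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=24:00:00

# environment and dataset paths are made up for the example
source ~/venvs/training/bin/activate
python train.py --data /mnt/datasets/my_dataset

# submitted with "sbatch train_job.sh" and monitored with "squeue"

If that's roughly how it works, it would already solve points 3 and 4 above.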

If someone has any idea how to implement this and is willing to provide details or guidance, I would be really grateful! Thank you in advance.

Cheers


Welcome to the forums!
Looking around, it seems like people are recommending a 10G NIC, 16-32GB of RAM, and a minimum of 8 cores of decent CPU for the head node.
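To give you an idea of where that head node fits, the relevant part of a slurm.conf for three GPU servers could look something like this (hostnames, core counts and GPU counts are made up, just a sketch):

# slurm.conf excerpt - the head node runs slurmctld, the GPU servers run slurmd
SlurmctldHost=head-node

GresTypes=gpu
NodeName=gpu-node-[01-03] CPUs=32 RealMemory=256000 Gres=gpu:4 State=UNKNOWN
PartitionName=gpu Nodes=gpu-node-[01-03] Default=YES MaxTime=INFINITE State=UP

# each GPU server also needs a gres.conf listing its GPUs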

5950X/3080Ti primary rig | 1920X/1070Ti Unraid for dockers | 200TB TrueNAS w/ 1:1 backup


Hi! Thank you very much, that does indeed seem like a great choice. Cheers


Hi, welcome to the forums.

Here are some programs that may help you. 💡

You can have a look into TrueNAS Scale for the NAS; it has a lot of features, and you can use NFS for communication between the GPU servers and the NAS server (direct connection).
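Once the share is exported from TrueNAS, mounting it on each GPU server is just something like this (the IP and paths are placeholders, adjust for your network):

# one-off mount for testing
sudo mount -t nfs 10.0.0.10:/mnt/tank/datasets /mnt/datasets

# or in /etc/fstab so it comes back after a reboot
10.0.0.10:/mnt/tank/datasets  /mnt/datasets  nfs  defaults,_netdev  0  0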

As for the GPUs, you can have a look at a professional product called XCP-ng.

It's made for virtualisation and can run diskless, so I've heard. I wish I had the money 💰 to buy the hardware and actually try those out in my tinker land.

You can also have a look at Docker containers, which support GPU passthrough for applications, and there are open-source web GUIs for Docker everywhere that can also use NFS shares. Portainer, for example, has a team system in it that is likely capable of controlling access to the GPUs, and it seems like a pretty neat web GUI for Docker, both free and paid.
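For the GPU passthrough part, once the NVIDIA Container Toolkit is installed on the host it's basically just the --gpus flag (the image tag is only an example):

# quick check that containers can see the GPUs
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

# or pin a container to one specific GPU instead of grabbing all of them
docker run --rm --gpus '"device=0"' nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi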

There are plenty of ways to do it, but I don't know what applies to you best.

Good luck 👍

I'm a jank tinkerer; if it works, then it works.

Regardless of compatibility 🐧


How many scientists do you have?

SLURM is nice, but also pretty annoying. As an example, if your devs are using VSCode Jupyter notebooks to do stuff on the GPU servers, that wouldn't be possible with SLURM anymore since there are no interactive sessions (AFAIK).

If you're OK with having your devs submit a bash job with some workarounds to start a Jupyter server and then connect VSCode to it, then it could work, and you wouldn't have the issue of multiple folks trying to use the same GPU anymore. You could also look into MIG instances for that A100 to properly share it.
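The workaround would be something along these lines (the venv path, port and hostnames are placeholders, adapt them to your setup):

#!/bin/bash
#SBATCH --job-name=jupyter
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=08:00:00

# start a Jupyter server on whatever node the job lands on
source ~/venvs/dl/bin/activate
jupyter lab --no-browser --ip=0.0.0.0 --port=8888

Once the job is running, squeue tells you which node it landed on, then from a laptop you open a tunnel like "ssh -N -L 8888:gpu-node-01:8888 you@head-node" and point the VSCode Jupyter extension at http://localhost:8888.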

FX6300 @ 4.2GHz | Gigabyte GA-78LMT-USB3 R2 | Hyper 212x | 3x 8GB + 1x 4GB @ 1600MHz | Gigabyte 2060 Super | Corsair CX650M | LG 43UK6520PSA
ASUS X550LN | i5 4210u | 12GB
Lenovo N23 Yoga


Have a look at Nvidia Base Command Manager; it'll do most of the hard work for you.

https://www.nvidia.com/en-us/data-center/base-command/manager/

But as @igormp has pointed out, it's mostly about your processes, your workflows and how you do things, more than just finding a job management system like SLURM or setting up a cluster.


Hi guys,

@BoomerDutch Indeed, TrueNAS Scale fits our needs exactly. I'll have a look at the options you mentioned, thank you for your help!

@igormp At the moment we are 4 data scientists and we only use the servers for training purposes. Each of us has a gaming laptop with a decent GPU (e.g. an RTX 4070), so inference and Jupyter notebook stuff can be done on our own machines. But indeed, thanks for pointing out that it's not going to be as straightforward to run code on the servers. Thanks for your help!

@leadeater I'll have a look at that, thank you!

