Re: [slurm-users] Need help with running multiple instances/executions of a batch script in parallel (with NVIDIA HGX A100 GPU as a Gres)

Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) Thu, 18 Jan 2024 05:31:20 -0800

On Jan 18, 2024, at 7:31 AM, Matthias Loose <m.lo...@mindcode.de> wrote:


> Hi Hafedh,
>
> I'm no expert on the GPU side of SLURM, but looking at your current
> configuration, it seems to me to be working as intended at the moment. You
> have defined 4 GPUs and start multiple jobs, each consuming 4 GPUs. So the
> jobs wait for the resource to be free again.
>
> I think what you need to look into is the MPS plugin, which seems to do what
> you are trying to achieve:
> https://slurm.schedmd.com/gres.html#MPS_Management

I agree with the first paragraph.  How many GPUs are you expecting each job to 
use? I'd have assumed, based on the original text, that each job is supposed to 
use 1 GPU, and the 4 jobs were supposed to be running side-by-side on the one 
node you have (with 4 GPUs).  If so, you need to tell each job to request only 
1 GPU, and currently each one is requesting 4.
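
To make the counting explicit, here is a minimal sketch of the arithmetic. It is a simplification, not how Slurm actually schedules: it looks only at GPUs and ignores CPU, memory, and any other limits, and the function name is made up for illustration.

```python
def concurrent_jobs(gpus_on_node: int, gpus_per_job: int) -> int:
    """Number of jobs that can hold their GPUs at the same time on one node.

    Hypothetical helper: considers GPUs only, nothing else the scheduler checks.
    """
    if gpus_per_job <= 0:
        raise ValueError("gpus_per_job must be positive")
    return gpus_on_node // gpus_per_job


# Each job requests all 4 GPUs: only one job runs, the others wait.
print(concurrent_jobs(4, 4))  # prints 1

# Each job requests 1 GPU: four jobs can run side by side.
print(concurrent_jobs(4, 1))  # prints 4
```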

If your jobs are actually supposed to use 4 GPUs each, I still don't see any
advantage to MPS (at least for my usual GPU usage pattern): all the jobs will
take longer to finish, because they are sharing a fixed resource. If they take
turns, at least the first ones finish as fast as they can, and the last one
will finish no later than it would have if they were all time-sharing the
GPUs.  NVIDIA presumably had something in mind when they developed MPS, so my
usage pattern may not be typical (or at least not universal), and in that case
the MPS plugin may well be what you need.
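
As a rough illustration of the take-turns versus time-sharing argument, here is a small sketch. It rests on assumptions that are mine, not measurements: all jobs are identical, each one would fully saturate the GPUs if run alone, and sharing is ideal with no overhead. If a single job leaves the GPUs partly idle, this model does not apply and sharing could come out ahead. The function names are hypothetical.

```python
def take_turns_finish_times(n_jobs: int, runtime_hours: int) -> list[int]:
    """Finish time of each job when jobs run one after another."""
    return [runtime_hours * (i + 1) for i in range(n_jobs)]


def time_shared_finish_times(n_jobs: int, runtime_hours: int) -> list[int]:
    """Finish time of each job when all jobs share the GPUs equally.

    Assumes ideal sharing: each job gets 1/n_jobs of the throughput,
    so every job takes n_jobs times longer and all finish together.
    """
    return [runtime_hours * n_jobs] * n_jobs


sequential = take_turns_finish_times(4, 1)
shared = time_shared_finish_times(4, 1)

print(sequential)  # prints [1, 2, 3, 4]
print(shared)      # prints [4, 4, 4, 4]

# The last job finishes at the same time either way,
# but the average finish time is lower when jobs take turns.
print(sum(sequential) / len(sequential))  # prints 2.5
print(sum(shared) / len(shared))          # prints 4.0
```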
