Hello all,

I'd like to ask for tips on GPU resource sharing with Slurm. I have multiple
GPUs in my cluster and multiple users who submit jobs as Slurm batch jobs.
However, GPU utilization depends on what each job is doing and is quite
uneven, so some jobs barely use the GPU for much of their runtime. In such
cases I'd like to let jobs share a GPU, e.g. by assigning 0.5 of a GPU
(where 1 means the job uses a whole GPU, as with --gres=gpu:1).
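Concretely, today each batch job requests a whole GPU, roughly like this
(a minimal sketch; the program name is just a placeholder):

    #!/bin/bash
    #SBATCH --gres=gpu:1
    srun ./my_gpu_program    # often leaves the allocated GPU mostly idle

What I'd like instead is for two such jobs from different users to be packed
onto the same physical GPU, each getting roughly half of it, which is what
led me to try MPS as described below.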

Before asking here, I tried Slurm's MPS support
(https://slurm.schedmd.com/gres.html#MPS_Management), which says "the same
GPU can be allocated as MPS generic resources to multiple jobs belonging to
multiple users". However, it doesn't work as I expected. At first glance
Slurm seems to behave as designed: I added the mps configuration to Slurm,
enabled the cons_tres plugin, and jobs requesting fewer mps shares than the
mps Count in gres.conf do get scheduled together onto the same node.
However, the MPS server on that node does not share the GPU when *multiple
users* submit the jobs. In that case one user's job appears to wait for the
GPU until the other user's job holding it finishes, exactly as with
--gres=gpu:1. Moreover, the NVIDIA documentation seems to describe what I'm
hitting (https://docs.nvidia.com/deploy/mps/index.html#topic_4_3): an
mps-server instance is created per user and runs exclusively, so I have
doubts about this direction...
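For reference, my configuration roughly follows the MPS section of the gres
documentation; here is a minimal sketch (the node name, device path, and
counts are placeholders and may differ from my actual setup):

    # slurm.conf
    SelectType=select/cons_tres
    GresTypes=gpu,mps
    NodeName=gpu-node01 Gres=gpu:1,mps:100 ...

    # gres.conf on gpu-node01
    Name=gpu File=/dev/nvidia0
    Name=mps Count=100 File=/dev/nvidia0

    # submission from each user, requesting half of the GPU's MPS shares
    $ sbatch --gres=mps:50 job.sh

With this, two such jobs do get co-scheduled onto the same node and GPU, but
as described above they still appear to run one after the other when they
come from different users.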

That is where I stand for now, but I'm not sure whether this is expected
behavior. I'd like to hear your opinions, because I may be missing
something, or there may be another way to share GPU resources besides MPS.

Has anyone hit the same issue? Any pointers would be appreciated.

Thanks,

--------------------------------------------
露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki...@hco.ntt.co.jp
NTT Software Innovation Center
+81-422-59-2837
---------------------------------------------




