Hi all,

We run a GPU cluster and occasionally hit the following issue. Assume four GPUs per node: when a user requests one GPU on such a node together with all of the cores, or all of the RAM, the other three GPUs are wasted for the duration of the job, because Slurm has no cores or RAM left to allocate alongside those GPUs for subsequent jobs.


We have a "soft" workaround, but it's not ideal: we assigned a large TRESBillingWeights weight to CPU consumption, which discourages users from allocating many CPUs.
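For reference, that workaround lives in the partition definition in slurm.conf and looks roughly like the line below (the partition name, node list and weight values here are only illustrative, not our exact settings):

  PartitionName=gpu Nodes=gpu[01-08] TRESBillingWeights="CPU=8.0,Mem=0.25G,GRES/gpu=1.0"

With the CPU weight set high relative to the GPU weight, CPU-heavy requests are billed heavily, which nudges users towards smaller CPU counts, but nothing actually prevents a job from taking the whole node.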


Ideally, we would be able to define, per GPU, a number of CPUs that must always remain available on the node. A similar feature for an amount of RAM would also help.


Take for example a node that has:

* four GPUs

* 16 CPUs


Let's assume that most jobs would work just fine with a minimum of 2 CPUs per GPU. Then we could set a variable in the node definition such as

  CpusReservedPerGpu = 2

The first job to run on this node could get between 2 and 10 CPUs, leaving 6 CPUs available for potential incoming jobs (2 per remaining GPU).
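In slurm.conf terms, we imagine a node definition along these lines (the node name is just an example, and CpusReservedPerGpu is of course hypothetical, as would be a MemReservedPerGpu counterpart for RAM):

  NodeName=gpu01 CPUs=16 Gres=gpu:4 CpusReservedPerGpu=2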


We couldn't find a way to do this; are we missing something? We'd rather not modify the source code again :/

Regards,

Relu

