Hello, we have a problem on a DGX where the four A100s are split into MIG (Multi-Instance GPU) instances of different sizes.
We use Slurm to allocate jobs to partitions that group the MIG instances by size:

- prod10: 10 x 1g.10gb
- prod20: 4 x 2g.20gb
- prod40: 1 x 3g.40gb
- prod80: 1 x A100 80GB

The problem we encounter is, for example:

1. A first job runs on prod40.
2. A second job is pending for a slot on prod40, since no more 3g.40gb MIG instances are available (reason: Resources).
3. A third job waits to run on prod10, even though all ten 1g.10gb MIG instances are available (reason: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions).

We don't understand why Slurm doesn't allocate a 1g.10gb MIG instance to the third job, which in our view should not have to wait. When no second job is pending, jobs can use prod10 without waiting.

Can anyone help us solve this problem? The slurm.conf and gres.conf files are attached; sketches of the relevant parts follow below.
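Since the attached files are not inlined here, this is roughly what the relevant parts look like. It is only a sketch: the node name "dgx01", the GRES type strings, and the option values are placeholders, and the authoritative definitions are in the attached dgx_slurm.conf and dgx_gres.conf.

    # slurm.conf (excerpt) -- one node carrying all the MIG instances,
    # one partition per MIG size; names and options are placeholders.
    NodeName=dgx01 Gres=gpu:1g.10gb:10,gpu:2g.20gb:4,gpu:3g.40gb:1,gpu:a100_80gb:1
    PartitionName=prod10 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod20 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod40 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod80 Nodes=dgx01 State=UP MaxTime=INFINITE

    # gres.conf (excerpt) -- with MIG, letting Slurm discover the
    # instances through NVML avoids listing each device file by hand.
    AutoDetect=nvml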
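Since the third job's reason mentions "higher priority partitions" and all four partitions share the same node, we wonder whether different PriorityTier values are involved: as far as we understand, a pending job in a partition with a higher PriorityTier can reserve the node and block jobs in lower-tier partitions. For reference, these standard commands show the relevant state:

    # PriorityTier of each partition (all four overlap on the same node):
    scontrol show partition | grep -E 'PartitionName|PriorityTier'

    # Partition, state, and pending reason of each queued job:
    squeue --states=PD -o '%.8i %.10P %.8T %R'

    # Node availability as seen by each partition:
    sinfo -p prod10,prod20,prod40,prod80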
Attachments: dgx_gres.conf, dgx_slurm.conf

Have a nice day,
Tristan Gillard