Hello, we have a problem on a DGX where the four A100s are split into MIG (Multi-Instance GPU) instances of different sizes.
We use Slurm to allocate jobs to partitions that group the MIG instances by size:

- prod10: 10 x 1g.10gb
- prod20: 4 x 2g.20gb
- prod40: 1 x 3g.40gb
- prod80: 1 x A100 80GB

The problem we encounter is, for example:

1. A first job runs on prod40.
2. A second job is pending for a slot on prod40, since no more 3g.40gb MIG instances are available (reason: Resources).
3. A third job waits to run on prod10, even though all ten 1g.10gb MIG instances are available (reason: Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions).

We don't understand why Slurm doesn't allocate a 1g.10gb MIG instance to the third job, which in our view should not have to wait. When no second job is pending, jobs can use prod10 without waiting.

Can anyone help us solve this problem? The slurm.conf and gres.conf files are attached; sketches of the relevant parts follow below.
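Since the attached files are not inlined here, this is roughly what the relevant parts look like. It is only a sketch: the node name "dgx01", the GRES type strings, and the option values are placeholders, and the authoritative definitions are in the attached dgx_slurm.conf and dgx_gres.conf.

    # slurm.conf (excerpt) -- one node carrying all the MIG instances,
    # one partition per MIG size; names and options are placeholders.
    NodeName=dgx01 Gres=gpu:1g.10gb:10,gpu:2g.20gb:4,gpu:3g.40gb:1,gpu:a100_80gb:1
    PartitionName=prod10 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod20 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod40 Nodes=dgx01 State=UP MaxTime=INFINITE
    PartitionName=prod80 Nodes=dgx01 State=UP MaxTime=INFINITE

    # gres.conf (excerpt) -- with MIG, letting Slurm discover the
    # instances through NVML avoids listing each device file by hand.
    AutoDetect=nvml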
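Since the third job's reason mentions "higher priority partitions" and all four partitions share the same node, we wonder whether different PriorityTier values are involved: as far as we understand, a pending job in a partition with a higher PriorityTier can reserve the node and block jobs in lower-tier partitions. For reference, these standard commands show the relevant state:

    # PriorityTier of each partition (all four overlap on the same node):
    scontrol show partition | grep -E 'PartitionName|PriorityTier'

    # Partition, state, and pending reason of each queued job:
    squeue --states=PD -o '%.8i %.10P %.8T %R'

    # Node availability as seen by each partition:
    sinfo -p prod10,prod20,prod40,prod80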
Attachments: dgx_gres.conf, dgx_slurm.conf

Have a nice day,
Tristan Gillard