Hi, we are struggling with a Slurm 18.08.5 installation of ours. We are in a situation where our GPU nodes have a considerable number of cores but "only" 2 GPUs each. While people run jobs that use the GPUs, non-GPU jobs can still enter the node just fine. However, we found out the hard way that the inverse is not true.
For example, let's say I have a 4-core GPU node called gpu1. A non-GPU job submitted with

$ sbatch --wrap="sleep 10 && hostname" -c 3

comes in and starts running on gpu1. We then observed that the job produced by the following command, targeting the same node,

$ sbatch --wrap="hostname" -c 1 --gres=gpu:1 -w gpu1

will sit and wait for available resources until the non-GPU job has finished. This is not something we want.
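Put together as a small reproduction, the steps look like this (the explicit -w gpu1 on the first job is only there to pin it to the node for the test; otherwise these are the commands from above):

# 1) Non-GPU job taking 3 of the 4 cores on gpu1.
$ sbatch -w gpu1 -c 3 --wrap="sleep 10 && hostname"

# 2) GPU job needing 1 core and 1 GPU; we would expect it to start right away
#    on the remaining core, but it stays pending until job 1 has finished.
$ sbatch -w gpu1 -c 1 --gres=gpu:1 --wrap="hostname"

# Watch the second job remain in PENDING state while the first one runs.
$ squeue -w gpu1 -o "%.10i %.10T %.4C %.12b %R"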
The sample gres.conf and slurm.conf from a Docker-based Slurm cluster where I was able to reproduce the issue are available here:

https://raw.githubusercontent.com/psteinb/docker-centos7-slurm/18.08.5-with-gres/slurm.conf
https://raw.githubusercontent.com/psteinb/docker-centos7-slurm/18.08.5-with-gres/gres.conf

We are not sure how to handle this situation, as we would like both jobs to enter the GPU node and run at the same time to maximize the utility of our hardware to our users.
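In case the links ever go stale: the parts of the configuration that seem relevant are roughly of the following shape (illustrative values only, not a verbatim copy of the linked files):

# slurm.conf (excerpt, illustrative)
SelectType=select/cons_res
SelectTypeParameters=CR_Core
GresTypes=gpu
NodeName=gpu1 CPUs=4 Gres=gpu:2 State=UNKNOWN
PartitionName=gpu Nodes=gpu1 Default=YES MaxTime=INFINITE State=UP

# gres.conf (excerpt, illustrative)
NodeName=gpu1 Name=gpu File=/dev/nvidia[0-1]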
Any hints or ideas are highly appreciated. Thanks for your help,
Peter