I have tested deviceQuery in the sbatch job again and it works now:
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 137 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 98 / 0
Device PCI Domain ID / Bus ID / location ID: 0 /
Jobs end up on the same GPU. If I run the CUDA deviceQuery in the sbatch job I get:
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID: 0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID: 0
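For reference, a minimal CUDA sketch along the lines of what deviceQuery reports (an assumption, not the actual CUDA sample) - it prints CUDA_VISIBLE_DEVICES and the PCI location of every GPU the job can see, so running it from each sbatch job shows whether the jobs really got different physical cards:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Slurm should export this per job/step when --gres=gpu is requested */
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(unset)");

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        /* Same fields deviceQuery prints as Domain ID / Bus ID / location ID */
        printf("Device %d PCI Domain ID / Bus ID / location ID: %d / %d / %d\n",
               dev, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}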
> Hello,
>
> we are running 18.08.6 and have problems with GRES GPU management.
> There is a "gpu" partition with 12 nodes, each with 4 Tesla V100 cards.
> Allocation of the GPUs is working, and GPU management for sbatch/srun
> jobs is working too - CUDA_VISIBLE_DEVICES is correctly set according to
> --gres=gpu
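On the CUDA_VISIBLE_DEVICES point: whichever physical GPU Slurm exposes through that variable is enumerated inside the job as device 0, which is why the PCI bus ID is the only way to tell from inside a job which card it actually got. A small sketch (the value set here is hypothetical, and it only has an effect because the runtime reads the variable at its first initialisation):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Hypothetical value, for illustration only: pretend Slurm handed this
       job the second physical card. It must be set before the first CUDA
       call, because the runtime reads it once at initialisation. */
    setenv("CUDA_VISIBLE_DEVICES", "1", 1);

    /* Inside the job the allocated card is enumerated as device 0 ... */
    char busid[32] = {0};
    cudaError_t err = cudaDeviceGetPCIBusId(busid, (int)sizeof(busid), 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "%s\n", cudaGetErrorString(err));
        return 1;
    }
    /* ... but the PCI bus ID still identifies the physical GPU. */
    printf("device 0 is physical GPU %s\n", busid);
    return 0;
}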