I have tested deviceQuery in the sbatch job again, and it works now:
  Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
  Device PCI Domain ID / Bus ID / location ID:   0 / 137 / 0
  Device PCI Domain ID / Bus ID / location ID:   0 / 98 / 0
  Device PCI Domain ID / Bus ID / location ID:   0 / 138 / 0

And Aaron is right that the cgroup refers to the first allocated GPU as 0: CUDA_VISIBLE_DEVICES is still set to 0 in each job even though the PCI bus IDs differ. So IMHO the documentation at https://slurm.schedmd.com/gres.html is a little bit confusing.
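
In case it is useful to anyone else, this is roughly the check I ran: a minimal sketch along the lines of the deviceQuery sample (the file name gpucheck.cu is just my own), which prints the PCI location of every GPU the job can see:

    // gpucheck.cu - print the PCI location of each GPU visible inside the job
    // build with: nvcc -o gpucheck gpucheck.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // inside a cgroup-constrained job the device index starts at 0 again,
            // but the PCI bus ID still identifies the physical card
            printf("Device %d (%s)  PCI Domain ID / Bus ID / location ID: %d / %d / %d\n",
                   i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
        }
        return 0;
    }

With one GPU per job, each job reports device 0 but a different bus ID, which matches the output above.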

I really don't know where the problem was; when I tried it yesterday, I think it didn't work, or I had just lost my mind due to frustration.
Anyway, the problem is solved.

Thanks, Daniel

On 23.05.2019 10:11, Daniel Vecerka wrote:
Jobs end up on the same GPU. If I run the CUDA deviceQuery sample in the sbatch job, I get:

Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0
Device PCI Domain ID / Bus ID / location ID:   0 / 97 / 0

Our cgroup.conf:

/etc/slurm/cgroup.conf
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes


Daniel

On 23.05.2019 9:54, Aaron Jackson wrote:
Do jobs actually end up on the same GPU though? cgroups will always
refer to the first allocated GPU as 0, so it is not unexpected for each
job to have CUDA_VISIBLE_DEVICES set to 0. Make sure you have the following
in /etc/cgroup.conf

    ConstrainDevices=yes

Aaron