Hi, I'm using Slurm with GRES (4 GPUs). I want jobs to be allocated uniformly across the GRES (the GPUs in particular), but this does not work when I use Docker. For example, if I run the command below four times in four different ttys, I get exactly what I want; as you can see, all the Bus-Ids are different.
#1

```
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan  2 01:02:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

#2

```
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan  2 01:02:39 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

#3

```
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan  2 00:36:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

#4

```
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan  2 01:03:50 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   29C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

The scontrol show command also looks fine; all of the GRES_IDX values are different:

```
$ scontrol show job=472 --details
JobId=472 JobName=bash
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901759 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:29:12 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-01-02T00:35:37 EligibleTime=2019-01-02T00:35:37
   StartTime=2019-01-02T00:35:37 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=all AllocNode:Sid=...:30423
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=...
   BatchHost=...
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=20G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=... CPU_IDs=0-7 Mem=20480 GRES_IDX=gpu(IDX:0)
   MinCPUsNode=8 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/etc/slurm
   Power=
   GresEnforceBind=Yes

$ scontrol show job=473 --details
JobId=473 JobName=bash
   UserId=root(0) GroupId=root(0) MCS_label=N/A
   Priority=4294901758 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:30:10 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-01-02T00:36:14 EligibleTime=2019-01-02T00:36:14
   StartTime=2019-01-02T00:36:14 EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=all AllocNode:Sid=...:31738
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=...
   BatchHost=...
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=20G,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   Nodes=... CPU_IDs=8-15 Mem=20480 GRES_IDX=gpu(IDX:1)
   MinCPUsNode=8 MinMemoryNode=20G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=bash
   WorkDir=/root
   Power=
   GresEnforceBind=Yes

...
```
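For reference, the assignment can also be checked from inside each srun shell via the environment Slurm sets up. This is just a quick sanity check I use, not part of the output above; it assumes the gres/gpu plugin exports CUDA_VISIBLE_DEVICES as usual:

```
# Quick check inside one of the srun shells above (illustrative only):
# show which device index Slurm handed to this job and which PCI bus
# the visible device maps to.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
```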
But here is the problem: when I use Docker, the Slurm GRES binding is not applied.

```
$ srun --gres=gpu:1 --gres-flags=enforce-binding --cpus-per-task=8 --mem=20G --pty bash
$ nvidia-smi
Wed Jan  2 01:02:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Wed Jan  2 01:10:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:14:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   32C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   30C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   28C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
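The only workaround I can think of is to forward the devices Slurm allocated into the container manually, roughly like the sketch below (untested; it assumes the nvidia runtime honors NVIDIA_VISIBLE_DEVICES and that the indices from CUDA_VISIBLE_DEVICES can be passed through as-is). But I would rather understand why the binding is not applied to the container in the first place.

```
# Hypothetical wrapper, not something from my current setup: restrict the
# container to the GPUs Slurm allocated to this job instead of letting the
# nvidia runtime expose all four devices.
docker run --runtime=nvidia --rm \
    -e NVIDIA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES}" \
    nvidia/cuda nvidia-smi
```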
Here are my configs (slurm.conf and gres.conf).

slurm.conf:

```
ControlMachine=...
ControlAddr=...
MailProg=/bin/mail
MpiDefault=none
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurmd
SlurmdUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
AuthType=auth/munge
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
GresTypes=gpu
AccountingStorageType=accounting_storage/filetxt
JobCompType=jobcomp/filetxt
JobAcctGatherType=jobacct_gather/cgroup
ClusterName=...
SlurmctldDebug=7
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=7
SlurmdLogFile=/var/log/slurmd.log
# COMPUTE NODES
NodeName=... NodeHostName=... Gres=gpu:4 CPUs=32 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=128432 State=UNKNOWN
PartitionName=all Nodes=... Default=YES MaxTime=INFINITE State=UP
```

gres.conf:

```
Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-7
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=8-15
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=16-23
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=24-31
```

What is the problem?
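(For completeness: my understanding is that the single-GPU view in the plain srun shells comes from the cgroup device controller, i.e. a cgroup.conf along the lines of the snippet below. This is not copied from my node, just an assumption to illustrate where the isolation comes from.)

```
# Assumed cgroup.conf (illustrative only): ConstrainDevices is what hides
# the non-allocated GPUs from processes launched directly inside the job.
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```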