Hi Maik,

On Fri, Nov 3, 2017 at 2:14 AM, Maik Schmidt <[email protected]> wrote:
> It is my understanding that when ConstrainDevices is not set to "yes", SLURM
> uses the so called "Minor Number" (nvidia-smi -q | grep Minor) that is the
> number in the device name (/dev/nvidia0 -> ID 0 and so on) and puts it in
> the environment variable.

Not exactly. When ConstrainDevices is enabled, Slurm creates a cgroup for the
job in which only the GPUs allocated to that job are visible. On a 4-GPU
server, if you submit a job with "--gres gpu:1" and ConstrainDevices is
enabled and correctly configured, "nvidia-smi -L" will list only one GPU in
that job's context.

By default, CUDA (the NVML, actually) numbers all the GPUs it has access to
starting from 0. So in our previous job, the id assigned to that GPU by the
NVML will be 0. If, while that job is still running, you submit another 1-GPU
job, then in the context of that second job the GPU id will *also* be 0,
since that is the only GPU the second job can see. You can verify that the
physical GPUs assigned to each job are indeed different by looking at their
serial numbers, PCI addresses or UUIDs.
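If you want to check that from inside the jobs themselves, a minimal sketch
along these lines (plain CUDA runtime calls only; the file name and compile
command are just examples) prints, for each GPU the job can see, its CUDA
ordinal next to its physical PCI bus id:

/* gpu_ids.cu -- sketch: list the GPUs visible in the current context,
 * with their CUDA ordinal and physical PCI bus id.
 * Build with e.g.: nvcc gpu_ids.cu -o gpu_ids */
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        char pci[32] = {0};
        cudaGetDeviceProperties(&prop, i);          /* device name, etc. */
        cudaDeviceGetPCIBusId(pci, sizeof(pci), i); /* physical identity */
        printf("CUDA device %d: %s (PCI %s)\n", i, prop.name, pci);
    }
    return 0;
}

Run it in two concurrent 1-GPU jobs with ConstrainDevices enabled: both
should report a single "CUDA device 0", but with different PCI bus ids.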
This "relative" numbering scheme (as opposed to the absolute numbering scheme
the kernel uses for CPUs, for instance) is a long-debated, historical CUDA
idiosyncrasy. I don't think it makes a lot of sense on modern-day multi-GPU
systems, but that's how it is. Some argue that it simplifies the life of the
developer, who can always assume that there will be a GPU 0 in the
environment. But it most often leads to horrible assumptions in application
code...

> This, however, does not necessarily match the
> device index in neither nvml nor CUDA API, nor does it correlate with the
> device IDs in CUDA_VISIBLE_DEVICES.
>
> By default, CUDA uses a heuristic called FASTEST_FIRST to determine the
> order with respect to CUDA_VISIBLE_DEVICES, making the fastest GPU device 0
> but leaving the rest of the devices unspecified (see [1]). This behaviour
> can be overridden by also setting CUDA_DEVICE_ORDER=PCI_BUS_ID, but even
> then, it is not guaranteed that the order of the devices under /dev match
> the order of the PCI bus IDs.

I think it should, since the driver creates the /dev entries using the PCI
order too.

> Long story short, with the IDs that SLURM puts in CUDA_VISIBLE_DEVICES, we
> do not get the right devices selected by CUDA applications which can easily
> be verified with e.g. deviceQuery from the CUDA samples.

I can indeed see that happening if the NVML numbering scheme doesn't match
the device numbers in /dev. Slurm only knows about the /dev/nvidiaX devices,
and that's what it uses to set the value of CUDA_VISIBLE_DEVICES when
ConstrainDevices is not enabled (cf.
https://bugs.schedmd.com/show_bug.cgi?id=1421 for some historical context).

GPU numbering is a giant mess. I think that at some point, NVIDIA should
really fix the way GPUs are numbered. It's actually funny to see that even
the NVIDIA developers are forced to develop workarounds in their own
software: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation

Since this is quite unlikely to happen, one option for better integration
would be for Slurm to query the GPU UUIDs and use them to populate
CUDA_VISIBLE_DEVICES instead of the current integer indexes. You may want to
submit a feature request at https://bugs.schedmd.com if you're interested.

In the meantime, your best option is probably to enable ConstrainDevices to
alleviate the issue, or to set CUDA_DEVICE_ORDER=PCI_BUS_ID.
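For the record, the kind of mapping that would be involved is easy to query
through the NVML itself. The following is only a sketch (not what Slurm does
today): it prints, for each GPU, its NVML index, its minor number (the X in
/dev/nvidiaX) and its UUID, so the three numbering schemes discussed above
can be compared side by side:

/* nvml_map.c -- sketch: print NVML index, /dev minor number and UUID for
 * each GPU, to compare the different numbering schemes.
 * Build with e.g.: gcc nvml_map.c -lnvidia-ml -o nvml_map */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i, minor;
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetCount(&count);
    for (i = 0; i < count; ++i) {
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetMinorNumber(dev, &minor);      /* /dev/nvidia<minor> */
        nvmlDeviceGetUUID(dev, uuid, sizeof(uuid)); /* stable identifier  */
        printf("NVML index %u -> /dev/nvidia%u, UUID %s\n", i, minor, uuid);
    }
    nvmlShutdown();
    return 0;
}

The UUIDs it reports are the same ones "nvidia-smi -L" shows, which is what
makes them attractive as unambiguous values for CUDA_VISIBLE_DEVICES.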
Cheers,
--
Kilian