Hi Maik,

On Fri, Nov 3, 2017 at 2:14 AM, Maik Schmidt <[email protected]> wrote:
> It is my understanding that when ConstrainDevices is not set to "yes", SLURM
> uses the so called "Minor Number" (nvidia-smi -q | grep Minor) that is the
> number in the device name (/dev/nvidia0 -> ID 0 and so on) and puts it in
> the environment variable.

Not exactly. When ConstrainDevices is enabled, Slurm creates a cgroup for the
job in which only the GPUs allocated to that job are visible. On a 4-GPU
server, if you submit a job with "--gres gpu:1" and ConstrainDevices is
enabled and correctly configured, "nvidia-smi -L" will list only one GPU in
that job's context.

By default, CUDA (the NVML, actually) numbers all the GPUs it has access to
starting from 0. So in our previous job, the id assigned to that GPU by the
NVML will be 0. If, while that job is still running, you submit another 1-GPU
job, then in the context of that second job the GPU id will *also* be 0,
since that is the only GPU the second job can see. You can verify that the
physical GPUs assigned to each job are indeed different by looking at their
serial numbers, PCI addresses or UUIDs.
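If you want to check that from inside the jobs themselves, a minimal sketch
along these lines (plain CUDA runtime calls only; the file name and compile
command are just examples) prints, for each GPU the job can see, its CUDA
ordinal next to its physical PCI bus id:

/* gpu_ids.cu -- sketch: list the GPUs visible in the current context,
 * with their CUDA ordinal and physical PCI bus id.
 * Build with e.g.: nvcc gpu_ids.cu -o gpu_ids */
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device visible\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        char pci[32] = {0};
        cudaGetDeviceProperties(&prop, i);          /* device name, etc. */
        cudaDeviceGetPCIBusId(pci, sizeof(pci), i); /* physical identity */
        printf("CUDA device %d: %s (PCI %s)\n", i, prop.name, pci);
    }
    return 0;
}

Run it in two concurrent 1-GPU jobs with ConstrainDevices enabled: both
should report a single "CUDA device 0", but with different PCI bus ids.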
This "relative" numbering scheme (as opposed to the absolute numbering scheme
the kernel uses for CPUs, for instance) is a long-debated, historical CUDA
idiosyncrasy. I don't think it makes a lot of sense on modern-day multi-GPU
systems, but that's how it is. Some argue that it simplifies the life of the
developer, who can always assume that there will be a GPU 0 in the
environment. But it most often leads to horrible assumptions in application
code...

> This, however, does not necessarily match the
> device index in neither nvml nor CUDA API, nor does it correlate with the
> device IDs in CUDA_VISIBLE_DEVICES.
>
> By default, CUDA uses a heuristic called FASTEST_FIRST to determine the
> order with respect to CUDA_VISIBLE_DEVICES, making the fastest GPU device 0
> but leaving the rest of the devices unspecified (see [1]). This behaviour
> can be overridden by also setting CUDA_DEVICE_ORDER=PCI_BUS_ID, but even
> then, it is not guaranteed that the order of the devices under /dev match
> the order of the PCI bus IDs.

I think it should, since the driver creates the /dev entries using the PCI
order too.

> Long story short, with the IDs that SLURM puts in CUDA_VISIBLE_DEVICES, we
> do not get the right devices selected by CUDA applications which can easily
> be verified with e.g. deviceQuery from the CUDA samples.

I can indeed see that happening if the NVML numbering scheme doesn't match
the device numbers in /dev. Slurm only knows about the /dev/nvidiaX devices,
and that's what it uses to set the value of CUDA_VISIBLE_DEVICES when
ConstrainDevices is not enabled (cf.
https://bugs.schedmd.com/show_bug.cgi?id=1421 for some historical context).

GPU numbering is a giant mess. I think that at some point, NVIDIA should
really fix the way GPUs are numbered. It's actually funny to see that even
the NVIDIA developers are forced to develop workarounds in their own
software: https://github.com/NVIDIA/nvidia-docker/wiki/GPU-isolation

Since this is quite unlikely to happen, one option for better integration
would be for Slurm to query the GPU UUIDs and use them to populate
CUDA_VISIBLE_DEVICES instead of the current integer indexes. You may want to
submit a feature request at https://bugs.schedmd.com if you're interested.

In the meantime, your best option is probably to enable ConstrainDevices to
alleviate the issue, or to set CUDA_DEVICE_ORDER=PCI_BUS_ID.
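For the record, the kind of mapping that would be involved is easy to query
through the NVML itself. The following is only a sketch (not what Slurm does
today): it prints, for each GPU, its NVML index, its minor number (the X in
/dev/nvidiaX) and its UUID, so the three numbering schemes discussed above
can be compared side by side:

/* nvml_map.c -- sketch: print NVML index, /dev minor number and UUID for
 * each GPU, to compare the different numbering schemes.
 * Build with e.g.: gcc nvml_map.c -lnvidia-ml -o nvml_map */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int count, i, minor;
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
    nvmlDevice_t dev;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    nvmlDeviceGetCount(&count);
    for (i = 0; i < count; ++i) {
        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetMinorNumber(dev, &minor);      /* /dev/nvidia<minor> */
        nvmlDeviceGetUUID(dev, uuid, sizeof(uuid)); /* stable identifier  */
        printf("NVML index %u -> /dev/nvidia%u, UUID %s\n", i, minor, uuid);
    }
    nvmlShutdown();
    return 0;
}

The UUIDs it reports are the same ones "nvidia-smi -L" shows, which is what
makes them attractive as unambiguous values for CUDA_VISIBLE_DEVICES.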
Cheers,
--
Kilian