This turned out to be the gpu/nvml plugin in the source tree. It didn't get built because the NVIDIA driver (specifically the library libnvidia-ml.so) wasn't installed when I built the code. In config.log I can see configure testing for -lnvidia-ml, and the gpu_nvml plugin is skipped when that library isn't found.
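For anyone hitting the same thing, here's a quick way to check (a sketch only; paths assume you're in the unpacked source/build tree and will differ with your install prefix):

$ grep -i 'nvidia-ml' config.log     # shows whether configure found the library
$ ls lib/slurm/gpu_nvml.so           # only present if the nvml plugin was built

If the library is missing, install the driver package that provides libnvidia-ml.so (the package name varies by distro and driver version; on Debian/Ubuntu it may be something like libnvidia-ml-dev), then rerun configure and rebuild.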
So in order to use Autodetect=nvml in gres.conf, you have to install the NVIDIA driver before building the source code. I wish they would document some of these things. (A minimal gres.conf/slurm.conf sketch is at the bottom of this message.)

On Fri, Feb 7, 2020 at 9:59 AM Stephan Roth <stephan.r...@ee.ethz.ch> wrote:
> gpu_nvml.so links to libnvidia-ml.so:
>
> $ ldd lib/slurm/gpu_nvml.so
> ...
>   libnvidia-ml.so.1 => /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 (0x00007f2d2bac8000)
> ...
>
> When you run configure you'll see something along these lines:
>
> On 07.02.20 17:03, dean.w.schu...@gmail.com wrote:
> > I just checked the .deb package that I build from source and there is
> > nothing in it that has nv or cuda in its name.
> >
> > Are you sure that slurm distributes nvidia binaries?
> >
> > -----Original Message-----
> > From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Stephan Roth
> > Sent: Friday, February 7, 2020 2:23 AM
> > To: slurm-users@lists.schedmd.com
> > Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.conf
> >
> > On 05.02.20 21:06, Dean Schulze wrote:
> > > I need to dynamically configure gpus on my nodes. The gres.conf doc
> > > says to use
> > >
> > > Autodetect=nvml
> >
> > That's all you need in gres.conf, provided you don't configure any
> > Gres=... entries for your nodes in your slurm.conf.
> > If you do, make sure the string matches what NVML discovers, i.e.
> > lowercase and underscores instead of spaces or dashes.
> >
> > The upside of configuring everything is that you will be informed in
> > case the automatically detected GPUs in a node don't match what you
> > configured.
> >
> > > in gres.conf instead of adding configuration details to each gpu in
> > > gres.conf. The docs aren't really clear about this because they show
> > > an example with the details for each gpu:
> > >
> > > AutoDetect=nvml
> > > Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
> > > Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
> > > Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
> > > Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
> > > Name=mps Count=200 File=/dev/nvidia0
> > > Name=mps Count=200 File=/dev/nvidia1
> > > Name=mps Count=100 File=/dev/nvidia2
> > > Name=mps Count=100 File=/dev/nvidia3
> > > Name=bandwidth Type=lustre Count=4G
> > >
> > > First question: If I use Autodetect=nvml, do I also need to specify
> > > File= and Cores= for each gpu in gres.conf? I'm hoping that with
> > > Autodetect=nvml all I need is the Name= and Type= for each gpu.
> > > Otherwise it's not clear what the purpose of setting Autodetect=nvml
> > > would be.
> > >
> > > Second question: I installed the CUDA tools from the binary
> > > cuda_10.2.89_440.33.01_linux.run. When I restart slurmd with
> > > Autodetect=nvml in gres.conf I get this error:
> > >
> > > fatal: We were configured to autodetect nvml functionality, but we
> > > weren't able to find that lib when Slurm was configured.
> > >
> > > Is there something else I need to configure to tell slurmd how to
> > > use nvml?
> >
> > I guess the version of Slurm you're using was linked against a version
> > of NVML which has been overwritten by your installation of Cuda 10.2.
> >
> > If that's the case, there are various ways to solve the problem, but
> > that depends on your reason for installing Cuda 10.2.
> >
> > My recommendation is to use the Cuda version matching your system's
> > slurm package and to install Cuda 10.2 in a non-default location,
> > provided you need to make it available on a cluster node.
> >
> > If people using your cluster ask for Cuda 10.2, they have the option
> > of using a virtual conda environment and installing Cuda 10.2 there.
> >
> > Cheers,
> > Stephan
>
> -------------------------------------------------------------------
> Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
> +41 44 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich
> -------------------------------------------------------------------
> GPG Fingerprint: E2B9 1B4F 4D35 F233 BE12 1BE9 B423 4018 FBC0 EA17
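P.S. For reference, here is a minimal configuration sketch based on Stephan's advice. The node name and GPU count are placeholders (the gp100 type is taken from the docs example above); if you do add an explicit Gres= entry in slurm.conf, the type string has to match what NVML detects, i.e. lowercase with underscores instead of spaces or dashes:

# gres.conf -- autodetection fills in File=, Cores=, etc.
AutoDetect=nvml

# slurm.conf -- optional explicit entry; type must match NVML's detected name
NodeName=node01 Gres=gpu:gp100:2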