Typing error, should be --> **located at /usr/include/nvml.h**
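In case it helps with debugging: that fatal message means the gpu/nvml plugin was not built into this Slurm installation, because configure could not find NVML when Slurm itself was compiled, so gres.conf alone cannot fix it. A rough sketch of how one might check and then rebuild, run on the GPU node (the plugin directory, package layout, and configure options below are assumptions and will differ per install):

# 1) Confirm the NVML header and runtime library are visible on the node.
#    The NVML runtime shipped with the NVIDIA driver is libnvidia-ml.so
#    (there is normally no libnvml.so).
ls -l /usr/include/nvml.h
ldconfig -p | grep -i nvidia-ml

# 2) Check whether the installed Slurm has an NVML GPU plugin at all;
#    the plugin directory below is an assumption, adjust to your prefix.
ls /usr/lib/slurm/ | grep -i gpu     # look for something like gpu_nvml.so

# 3) If only gpu_generic.so is present, rebuild Slurm on a machine where
#    nvml.h can be found, so configure detects it. Check what your version
#    of configure offers for NVML:
./configure --help | grep -i nvml
# then re-run configure with your original options, make, make install,
# and restart slurmd on the node.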
On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:

> Hi community,
> I have set up the configuration files as mentioned in the documentation,
> but the slurmd of the GPU-compute node fails with the error shown in the
> log below.
> After reading the Slurm documentation, it is not entirely clear to me how
> to properly set up GPU autodetection in the gres.conf file, as it does not
> mention whether the nvml detection should be automatic or not.
> I have also read the top Google search results, including
> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
> but that one was about a CUDA installation being overwritten (not my case).
> This is a DGX A100 node that comes with the Nvidia driver installed, and
> nvml is located at /etc/include/nvml.h; I am not sure if there is a
> libnvml.so or similar as well.
> How do I tell Slurm to look at those paths? Any ideas or shared experiences
> are welcome.
> best
>
>
> *slurmd.log (GPU node)*
> [2021-04-14T17:31:42.302] got shutdown request
> [2021-04-14T17:31:42.302] all threads complete
> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading
> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini: unloading GPU Generic plugin
> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
> [2021-04-14T17:31:42.304] Slurmd shutdown completing
> [2021-04-14T17:31:42.321] debug: Log file re-opened
> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded
> *[2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.*
>
>
> *gres.conf (just AutoDetect=nvml)*
> ➜ ~ cat /etc/slurm/gres.conf
> # GRES configuration for native GPUS
> # DGX A100 8x Nvidia A100
> # not working, slurm cannot find nvml
> AutoDetect=nvml
> #Name=gpu File=/dev/nvidia[0-7]
> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>
>
> *slurm.conf*
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back in pool without sys admin acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> #SelectType=select/cons_res
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
> --
> Cristóbal A. Navarro

-- 
Cristóbal A. Navarro
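P.S. Once a Slurm build with NVML support is installed, a quick sanity check that autodetection now works (run on the GPU node after restarting slurmd; the node name is the one from the slurm.conf above):

slurmd -G                                   # print the GRES slurmd detects and exit
scontrol show node nodeGPU01 | grep -i gres

If slurmd -G lists the eight A100s, the commented-out manual Name=gpu lines in gres.conf should not be needed.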