The error message sounds like Slurm wasn't able to find the NVML development package when you built it from source. If you look under your Slurm install location, in lib/slurm, you should have a gpu_nvml.so. Do you?
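If gpu_nvml.so is not there, here is a rough sketch of how to check for NVML on the build host and rebuild (the /opt/slurm prefix and the cuda-nvml-dev package name are just examples; adjust them to your install):

    # Check that the NVML header and library are visible on the build host
    ls /usr/include/nvml.h
    ldconfig -p | grep -i nvidia-ml      # should list libnvidia-ml.so

    # If nvml.h is missing, it ships with the CUDA toolkit (e.g. a
    # cuda-nvml-dev package), not with the driver alone.

    # Reconfigure and rebuild Slurm so the NVML plugin gets built; depending
    # on your Slurm version, configure also accepts --with-nvml=PATH to point
    # at a non-standard location.
    ./configure --prefix=/opt/slurm --with-nvml=/usr
    make && make install

    # The plugin should now exist under the install prefix
    ls /opt/slurm/lib/slurm/gpu_nvml.so

Then restart slurmd on the node; with AutoDetect=nvml already in your gres.conf, the GPUs should be picked up automatically.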
On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro <cristobal.navarr...@gmail.com> wrote:
>
> typing error, should be --> **located at /usr/include/nvml.h**
>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro
> <cristobal.navarr...@gmail.com> wrote:
>>
>> Hi community,
>> I have set up the configuration files as described in the documentation, but
>> slurmd on the GPU compute node fails with the error shown in the log below.
>> After reading the Slurm documentation, it is still not entirely clear to me how
>> to properly set up GPU autodetection in gres.conf, as it does not say whether
>> NVML detection should work automatically or not.
>> I have also read the top Google results, including
>> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html,
>> but that was a case of an overwritten CUDA installation (not my case).
>> This is a DGX A100 node that comes with the NVIDIA driver installed, and nvml is
>> located at /etc/include/nvml.h; I am not sure whether there is a libnvml.so or
>> similar as well.
>> How do I tell Slurm to look at those paths? Any ideas or shared experience are
>> welcome.
>> best
>>
>>
>> slurmd.log (GPU node)
>> [2021-04-14T17:31:42.302] got shutdown request
>> [2021-04-14T17:31:42.302] all threads complete
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
>> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading
>> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini: unloading GPU Generic plugin
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>> [2021-04-14T17:31:42.321] debug: Log file re-opened
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
>>
>>
>> gres.conf (just AutoDetect=nvml)
>> ➜ ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> # not working, slurm cannot find nvml
>> AutoDetect=nvml
>> #Name=gpu File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>>
>> slurm.conf
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>>
>> ## We don't want a node to go back in pool without sys admin acknowledgement
>> ReturnToService=0
>>
>> ## Basic scheduling
>> #SelectType=select/cons_res
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>> SchedulerType=sched/backfill
>>
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>>
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
>> --
>> Cristóbal A. Navarro
>
>
>
> --
> Cristóbal A. Navarro