The error message sounds like configure wasn't able to find the NVML devel
package when you built Slurm from source. If you look in lib/slurm under
wherever you installed Slurm, you should have a gpu_nvml.so. Do you?
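If it isn't there, here is a minimal sketch of how I would check and rebuild,
assuming a /usr install prefix and that nvml.h / libnvidia-ml.so sit in the
usual driver locations (adjust the paths to your actual setup):

    # is the NVML GRES plugin present in the install?
    ls /usr/lib/slurm/gpu_nvml.so

    # can the build host actually see the NVML header and library?
    ls /usr/include/nvml.h
    ldconfig -p | grep nvidia-ml

    # if the plugin is missing, re-run configure and watch its output for nvml;
    # check ./configure --help for the exact option to point it at a
    # non-standard NVML location (e.g. --with-nvml=PATH on recent versions)
    ./configure --prefix=/usr --with-nvml=/usr && make && make install

The rebuilt gpu_nvml.so has to end up in lib/slurm on the GPU node, since that
is where slurmd loads its plugins from.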

On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
<cristobal.navarr...@gmail.com> wrote:
>
> Typo, that should be --> **located at /usr/include/nvml.h**
>
> On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro 
> <cristobal.navarr...@gmail.com> wrote:
>>
>> Hi community,
>> I have set up the configuration files as described in the documentation, but
>> slurmd on the GPU compute node fails with the error shown in the log below.
>> After reading the Slurm documentation, it is still not clear to me how to
>> properly set up GPU autodetection in gres.conf, since it does not say
>> whether the nvml detection should work automatically or not.
>> I have also read the top Google results, including
>> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html,
>> but that was a case of a CUDA installation being overwritten (not my case).
>> This is a DGX A100 node that comes with the Nvidia driver installed, and
>> nvml is located at /etc/include/nvml.h; I am not sure whether there is a
>> libnvml.so or similar as well.
>> How do I tell Slurm to look at those paths? Any ideas or shared experience
>> are welcome.
>> best
>>
>>
>> slurmd.log (GPU node)
>> [2021-04-14T17:31:42.302] got shutdown request
>> [2021-04-14T17:31:42.302] all threads complete
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open 
>> '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of 
>> '(null)'
>> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open 
>> '(null)/tasks' for reading : No such file or directory
>> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of 
>> '(null)'
>> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
>> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini: unloading GPU 
>> Generic plugin
>> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres 
>> shutting down ...
>> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
>> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature 
>> plugin unloaded
>> [2021-04-14T17:31:42.304] Slurmd shutdown completing
>> [2021-04-14T17:31:42.321] debug:  Log file re-opened
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
>> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
>> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8 
>> CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file 
>> /etc/slurm/cgroup.conf
>> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
>> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file 
>> (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
>> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
>> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8 
>> CoresPerSocket:16 ThreadsPerCore:2
>> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
>> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
>> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml 
>> functionality, but we weren't able to find that lib when Slurm was 
>> configured.
>>
>>
>>
>> gres.conf (just AutoDetect=nvml)
>> ➜  ~ cat /etc/slurm/gres.conf
>> # GRES configuration for native GPUS
>> # DGX A100 8x Nvidia A100
>> # not working, slurm cannot find nvml
>> AutoDetect=nvml
>> #Name=gpu File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
>> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
>> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
>> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
>> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
>> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
>> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
>> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
>> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
>>
>>
>> slurm.conf
>> GresTypes=gpu
>> AccountingStorageTRES=gres/gpu
>> DebugFlags=CPU_Bind,gres
>>
>> ## We don't want a node to go back in pool without sys admin acknowledgement
>> ReturnToService=0
>>
>> ## Basic scheduling
>> #SelectType=select/cons_res
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
>> SchedulerType=sched/backfill
>>
>> TaskPlugin=task/cgroup
>> ProctrackType=proctrack/cgroup
>>
>> ## Nodes list
>> ## use native GPUs
>> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 
>> RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
>>
>> ## Partitions list
>> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE 
>> State=UP Nodes=nodeGPU01  Default=YES
>> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE 
>> State=UP Nodes=nodeGPU01
>> --
>> Cristóbal A. Navarro
>
>
>
> --
> Cristóbal A. Navarro
