Hi Cristóbal

Under Debian Stretch/Buster I had to set LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current for configure to find the NVML shared library.
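
In case it's useful, roughly how I invoked it (a sketch from memory; the install prefix below is just an example, adjust to your setup):

    ./configure --prefix=/opt/slurm \
        LDFLAGS=-L/usr/lib/x86_64-linux-gnu/nvidia/current
    make && make install

After rebuilding, lib/slurm under the install prefix should then contain gpu_nvml.so.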

Best,
Stephan

On 15.04.21 19:46, Cristóbal Navarro wrote:
Hi Michael,
Thanks. Indeed, I don't have it; Slurm must not have detected it.
I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu).
Here is some output, including the relevant paths for nvml.
Is it possible to tell the Slurm compilation to check these paths for nvml?
best

*NVML PKG CHECK*
➜  ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
   NVML native dev links, headers

cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
   NVML native dev links, headers

cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
   NVML native dev links, headers

*libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
   NVIDIA Management Library (NVML) development files*

python3-pynvml/focal 7.352.0-3 amd64
   Python3 bindings to the NVIDIA Management Library



*NVML Shared library location*
➜  ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
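
As an extra check, the linker cache should point at the same paths (output not pasted here):
➜  ~ ldconfig -p | grep libnvidia-ml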



*NVML Header*
➜  ~ find /usr | grep nvml
/usr/include/nvml.h




*SLURM LIBS*
➜  ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*  core_spec_none.so*  job_submit_pbs.so*  proctrack_pgid.so*
accounting_storage_none.so*  cred_munge.so*  job_submit_require_timelimit.so*  route_default.so*
accounting_storage_slurmdbd.so*  cred_none.so*  job_submit_throttle.so*  route_topology.so*
acct_gather_energy_ibmaem.so*  ext_sensors_none.so*  launch_slurm.so*  sched_backfill.so*
acct_gather_energy_ipmi.so*  gpu_generic.so*  mcs_account.so*  sched_builtin.so*
acct_gather_energy_none.so*  gres_gpu.so*  mcs_group.so*  sched_hold.so*
acct_gather_energy_pm_counters.so*  gres_mic.so*  mcs_none.so*  select_cons_res.so*
acct_gather_energy_rapl.so*  gres_mps.so*  mcs_user.so*  select_cons_tres.so*
acct_gather_energy_xcc.so*  gres_nic.so*  mpi_none.so*  select_linear.so*
acct_gather_filesystem_lustre.so*  jobacct_gather_cgroup.so*  mpi_pmi2.so*  site_factor_none.so*
acct_gather_filesystem_none.so*  jobacct_gather_linux.so*  mpi_pmix.so@  slurmctld_nonstop.so*
acct_gather_interconnect_none.so*  jobacct_gather_none.so*  mpi_pmix_v2.so*  src/
acct_gather_interconnect_ofed.so*  jobcomp_elasticsearch.so*  node_features_knl_generic.so*  switch_none.so*
acct_gather_profile_hdf5.so*  jobcomp_filetxt.so*  power_none.so*  task_affinity.so*
acct_gather_profile_influxdb.so*  jobcomp_lua.so*  preempt_none.so*  task_cgroup.so*
acct_gather_profile_none.so*  jobcomp_mysql.so*  preempt_partition_prio.so*  task_none.so*
auth_munge.so*  jobcomp_none.so*  preempt_qos.so*  topology_3d_torus.so*
burst_buffer_generic.so*  jobcomp_script.so*  prep_script.so*  topology_hypercube.so*
cli_filter_lua.so*  job_container_cncu.so*  priority_basic.so*  topology_none.so*
cli_filter_none.so*  job_container_none.so*  priority_multifactor.so*  topology_tree.so*
cli_filter_syslog.so*  job_submit_all_partitions.so*  proctrack_cgroup.so*
cli_filter_user_defaults.so*  job_submit_lua.so*  proctrack_linuxproc.so*
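
To make the missing piece explicit, filtering the same directory for an nvml plugin returns nothing:

➜  ~ ls /usr/lib64/slurm/ | grep nvml
(no matches, i.e. no gpu_nvml.so)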

On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:

    The error message sounds like the NVML devel packages couldn't be found
    when you built the Slurm source. If you look where you installed Slurm,
    in lib/slurm you should have a gpu_nvml.so. Do you?

    On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
    <cristobal.navarr...@gmail.com> wrote:
     >
     > typing error, should be --> **located at /usr/include/nvml.h**
     >
     > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro
     <cristobal.navarr...@gmail.com> wrote:
     >>
     >> Hi community,
     >> I have set up the configuration files as described in the
     documentation, but slurmd on the GPU compute node fails with the
     error shown in the log below.
     >> After reading the Slurm documentation, it is not entirely clear
     to me how to properly set up GPU autodetection in the gres.conf
     file, as it does not say whether the nvml detection should be
     automatic or not.
     >> I have also read the top Google results, including
     https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html
     but that was a case of a CUDA installation being overwritten (not my case).
     >> This is a DGX A100 node that comes with the Nvidia driver installed,
     and nvml is located at /etc/include/nvml.h; not sure if there is a
     libnvml.so or similar as well.
     >> How do I tell SLURM to look at those paths? Any ideas or
     experience sharing are welcome.
     >> best
     >>
     >>
     >> slurmd.log (GPU node)
     >> [2021-04-14T17:31:42.302] got shutdown request
     >> [2021-04-14T17:31:42.302] all threads complete
     >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
    open '(null)/tasks' for reading : No such file or directory
     >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
    get pids of '(null)'
     >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to
    open '(null)/tasks' for reading : No such file or directory
     >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to
    get pids of '(null)'
     >> [2021-04-14T17:31:42.304] debug:  gres/gpu: fini: unloading
     >> [2021-04-14T17:31:42.304] debug:  gpu/generic: fini: fini:
    unloading GPU Generic plugin
     >> [2021-04-14T17:31:42.304] select/cons_tres: common_fini:
    select/cons_tres shutting down ...
     >> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so:
    slurmd_exit = 0
     >> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential
    signature plugin unloaded
     >> [2021-04-14T17:31:42.304] Slurmd shutdown completing
     >> [2021-04-14T17:31:42.321] debug:  Log file re-opened
     >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
     >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
     >> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
     >> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
     >> [2021-04-14T17:31:42.446] debug:  CPUs:256 Boards:1 Sockets:8
    CoresPerSocket:16 ThreadsPerCore:2
     >> [2021-04-14T17:31:42.446] debug:  Reading cgroup.conf file
    /etc/slurm/cgroup.conf
     >> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
     >> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml
    file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
     >> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
     >> [2021-04-14T17:31:42.448] debug:  CPUs:256 Boards:1 Sockets:8
    CoresPerSocket:16 ThreadsPerCore:2
     >> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
     >> [2021-04-14T17:31:42.449] debug:  gres/gpu: init: loaded
     >> [2021-04-14T17:31:42.449] fatal: We were configured to
    autodetect nvml functionality, but we weren't able to find that lib
    when Slurm was configured.
     >>
     >>
     >>
     >> gres.conf (just AutoDetect=nvml)
     >> ➜  ~ cat /etc/slurm/gres.conf
     >> # GRES configuration for native GPUS
     >> # DGX A100 8x Nvidia A100
     >> # not working, slurm cannot find nvml
     >> AutoDetect=nvml
     >> #Name=gpu File=/dev/nvidia[0-7]
     >> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
     >> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
     >> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
     >> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
     >> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
     >> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
     >> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
     >> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
     >> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
     >>
     >>
     >> slurm.conf
     >> GresTypes=gpu
     >> AccountingStorageTRES=gres/gpu
     >> DebugFlags=CPU_Bind,gres
     >>
     >> ## We don't want a node to go back in pool without sys admin
    acknowledgement
     >> ReturnToService=0
     >>
     >> ## Basic scheduling
     >> #SelectType=select/cons_res
     >> SelectType=select/cons_tres
     >> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
     >> SchedulerType=sched/backfill
     >>
     >> TaskPlugin=task/cgroup
     >> ProctrackType=proctrack/cgroup
     >>
     >> ## Nodes list
     >> ## use native GPUs
     >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16
    ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8
    Feature=ht,gpu
     >>
     >> ## Partitions list
     >> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8
    MaxTime=INFINITE State=UP Nodes=nodeGPU01  Default=YES
     >> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128
    MaxTime=INFINITE State=UP Nodes=nodeGPU01
     >> --
     >> Cristóbal A. Navarro
     >
     >
     >
     > --
     > Cristóbal A. Navarro



--
Cristóbal A. Navarro


-------------------------------------------------------------------
Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch
+4144 632 30 59  |  ETF D 104  |  Sternwartstrasse 7  | 8092 Zurich
-------------------------------------------------------------------
