Hi Michael,

Thanks. Indeed, I don't have it, so Slurm must not have detected NVML when it was built. I double-checked and NVML is installed (libnvidia-ml-dev on Ubuntu). Here is some output, including the relevant paths for NVML. Is it possible to tell the Slurm build to check these paths for NVML?

best
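In case it helps, this is roughly what I was thinking of trying for the rebuild (just a sketch; I am assuming here that Slurm's configure accepts a --with-nvml=PATH option, that the prefix/libdir below match how the current install was laid out under /usr/lib64/slurm, and the source directory name is made up):

    # Sketch: rebuild Slurm so configure can find nvml.h and libnvidia-ml.
    # --with-nvml is an assumption; if not available, CPPFLAGS/LDFLAGS alone
    # pointing at the paths shown below should have the same effect.
    cd slurm-20.11.x                                  # hypothetical source directory
    ./configure --prefix=/usr --libdir=/usr/lib64 \
                --with-nvml=/usr \
                CPPFLAGS="-I/usr/include" \
                LDFLAGS="-L/usr/lib/x86_64-linux-gnu"
    make -j"$(nproc)" && sudo make install
    # afterwards the NVML GPU plugin should appear in the plugin directory:
    ls /usr/lib64/slurm/ | grep gpu_nvml

If that is the right approach, I would expect gpu_nvml.so to show up next to gpu_generic.so in /usr/lib64/slurm/, which it currently does not (see the listing below).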
*NVML PKG CHECK*
➜ ~ sudo apt search nvml
Sorting... Done
Full Text Search... Done
cuda-nvml-dev-11-0/unknown 11.0.167-1 amd64
  NVML native dev links, headers
cuda-nvml-dev-11-1/unknown,unknown 11.1.74-1 amd64
  NVML native dev links, headers
cuda-nvml-dev-11-2/unknown,unknown 11.2.152-1 amd64
  NVML native dev links, headers
*libnvidia-ml-dev/focal,now 10.1.243-3 amd64 [installed]
  NVIDIA Management Library (NVML) development files*
python3-pynvml/focal 7.352.0-3 amd64
  Python3 bindings to the NVIDIA Management Library

*NVML Shared library location*
➜ ~ find /usr/lib | grep libnvidia-ml
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.102.04
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

*NVML Header*
➜ ~ find /usr | grep nvml
/usr/include/nvml.h

*SLURM LIBS*
➜ ~ ls /usr/lib64/slurm/
accounting_storage_mysql.so*        core_spec_none.so*              job_submit_pbs.so*                 proctrack_pgid.so*
accounting_storage_none.so*         cred_munge.so*                  job_submit_require_timelimit.so*   route_default.so*
accounting_storage_slurmdbd.so*     cred_none.so*                   job_submit_throttle.so*            route_topology.so*
acct_gather_energy_ibmaem.so*       ext_sensors_none.so*            launch_slurm.so*                   sched_backfill.so*
acct_gather_energy_ipmi.so*         gpu_generic.so*                 mcs_account.so*                    sched_builtin.so*
acct_gather_energy_none.so*         gres_gpu.so*                    mcs_group.so*                      sched_hold.so*
acct_gather_energy_pm_counters.so*  gres_mic.so*                    mcs_none.so*                       select_cons_res.so*
acct_gather_energy_rapl.so*         gres_mps.so*                    mcs_user.so*                       select_cons_tres.so*
acct_gather_energy_xcc.so*          gres_nic.so*                    mpi_none.so*                       select_linear.so*
acct_gather_filesystem_lustre.so*   jobacct_gather_cgroup.so*       mpi_pmi2.so*                       site_factor_none.so*
acct_gather_filesystem_none.so*     jobacct_gather_linux.so*        mpi_pmix.so@                       slurmctld_nonstop.so*
acct_gather_interconnect_none.so*   jobacct_gather_none.so*         mpi_pmix_v2.so*                    src/
acct_gather_interconnect_ofed.so*   jobcomp_elasticsearch.so*       node_features_knl_generic.so*      switch_none.so*
acct_gather_profile_hdf5.so*        jobcomp_filetxt.so*             power_none.so*                     task_affinity.so*
acct_gather_profile_influxdb.so*    jobcomp_lua.so*                 preempt_none.so*                   task_cgroup.so*
acct_gather_profile_none.so*        jobcomp_mysql.so*               preempt_partition_prio.so*         task_none.so*
auth_munge.so*                      jobcomp_none.so*                preempt_qos.so*                    topology_3d_torus.so*
burst_buffer_generic.so*            jobcomp_script.so*              prep_script.so*                    topology_hypercube.so*
cli_filter_lua.so*                  job_container_cncu.so*          priority_basic.so*                 topology_none.so*
cli_filter_none.so*                 job_container_none.so*          priority_multifactor.so*           topology_tree.so*
cli_filter_syslog.so*               job_submit_all_partitions.so*   proctrack_cgroup.so*
cli_filter_user_defaults.so*        job_submit_lua.so*              proctrack_linuxproc.so*

On Thu, Apr 15, 2021 at 9:02 AM Michael Di Domenico <mdidomeni...@gmail.com> wrote:

> the error message sounds like when you built the slurm source it
> wasn't able to find the nvml devel packages.  if you look in where you
> installed slurm, in lib/slurm you should have a gpu_nvml.so.  do you?
>
> On Wed, Apr 14, 2021 at 5:53 PM Cristóbal Navarro
> <cristobal.navarr...@gmail.com> wrote:
> >
> > typing error, should be --> **located at /usr/include/nvml.h**
> >
> > On Wed, Apr 14, 2021 at 5:47 PM Cristóbal Navarro
> > <cristobal.navarr...@gmail.com> wrote:
> >>
> >> Hi community,
> >> I have set up the configuration files as mentioned in the
> >> documentation, but the slurmd of the GPU-compute node fails with the
> >> following error shown in the log.
> >> After reading the Slurm documentation, it is not entirely clear to me
> >> how to properly set up GPU autodetection in the gres.conf file, as it
> >> does not say whether the nvml detection should be automatic or not.
> >> I have also read the top Google results, including
> >> https://lists.schedmd.com/pipermail/slurm-users/2020-February/004832.html,
> >> but that was a problem of an overwritten CUDA installation (not my case).
> >> This is a DGX A100 node that comes with the NVIDIA driver installed, and
> >> nvml is located at /etc/include/nvml.h; I am not sure if there is a
> >> libnvml.so or similar as well.
> >> How do I tell Slurm to look at those paths? Any ideas or shared
> >> experience are welcome.
> >> best
> >>
> >>
> >> slurmd.log (GPU node)
> >> [2021-04-14T17:31:42.302] got shutdown request
> >> [2021-04-14T17:31:42.302] all threads complete
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
> >> [2021-04-14T17:31:42.303] debug2: _file_read_uint32s: unable to open '(null)/tasks' for reading : No such file or directory
> >> [2021-04-14T17:31:42.303] debug2: xcgroup_get_pids: unable to get pids of '(null)'
> >> [2021-04-14T17:31:42.304] debug: gres/gpu: fini: unloading
> >> [2021-04-14T17:31:42.304] debug: gpu/generic: fini: fini: unloading GPU Generic plugin
> >> [2021-04-14T17:31:42.304] select/cons_tres: common_fini: select/cons_tres shutting down ...
> >> [2021-04-14T17:31:42.304] debug2: spank: spank_pyxis.so: slurmd_exit = 0
> >> [2021-04-14T17:31:42.304] cred/munge: fini: Munge credential signature plugin unloaded
> >> [2021-04-14T17:31:42.304] Slurmd shutdown completing
> >> [2021-04-14T17:31:42.321] debug: Log file re-opened
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.321] debug2: hwloc_topology_load
> >> [2021-04-14T17:31:42.440] debug2: hwloc_topology_export_xml
> >> [2021-04-14T17:31:42.446] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.446] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.446] debug: Reading cgroup.conf file /etc/slurm/cgroup.conf
> >> [2021-04-14T17:31:42.447] debug2: hwloc_topology_init
> >> [2021-04-14T17:31:42.447] debug2: xcpuinfo_hwloc_topo_load: xml file (/var/spool/slurmd/slurmd/hwloc_topo_whole.xml) found
> >> [2021-04-14T17:31:42.448] Considering each NUMA node as a socket
> >> [2021-04-14T17:31:42.448] debug: CPUs:256 Boards:1 Sockets:8 CoresPerSocket:16 ThreadsPerCore:2
> >> [2021-04-14T17:31:42.449] GRES: Global AutoDetect=nvml(1)
> >> [2021-04-14T17:31:42.449] debug: gres/gpu: init: loaded
> >> [2021-04-14T17:31:42.449] fatal: We were configured to autodetect nvml functionality, but we weren't able to find that lib when Slurm was configured.
> >>
> >>
> >> gres.conf (just AutoDetect=nvml)
> >> ➜ ~ cat /etc/slurm/gres.conf
> >> # GRES configuration for native GPUS
> >> # DGX A100 8x Nvidia A100
> >> # not working, slurm cannot find nvml
> >> AutoDetect=nvml
> >> #Name=gpu File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia[0-7]
> >> #Name=gpu Type=A100 File=/dev/nvidia0 Cores=0-7
> >> #Name=gpu Type=A100 File=/dev/nvidia1 Cores=8-15
> >> #Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-23
> >> #Name=gpu Type=A100 File=/dev/nvidia3 Cores=24-31
> >> #Name=gpu Type=A100 File=/dev/nvidia4 Cores=32-39
> >> #Name=gpu Type=A100 File=/dev/nvidia5 Cores=40-47
> >> #Name=gpu Type=A100 File=/dev/nvidia6 Cores=48-55
> >> #Name=gpu Type=A100 File=/dev/nvidia7 Cores=56-63
> >>
> >>
> >> slurm.conf
> >> GresTypes=gpu
> >> AccountingStorageTRES=gres/gpu
> >> DebugFlags=CPU_Bind,gres
> >>
> >> ## We don't want a node to go back in pool without sys admin acknowledgement
> >> ReturnToService=0
> >>
> >> ## Basic scheduling
> >> #SelectType=select/cons_res
> >> SelectType=select/cons_tres
> >> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> >> SchedulerType=sched/backfill
> >>
> >> TaskPlugin=task/cgroup
> >> ProctrackType=proctrack/cgroup
> >>
> >> ## Nodes list
> >> ## use native GPUs
> >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu
> >>
> >> ## Partitions list
> >> PartitionName=gpu OverSubscribe=FORCE DefCpuPerGPU=8 MaxTime=INFINITE State=UP Nodes=nodeGPU01 Default=YES
> >> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=128 MaxTime=INFINITE State=UP Nodes=nodeGPU01
> >> --
> >> Cristóbal A. Navarro
> >
> > --
> > Cristóbal A. Navarro

--
Cristóbal A. Navarro