You need to download the Slurm source code and recompile it with the 
configure flag --with-nvml.


As an example, this is our current method, which uses rpmbuild to recompile 
and generate the RPMs. Be aware that the compile works only if it finds the 
prerequisites for each requested option on the build host (e.g. to recompile 
with --with-nvml you should do so on a functioning GPU host).
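
Before starting the build, it can help to confirm that the NVML pieces are 
actually visible on the host. The following is only a rough sketch: the 
function name have_nvml_prereqs is made up for illustration, and it assumes 
the build looks for a header called nvml.h and a library called 
libnvidia-ml.so somewhere under the directories you pass in.

```shell
# Sketch only: report whether an NVML header and library are visible
# under the given directories. The file names nvml.h and libnvidia-ml.so
# are assumptions about what the build looks for.
have_nvml_prereqs() {
    # $@: directories to search, e.g. /usr /usr/local/cuda
    header=$(find "$@" -name 'nvml.h' 2>/dev/null | head -n 1)
    library=$(find "$@" -name 'libnvidia-ml.so*' 2>/dev/null | head -n 1)
    [ -n "$header" ] || { echo "nvml.h not found" >&2; return 1; }
    [ -n "$library" ] || { echo "libnvidia-ml.so not found" >&2; return 1; }
    echo "header:  $header"
    echo "library: $library"
}

# Example: have_nvml_prereqs /usr /usr/local/cuda
```

If the function reports a missing file, the host is probably not a suitable 
place to build with NVML support yet.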

========

export VERSION=23.11.5


wget https://download.schedmd.com/slurm/slurm-$VERSION.tar.bz2
# NOTE: the values for the _with_yaml and _with_jwt defines are cut off in
# this copy of the message and are reproduced here as received.
rpmbuild \
  --define="_with_nvml --with-nvml=/usr" \
  --define="_with_pam --with-pam=/usr" \
  --define="_with_pmix --with-pmix=/usr" \
  --define="_with_hdf5 --without-hdf5" \
  --define="_with_ofed --without-ofed" \
  --define="_with_http_parser --with-http-parser=/usr/lib64" \
  --define="_with_yaml  --define="_with_jwt  \
  --define="_with_slurmrestd --with-slurmrestd=1" \
  -ta slurm-$VERSION.tar.bz2 > build.log-$VERSION-$(date +%F) 2>&1
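
Once the build finishes, a quick way to see whether NVML support made it in 
is to look at the build log and at the file list of the resulting RPM. This 
is a sketch under assumptions: the helper names are hypothetical, the plugin 
file name gpu_nvml.so is what we would expect to see but is not confirmed 
here, and the RPM path assumes rpmbuild's default ~/rpmbuild tree.

```shell
# Sketch only: succeed if a package file listing read from stdin includes
# the NVML GPU plugin. The plugin file name gpu_nvml.so is an assumption;
# adjust it if your build names it differently.
lists_nvml_plugin() {
    grep -q '/gpu_nvml\.so$'
}

# Sketch only: print the lines of a build log that mention NVML.
nvml_log_lines() {
    grep -i 'nvml' "$1"
}

# Example (paths assume rpmbuild's default ~/rpmbuild tree):
#   nvml_log_lines build.log-$VERSION-$(date +%F)
#   rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-$VERSION-*.rpm | lists_nvml_plugin \
#       && echo "NVML plugin present"
```

After installing the new RPMs on a GPU node, running slurmd -G should print 
the GRES configuration that slurmd detects and then exit, which is a 
convenient final check.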


These are the packages we make sure are installed on a node before running 
this compile.

    - pkgs:
      - bzip2
      - cuda-nvml-devel-12-2
      - dbus-devel
      - freeipmi
      - freeipmi-devel
      - gcc
      - gtk2-devel
      - hwloc-devel
      - libjwt-devel
      - libssh2-devel
      - libyaml-devel
      - lua-devel
      - make
      - mariadb-devel
      - munge-devel
      - munge-libs
      - ncurses-devel
      - numactl-devel
      - openssl-devel
      - pam-devel
      - perl
      - perl-ExtUtils-MakeMaker
      - readline-devel
      - rpm-build
      - rpmdevtools
      - rrdtool-devel
      - http-parser-devel
      - json-c-devel
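
To check a build node against the list above, something like the following 
could be used. It is a sketch, not part of our actual tooling: missing_pkgs 
and the QUERY variable are names invented for this example, and it assumes an 
RPM-based system where "rpm -q NAME" fails for a package that is not 
installed.

```shell
# Sketch only: print the names of packages that are not installed.
# QUERY is a hypothetical override hook; by default it runs "rpm -q".
QUERY=${QUERY:-rpm -q}

missing_pkgs() {
    for pkg in "$@"; do
        # $QUERY is left unquoted on purpose so "rpm -q" splits into words.
        $QUERY "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Example:
#   missing_pkgs bzip2 cuda-nvml-devel-12-2 gcc make munge-devel rpm-build
```

Any names it prints are packages to install before starting the build.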

From: Shooktija S N via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, April 3, 2024 7:01 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] How to reinstall / reconfigure Slurm?

Hi,

I am setting up Slurm on our lab's 3 node cluster and I have run into a problem 
while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES. There is an 
error at the 'debug' log level in slurmd.log that says that the GPU is 
file-less and is being removed from the final GRES list. This error according 
to some older posts on this forum might be fixed by reinstalling / 
reconfiguring Slurm with the right flag (the '--with-nvml' flag according to 
this<https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY> post).

Line in /var/log/slurmd.log:
[2024-04-03T15:42:02.695] debug:  Removing file-less GPU gpu:rtx4070 from final 
GRES list

Does this error require me to reinstall or reconfigure Slurm? What does 
'reconfigure Slurm' mean?
I'm about as clueless as a caveman with a smartphone when it comes to Slurm 
administration and Linux system administration in general. So, if you could, 
please explain it to me as simply as possible.

slurm.conf without comment lines:
ClusterName=DlabCluster
SlurmctldHost=server1
GresTypes=gpu
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=debug2
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=debug2
SlurmdLogFile=/var/log/slurmd.log
NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 
ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf (only one line):
AutoDetect=nvml

While installing cuda, I know that nvml has been installed because of this line 
in /var/log/cuda-installer.log:
[INFO]: Installing: cuda-nvml-dev

Thanks!

PS: I could've added this as a continuation to this 
post<https://groups.google.com/g/slurm-users/c/p68dkeUoMmA>, but for some 
reason I do not have permission to post to that group, so here I am starting a 
new thread.