Follow-up: I was able to fix my problem by following the advice in this post <https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b>, which says that the GPU GRES can be configured manually (no autodetect) by adding a line like this to gres.conf: 'NodeName=slurmnode Name=gpu File=/dev/nvidia0'
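Adapted to the node names in the slurm.conf quoted below, the gres.conf entry would look something like this (a sketch: the /dev/nvidia0 path assumes the single GPU on each node is enumerated as device 0, and I believe the Type= has to match the type used in the Gres= spec in slurm.conf):

    # gres.conf on every node (same directory as slurm.conf), replacing "AutoDetect=nvml"
    # Type matches the gpu:rtx4070:1 declared in slurm.conf
    NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0

After changing gres.conf, restart slurmd on each node (and slurmctld on the controller), then check that the GPU is registered, e.g.:

    scontrol show node server1 | grep -i gres
    srun --gres=gpu:1 nvidia-smi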
On Wed, Apr 3, 2024 at 4:30 PM Shooktija S N <shooktij...@gmail.com> wrote:
> Hi,
>
> I am setting up Slurm on our lab's 3-node cluster and have run into a
> problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES.
> There is an error at the 'debug' log level in slurmd.log saying that the
> GPU is file-less and is being removed from the final GRES list. According
> to some older posts on this forum, this error might be fixed by
> reinstalling / reconfiguring Slurm with the right flag (the '--with-nvml'
> flag according to this
> <https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY> post).
>
> Line in /var/log/slurmd.log:
> [2024-04-03T15:42:02.695] debug: Removing file-less GPU gpu:rtx4070 from
> final GRES list
>
> Does this error require me to reinstall / reconfigure Slurm? What does
> 'reconfigure Slurm' mean?
> I'm about as clueless as a caveman with a smartphone when it comes to
> Slurm administration and Linux system administration in general. So, if
> you could, please explain it to me as simply as possible.
>
> slurm.conf without comment lines:
> ClusterName=DlabCluster
> SlurmctldHost=server1
> GresTypes=gpu
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=root
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/affinity,task/cgroup
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> SlurmctldDebug=debug2
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=debug2
> SlurmdLogFile=/var/log/slurmd.log
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
> PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> gres.conf (only one line):
> AutoDetect=nvml
>
> While installing CUDA, I know that NVML has been installed because of
> this line in /var/log/cuda-installer.log:
> [INFO]: Installing: cuda-nvml-dev
>
> Thanks!
>
> PS: I could've added this as a continuation to this post
> <https://groups.google.com/g/slurm-users/c/p68dkeUoMmA>, but for some
> reason I do not have permission to post to that group, so here I am
> starting a new thread.
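For completeness, regarding the AutoDetect=nvml line in the quoted message above: my understanding is that autodetection only works if slurmd itself was built with NVML support, i.e. configure found nvml.h and libnvidia-ml.so at build time (the cuda-nvml-dev package provides the headers; the library ships with the NVIDIA driver). A rough way to check on a compute node (the plugin path below is a guess and varies by distro / install prefix):

    # is the NVML GPU plugin present? (plugin dir might be /usr/lib64/slurm,
    # /usr/lib/x86_64-linux-gnu/slurm-wlm, etc.)
    ls /usr/lib/slurm/gpu_nvml.so

    # ask slurmd which GRES it can detect on this node (recent Slurm versions)
    slurmd -G

If the plugin is missing, "reconfiguring Slurm" in those older posts means rebuilding it from source (./configure with the --with-nvml option mentioned in the quoted post, then make and make install) or rebuilding the packages on a machine where the driver and NVML headers are present, so NVML support gets compiled in. The manual File= line above avoids that rebuild entirely.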
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com