Follow-up: I was able to fix my problem by following the advice in this post <https://blog.devops.dev/slurm-complete-guide-a-to-z-concepts-setup-and-trouble-shooting-for-admins-8dc5034ed65b>, which says that the GPU GRES can be configured manually (no autodetect) by adding a line like this to gres.conf: 'NodeName=slurmnode Name=gpu File=/dev/nvidia0'
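Adapted to the node names in the slurm.conf quoted below, the gres.conf entry would look something like this (a sketch: the /dev/nvidia0 path assumes the single GPU on each node is enumerated as device 0, and I believe the Type= has to match the type used in the Gres= spec in slurm.conf):

    # gres.conf on every node (same directory as slurm.conf), replacing "AutoDetect=nvml"
    # Type matches the gpu:rtx4070:1 declared in slurm.conf
    NodeName=server[1-3] Name=gpu Type=rtx4070 File=/dev/nvidia0

After changing gres.conf, restart slurmd on each node (and slurmctld on the controller), then check that the GPU is registered, e.g.:

    scontrol show node server1 | grep -i gres
    srun --gres=gpu:1 nvidia-smi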
On Wed, Apr 3, 2024 at 4:30 PM Shooktija S N <shooktij...@gmail.com> wrote:
> Hi,
>
> I am setting up Slurm on our lab's 3-node cluster and have run into a
> problem while adding GPUs (each node has an NVIDIA 4070 Ti) as a GRES.
> There is an error at the 'debug' log level in slurmd.log saying that the
> GPU is file-less and is being removed from the final GRES list. According
> to some older posts on this forum, this error might be fixed by
> reinstalling / reconfiguring Slurm with the right flag (the '--with-nvml'
> flag according to this
> <https://groups.google.com/g/slurm-users/c/cvGb4JnK8BY> post).
>
> Line in /var/log/slurmd.log:
> [2024-04-03T15:42:02.695] debug: Removing file-less GPU gpu:rtx4070 from
> final GRES list
>
> Does this error require me to reinstall / reconfigure Slurm? What does
> 'reconfigure Slurm' mean?
> I'm about as clueless as a caveman with a smartphone when it comes to
> Slurm administration and Linux system administration in general. So, if
> you could, please explain it to me as simply as possible.
>
> slurm.conf without comment lines:
> ClusterName=DlabCluster
> SlurmctldHost=server1
> GresTypes=gpu
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=root
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/affinity,task/cgroup
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> SlurmctldDebug=debug2
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=debug2
> SlurmdLogFile=/var/log/slurmd.log
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64
> ThreadsPerCore=2 State=UNKNOWN Gres=gpu:rtx4070:1
> PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> gres.conf (only one line):
> AutoDetect=nvml
>
> While installing CUDA, I know that NVML has been installed because of
> this line in /var/log/cuda-installer.log:
> [INFO]: Installing: cuda-nvml-dev
>
> Thanks!
>
> PS: I could've added this as a continuation to this post
> <https://groups.google.com/g/slurm-users/c/p68dkeUoMmA>, but for some
> reason I do not have permission to post to that group, so here I am
> starting a new thread.
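For completeness, regarding the AutoDetect=nvml line in the quoted message above: my understanding is that autodetection only works if slurmd itself was built with NVML support, i.e. configure found nvml.h and libnvidia-ml.so at build time (the cuda-nvml-dev package provides the headers; the library ships with the NVIDIA driver). A rough way to check on a compute node (the plugin path below is a guess and varies by distro / install prefix):

    # is the NVML GPU plugin present? (plugin dir might be /usr/lib64/slurm,
    # /usr/lib/x86_64-linux-gnu/slurm-wlm, etc.)
    ls /usr/lib/slurm/gpu_nvml.so

    # ask slurmd which GRES it can detect on this node (recent Slurm versions)
    slurmd -G

If the plugin is missing, "reconfiguring Slurm" in those older posts means rebuilding it from source (./configure with the --with-nvml option mentioned in the quoted post, then make and make install) or rebuilding the packages on a machine where the driver and NVML headers are present, so NVML support gets compiled in. The manual File= line above avoids that rebuild entirely.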
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com