Assuming you have the CUDA drivers installed correctly (nvidia-smi works, for instance), you should create a gres.conf with just this line:
> AutoDetect=nvml

If that doesn't automagically begin working, you can increase the verbosity of slurmd with

> SlurmdDebug=debug2

It should then print a bunch of logs describing any GPUs that are found. You may need to alter the name from RTX4070TI (which is wordy as is); I'm not sure how lax Slurm's matching engine and the NVML interface are when matching type strings.

Hope that helps,
Reed

> On Apr 2, 2024, at 6:08 AM, Shooktija S N via slurm-users <slurm-users@lists.schedmd.com> wrote:
>
> Hi,
>
> I am trying to set up Slurm (version 22.05) on a 3 node cluster, each node having an NVIDIA GeForce RTX 4070 Ti GPU.
> I tried to follow along with the GRES setup tutorial on the SchedMD website and added the following (Gres=gpu:RTX4070TI:1) to the node configuration in /etc/slurm/slurm.conf:
>
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
>
> I do not have a gres.conf.
> However, I see this line at the debug log level in /var/log/slurmd.log:
>
> [2024-04-02T15:57:19.022] debug: Removing file-less GPU gpu:RTX4070TI from final GRES list
>
> What other configs are necessary for Slurm to work with my GPU?
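As a fallback: that "Removing file-less GPU" message generally means Slurm found the GRES in slurm.conf but has no device file to attach it to. If AutoDetect=nvml doesn't work (it requires slurmd to have been built against the NVML library), a manual gres.conf pointing at the device file should also clear it. A minimal sketch, assuming a single GPU per node at the usual /dev/nvidia0 device node (verify with `ls /dev/nvidia*`):

```
# /etc/slurm/gres.conf on each node -- manual alternative to AutoDetect=nvml
# Assumes one GPU per node at /dev/nvidia0; adjust File= if your device differs
NodeName=server[1-3] Name=gpu Type=RTX4070TI File=/dev/nvidia0
```

After restarting slurmd on each node, `slurmd -G` should print the GRES it detects, and `scontrol show node server1` should report the GPU in its Gres= line.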
>
> More information:
> OS: Proxmox VE 8.1.4
> Kernel: 6.5.13
> CPU: AMD EPYC 7662
> Memory: 128636MiB
>
> /etc/slurm/slurm.conf that's shared by all 3 nodes, without the comment lines:
>
> ClusterName=DlabCluster
> SlurmctldHost=server1
> GresTypes=gpu
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/spool/slurmd
> SlurmUser=root
> StateSaveLocation=/var/spool/slurmctld
> TaskPlugin=task/affinity,task/cgroup
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> SlurmctldDebug=debug
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdDebug=debug
> SlurmdLogFile=/var/log/slurmd.log
> NodeName=server[1-3] RealMemory=128636 Sockets=1 CoresPerSocket=64 ThreadsPerCore=2 State=UNKNOWN Gres=gpu:RTX4070TI:1
> PartitionName=mainPartition Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com