Hi Sushil,

Try changing the NodeName specification to:
NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu:8

Also:

TaskPlugin=task/cgroup

Best,
Steve

On Wed, Apr 6, 2022 at 9:56 AM Sushil Mishra <sushilbioi...@gmail.com> wrote:

> Dear SLURM users,
>
> I am very new to slurm and need some help configuring slurm on a single
> node machine. This machine has 8x Nvidia GPUs and a 96-core CPU. The
> vendor has set up a "LocalQ" but that somehow is running all the
> calculations on GPU 0. If I submit 4 independent jobs at a time, it starts
> running all four calculations on GPU 0. I want slurm to assign a specific
> GPU (setting the CUDA_VISIBLE_DEVICES variable) for each job before it
> starts running, and hold the rest of the jobs in the queue until a GPU
> becomes available.
>
> slurm.conf looks like:
>
> $ cat /etc/slurm-llnl/slurm.conf
> ClusterName=localcluster
> SlurmctldHost=localhost
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> #
> GresTypes=gpu
> #SlurmdDebug=debug2
>
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
> # SCHEDULING
> SchedulerType=sched/backfill
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core
> #
> #AccountingStoragePort=
> AccountingStorageType=accounting_storage/none
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=info
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=info
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> # COMPUTE NODES
> NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
> #NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN
>
> # Partitions list
> PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
> #PartitionName=gpu_short MaxCPUsPerNode=32 DefMemPerNode=65556
> DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00
> State=UP Nodes=localhost Default=YES
>
> and:
> $ cat /etc/slurm-llnl/gres.conf
> # detect GPUs
> AutoDetect=nvlm
> # GPU gres
> NodeName=localhost Name=gpu File=/dev/nvidia0
> NodeName=localhost Name=gpu File=/dev/nvidia1
> NodeName=localhost Name=gpu File=/dev/nvidia2
> NodeName=localhost Name=gpu File=/dev/nvidia3
> NodeName=localhost Name=gpu File=/dev/nvidia4
> NodeName=localhost Name=gpu File=/dev/nvidia5
> NodeName=localhost Name=gpu File=/dev/nvidia6
> NodeName=localhost Name=gpu File=/dev/nvidia7
>
> Best,
> Sushil

-- 
________________________________________________________________
 Steve Cousins             Interim Director/Supercomputer Engineer
 Advanced Computing Group               University of Maine System
 244 Neville Hall (UMS Data Center)                 (207) 581-3574
 Orono ME 04469                         steve.cousins at maine.edu
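
P.S. For task/cgroup to actually hide unallocated GPUs from each job, device
constraint has to be turned on in cgroup.conf as well. A minimal sketch,
assuming the same /etc/slurm-llnl/ config directory as your other files
(check your site's defaults before copying):

$ cat /etc/slurm-llnl/cgroup.conf
# Fence each job into the devices (GPUs) it was allocated
ConstrainDevices=yes
# Optionally confine jobs to their allocated cores and memory as well
ConstrainCores=yes
ConstrainRAMSpace=yes

Restart slurmctld and slurmd after changing slurm.conf, gres.conf, or
cgroup.conf so the new node and gres definitions take effect.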
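Once that is in place, each job should request the GPUs it needs; Slurm then
sets CUDA_VISIBLE_DEVICES for the allocation and holds whatever doesn't fit
in the queue. A sketch of a batch script (the job name, CPU count, and
application line are placeholders, not anything from your setup):

#!/bin/bash
#SBATCH --job-name=gpu_test     # placeholder name
#SBATCH --partition=LocalQ
#SBATCH --gres=gpu:1            # request one of the eight GPUs
#SBATCH --cpus-per-task=12      # e.g. 96 cores / 8 GPUs

# Slurm exports CUDA_VISIBLE_DEVICES for the allocated GPU
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi -L   # with ConstrainDevices=yes, only the allocated GPU is listed
./my_gpu_app    # placeholder for the real application

Submit four of these with sbatch and each should land on its own GPU; a ninth
job will pend until one of the GPUs frees up.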