Hello,

Try commenting out the line:

    AutoDetect=nvml

And then restart "slurmd" and "slurmctld".
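
With the explicit File=/dev/nvidia* lines kept in "gres.conf", I think
the NodeName line in "slurm.conf" should also carry the device count.
Just a sketch based on your config:

    # count should match the number of File= lines in gres.conf
    NodeName=localhost CPUs=96 Gres=gpu:8 State=UNKNOWN

On a systemd installation the restart would typically be:

    # assuming the stock unit names from the Debian/Ubuntu packages
    sudo systemctl restart slurmd
    sudo systemctl restart slurmctld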

Job allocations landing on the same GPU might be an effect of automatic
MPS configuration, though I'm not 100% sure:
https://slurm.schedmd.com/gres.html#MPS_Management
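
To check that each job gets its own device, the job could request one
GPU explicitly; Slurm then sets CUDA_VISIBLE_DEVICES inside the
allocation and queues further jobs until a GPU frees up. A minimal
batch script sketch (the CPU count and program name are placeholders):

    #!/bin/bash
    #SBATCH --partition=LocalQ
    #SBATCH --gres=gpu:1        # one GPU per job
    #SBATCH --cpus-per-task=12  # placeholder, adjust to your jobs
    # Slurm exports CUDA_VISIBLE_DEVICES for the allocated device
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    ./my_gpu_program            # placeholder

Submitting four such jobs should put each one on a different GPU.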

Kind Regards
--
Kamil Wilczek

On 06.04.2022 at 15:53, Sushil Mishra wrote:
Dear SLURM users,

I am very new to Slurm and need some help configuring it on a single-node machine. The machine has 8 Nvidia GPUs and a 96-core CPU. The vendor set up a "LocalQ" partition, but somehow it runs all the calculations on GPU 0. If I submit 4 independent jobs at a time, it starts running all four calculations on GPU 0. I want Slurm to assign a specific GPU to each job (by setting the CUDA_VISIBLE_DEVICES variable) before it starts running, and to hold the rest of the jobs in the queue until a GPU becomes available.

slurm.conf looks like:
$ cat /etc/slurm-llnl/slurm.conf
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
GresTypes=gpu
#SlurmdDebug=debug2

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
#NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN

# Partitions list
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
#PartitionName=gpu_short  MaxCPUsPerNode=32 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00 State=UP Nodes=localhost  Default=YES

and:
$ cat /etc/slurm-llnl/gres.conf
#detect GPUs
AutoDetect=nvlm
# GPU gres
NodeName=localhost Name=gpu File=/dev/nvidia0
NodeName=localhost Name=gpu File=/dev/nvidia1
NodeName=localhost Name=gpu File=/dev/nvidia2
NodeName=localhost Name=gpu File=/dev/nvidia3
NodeName=localhost Name=gpu File=/dev/nvidia4
NodeName=localhost Name=gpu File=/dev/nvidia5
NodeName=localhost Name=gpu File=/dev/nvidia6
NodeName=localhost Name=gpu File=/dev/nvidia7

Best,
Sushil


--
Kamil Wilczek  [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski

ul. Banacha 2
02-097 Warszawa

Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/

