Hello, try to comment out the line:
AutoDetect=nvml

and then restart "slurmd" and "slurmctld". Job allocations to the same GPU might be an effect of automatic MPS configuration, though I'm not 100% sure:
https://slurm.schedmd.com/gres.html#MPS_Management
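For example, a minimal sketch of the change, assuming the file paths from your message and a systemd-managed install (service names can differ):

  # /etc/slurm-llnl/gres.conf with autodetection disabled
  # (note: the file quoted below actually spells it "nvlm")
  #AutoDetect=nvml
  NodeName=localhost Name=gpu File=/dev/nvidia0
  # ... entries for /dev/nvidia1 through /dev/nvidia7 unchanged

  $ sudo systemctl restart slurmd slurmctld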
Kind Regards
--
Kamil Wilczek

On 06.04.2022 at 15:53, Sushil Mishra wrote:

Dear SLURM users,

I am very new to Slurm and need some help configuring it on a single-node machine. This machine has 8x NVIDIA GPUs and a 96-core CPU. The vendor set up a "LocalQ", but somehow it runs all calculations on GPU 0. If I submit 4 independent jobs at a time, all four calculations start running on GPU 0. I want Slurm to assign a specific GPU to each job (setting the CUDA_VISIBLE_DEVICES variable) before it starts running, and to hold the rest of the jobs in the queue until a GPU becomes available.

slurm.conf looks like:

$ cat /etc/slurm-llnl/slurm.conf
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
GresTypes=gpu
#SlurmdDebug=debug2
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
NodeName=localhost CPUs=96 State=UNKNOWN Gres=gpu
#NodeName=mannose NodeAddr=130.74.2.86 CPUs=1 State=UNKNOWN
# Partitions list
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=7-00:00:00 State=UP
#PartitionName=gpu_short MaxCPUsPerNode=32 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=01-00:00:00 State=UP Nodes=localhost Default=YES

and:

$ cat /etc/slurm-llnl/gres.conf
# detect GPUs
AutoDetect=nvlm
# GPU gres
NodeName=localhost Name=gpu File=/dev/nvidia0
NodeName=localhost Name=gpu File=/dev/nvidia1
NodeName=localhost Name=gpu File=/dev/nvidia2
NodeName=localhost Name=gpu File=/dev/nvidia3
NodeName=localhost Name=gpu File=/dev/nvidia4
NodeName=localhost Name=gpu File=/dev/nvidia5
NodeName=localhost Name=gpu File=/dev/nvidia6
NodeName=localhost Name=gpu File=/dev/nvidia7

Best,
Sushil
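PS: for the queueing behaviour you describe, note also that each job has to request a GPU explicitly, otherwise Slurm allocates none and does not set CUDA_VISIBLE_DEVICES. A minimal sketch of such a submission script, assuming GresTypes=gpu is active, the node line advertises a count (e.g. Gres=gpu:8), and "my_calculation" stands in for your actual program:

  $ cat gpu_job.sh
  #!/bin/bash
  #SBATCH --partition=LocalQ
  #SBATCH --gres=gpu:1        # request one GPU; Slurm then sets CUDA_VISIBLE_DEVICES for the job
  #SBATCH --cpus-per-task=8   # illustrative CPU count
  srun ./my_calculation       # hypothetical executable

  $ sbatch gpu_job.sh         # submitted repeatedly, jobs should spread over GPUs
                              # and queue once all 8 are busy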
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
Laboratorium Komputerowe
Wydział Matematyki, Informatyki i Mechaniki
Uniwersytet Warszawski
ul. Banacha 2, 02-097 Warszawa
Tel.: 22 55 44 392
https://www.mimuw.edu.pl/
https://www.uw.edu.pl/