We run Bright 8.1 and Slurm 17.11. We are trying to allow multiple jobs to run concurrently on each node of our small 4-node cluster.
Based on https://community.brightcomputing.com/question/5d6614ba08e8e81e885f1991?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%2526%252334%253Bgang+scheduling%2526%252334%253B and https://slurm.schedmd.com/cons_res_share.html, here are the relevant settings in /etc/slurm/slurm.conf:

SchedulerType=sched/backfill

# Nodes
NodeName=node[001-003] CoresPerSocket=12 RealMemory=191800 Sockets=2 Gres=gpu:1

# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP Nodes=node[001-003]
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

# Generic resources types
GresTypes=gpu,mic

# Epilog/Prolog parameters
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog

# Fast Schedule option
FastSchedule=1

# Power Saving
SuspendTime=-1   # this disables power saving
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweroff
ResumeProgram=/cm/local/apps/cluster-tools/wlm/scripts/slurmpoweron

# END AUTOGENERATED SECTION -- DO NOT REMOVE
# http://kb.brightcomputing.com/faq/index.php?action=artikel&cat=14&id=410&artlang=en&highlight=slurm+%26%2334%3Bgang+scheduling%26%2334%3B
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
EnforcePartLimits=YES

But it appears that each job takes one of the three nodes to itself, and all other jobs are left pending. Do we have an incorrect option set?

squeue -a
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  1937      defq   PaNet5    user1 PD       0:00      1 (Resources)
  1938      defq   PoNet5    user1 PD       0:00      1 (Priority)
  1964      defq   SENet5    user1 PD       0:00      1 (Priority)
  1979      defq   IcNet5    user1 PD       0:00      1 (Priority)
  1980      defq runtrain    user2 PD       0:00      1 (Priority)
  1981      defq   InRes5    user1 PD       0:00      1 (Priority)
  1983      defq run_LSTM    user3 PD       0:00      1 (Priority)
  1984      defq run_hui.    user4 PD       0:00      1 (Priority)
  1936      defq   SeRes5    user1  R   10:02:39      1 node003
  1950      defq sequenti    user5  R 1-02:03:00      1 node001
  1978      defq run_hui.   user16  R   13:48:21      1 node002

Am I misunderstanding some of these settings?
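For what it's worth, here is a minimal way we could check what the controller actually loaded and whether two small jobs will share a node; the sleep length is just a placeholder, and the squeue format string is only there to show node and CPU assignment:

# Confirm the settings slurmctld actually loaded
scontrol show config | grep -Ei 'SelectType|SchedulerTimeSlice'
scontrol show partition defq

# Submit two single-CPU jobs and check whether they land on the same node
sbatch -p defq -n 1 --wrap "sleep 300"
sbatch -p defq -n 1 --wrap "sleep 300"
squeue -o "%.8i %.9P %.2t %.12N %.4C"

If those two single-CPU jobs still end up on different nodes (or one stays pending), that would suggest the jobs themselves are requesting whole nodes rather than a settings problem, but we have not confirmed that.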