[slurm-users] Hi-prio jobs are bypassed by low-prio jobs

2023-05-09 Thread Michał Kadlof

Hi,

A few jobs with higher priority are being passed over by jobs with lower
priority, and I don't understand why.


I noticed that the high-priority jobs require 4 or 8 GPUs on a single node, 
while the bypassing jobs use only 1 GPU, but I'm not sure whether that is 
related. The high-priority jobs have a specific value in the StartTime 
field, but this value is regularly pushed back to a later time. It 
seems that after a 1-GPU job finishes, Slurm immediately schedules 
another 1-GPU job instead of holding the freed GPU until the remaining 3 
or 7 GPUs are released for the higher-priority job. What could be wrong?
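
For reference, this is how I inspect the active scheduler settings and the 
per-job scheduling state (a minimal sketch; the grep patterns are only 
illustrative):

# Show the scheduler type and any backfill tuning (bf_window, bf_continue, ...)
$ scontrol show config | grep -E 'SchedulerType|SchedulerParameters'

# Show why a pending job is held back and which scheduler last evaluated it
$ scontrol show job 649800 | grep -E 'Reason|StartTime|Scheduler'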


The jobs are submitted to the 'long' partition, which uses a QoS named 'long'.

PartitionName=long
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=long
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=10-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=dgx-[1-4],sr-[1-3]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=SUSPEND
   State=UP TotalCPUs=656 TotalNodes=7 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
   TRES=cpu=656,mem=8255731M,node=7,billing=3474,gres/gpu=32,gres/gpu:a100=32
   TRESBillingWeights=CPU=1,Mem=0.062G,GRES/gpu=72.458

The QoS limits (GrpTRES) are:

      Name                    GrpTRES
---------- --------------------------
    normal
      long cpu=450,gres/gpu=28,mem=5T
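
For completeness, a query along these lines should reproduce that table (a 
sketch; the exact format fields are my assumption):

$ sacctmgr show qos format=Name,GrpTRES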


Example of a bypassed job, with sensitive data obscured:

$ scontrol show job 649800
JobId=649800 JobName=train with motif
   UserId=XX() GroupId=XX(X) MCS_label=N/A
   Priority=275000 Nice=0 Account=sfglab QOS=normal
   JobState=PENDING Reason=QOSGrpGRES Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=03:00:00 TimeMin=N/A
   SubmitTime=2023-04-24T12:09:19 EligibleTime=2023-04-24T12:09:19
   AccrueTime=2023-04-24T12:09:19
   StartTime=2023-05-11T06:30:00 EndTime=2023-05-11T09:30:00 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-09T13:16:46 Scheduler=Backfill:*
   Partition=long AllocNode:Sid=0.0.0.0:379113
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:*
   TRES=cpu=16,mem=32G,node=1,billing=16,gres/gpu=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=16 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/X
   WorkDir=/X
   StdErr=/XX
   StdIn=/dev/null
   StdOut=/X
   Power=
   TresPerNode=gres:gpu:8
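
I also notice Reason=QOSGrpGRES above: the job seems to be blocked by the QoS 
GrpTRES cap (gres/gpu=28) rather than by priority alone. If that is the case, 
the slurm.conf parameter below might be related; this is only a sketch of one 
possible setting, not a confirmed fix:

# slurm.conf: when a job is blocked by an association/QOS limit, stop
# initiating lower-priority jobs in that partition rather than scheduling
# around it (merge with any SchedulerParameters already set)
SchedulerParameters=assoc_limit_stop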

--
best regards | pozdrawiam serdecznie
Michał Kadlof
Head of the High Performance Computing Center
Eden^N cluster administrator
Structural and Functional Genomics Laboratory
Faculty of Mathematics and Computer Science
Warsaw University of Technology




[slurm-users] Job scheduling bug?

2023-05-09 Thread Luke Sudbery
We recently upgraded from 20.11.9 to 22.05.8, and since then we appear to have 
a problem with jobs not being scheduled on nodes with free resources.

It is particularly noticeable on one partition that contains only a single GPU 
node. Jobs queuing for this node are currently the highest priority in the 
queue, and the node is idle, but the jobs do not start:


[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %30R %Q"
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)               PRIORITY
          66631657 broadwell PD       0:00      1 (Resources)                    230
          66609948 broadwell PD       0:00      1 (Resources)                    203

[sudberlr-admin@bb-er-slurm01 ~]$ squeue --format "%Q %i" --sort -Q | head -4
PRIORITY JOBID
230 66631657
212 66622378
210 66322847

[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show node bear-pg0212u17b
NodeName=bear-pg0212u17b Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUEfctv=20 CPUTot=20 CPULoad=0.01
   AvailableFeatures=haswell
   ActiveFeatures=haswell
   Gres=gpu:m60:2(S:0-1)
   NodeAddr=bear-pg0212u17b NodeHostName=bear-pg0212u17b Version=22.05.8
   OS=Linux 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
   RealMemory=511000 AllocMem=0 FreeMem=501556 Sockets=2 Boards=1
   MemSpecLimit=501
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=broadwell-gpum60-ondemand,system
   BootTime=2023-04-25T08:24:10 SlurmdStartTime=2023-05-04T11:57:46
   LastBusyTime=2023-05-09T13:27:07
   CfgTRES=cpu=20,mem=511000M,billing=20,gres/gpu=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

[sudberlr-admin@bb-er-slurm01 ~]$
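
For what it's worth, the priority ordering can also be broken down per factor 
(a sketch; sprio's long format lists each priority component):

# Show the priority factors for the two pending jobs
$ sprio -l -j 66631657,66609948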

The resources it requests are easily met by the node:


[sudberlr-admin@bb-er-slurm01 ~]$ scontrol show job 66631657
JobId=66631657 JobName=sys/dashboard/sys/bc_uob_paraview
   UserId=(633299) GroupId=users(100) MCS_label=N/A
   Priority=230 Nice=0 Account= QOS=bbondemand
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=02:00:00 TimeMin=N/A
   SubmitTime=2023-05-09T13:27:31 EligibleTime=2023-05-09T13:27:31
   AccrueTime=2023-05-09T13:27:31
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-05-09T16:02:30 Scheduler=Main
   Partition=broadwell-gpum60-ondemand,cascadelake-hdr-ondemand,cascadelake-hdr-ondemand2 AllocNode:Sid=localhost:1120095
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1-1 NumCPUs=8 NumTasks=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32G,node=1,billing=8,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=4G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/X
   StdErr=/X/output.log
   StdIn=/dev/null
   StdOut=/X/output.log
   Power=
   TresPerNode=gres:gpu:1

[sudberlr-admin@bb-er-slurm01 ~]$

This looks like a bug to me, because it worked fine before the upgrade, and a 
simple restart of the Slurm controller will often allow the jobs to start 
without any other changes:


[sudberlr-admin@bb-er-slurm01 ~]$ squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY
          66631657 broadwell PD       0:00      1 (Resources)                      230
          66609948 broadwell PD       0:00      1 (Resources)                      203

[sudberlr-admin@bb-er-slurm01 ~]$ sudo systemctl restart slurmctld; sleep 30; squeue -p broadwell-gpum60-ondemand --format "%.18i %.9P %.2t %.10M %.6D %32R %Q"
Job for slurmctld.service canceled.
             JOBID PARTITION ST       TIME  NODES NODELIST(REASON)                 PRIORITY
          66631657 broadwell  R       0:04      1 bear-pg0212u17b                  230
          66609948 broadwell  R       0:04      1 bear-pg0212u17b                  203

[sudberlr-admin@bb-er-slurm01 ~]$
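
In case it helps diagnosis, scheduler state can also be captured before 
restarting (a sketch using standard tools; the Backfill debug flag can be 
toggled at run time):

# Scheduler statistics: main and backfill cycle counts, depths and timings
$ sdiag

# Temporarily enable backfill debug logging in slurmctld.log
$ scontrol setdebugflags +Backfill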



Has anyone come across this behaviour or have any other ideas?

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don't work on Monday.