I'm perplexed. My cluster has been churning along, and tonight it has decided to start leaving jobs pending even though there are plenty of nodes available.

An example job from squeue:

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
409978 interacti    verdi amirinen PD       0:00      1 (Resources)
409989   regress update_r  jenkins PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
409985   regress update_r amirinen PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
409982   regress update_r akshabal PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
409994   regress SYN__tpb kumarbck PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
409999 interacti sbatch_w akshabal PD       0:00      1 (Priority)
410000   regress ICC2__tp  gadikon PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410005   regress update_r amirinen PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410003   regress update_r bachchuk PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410006   regress update_r saurahuj PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410009   regress xterm_fi  gadikon PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410010   regress ICC2__tp  gadikon PD       0:00      1 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
410001   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
410002   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
410004   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
410011   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
410014   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
410015   regress ICC2__tp  gadikon PD       0:00      1 (Dependency)
409937 interacti    verdi   nsamra  R    5:51:10      1 c7-c5n-18xl-3
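
The listing above is just squeue's default format; something like the following narrows it to pending jobs and their reasons (the format string is merely my own convenience):

    squeue -t PD -o '%.8i %.10P %.10j %.10u %.2t %r'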

The output of sinfo shows plenty of nodes available for the scheduler.

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all          up   infinite  31954  idle~ al2-t3-2xl-[0-999],al2-t3-l-[0-999],c7-c5-24xl-[0-5,7-10,14,16-17,19,21-46,48-151,153-155,157-164,167,169-485,487-999],c7-c5d-24xl-[0,2-999],c7-c5n-18xl-[0-2,4-14,16-26,28-44,46,48-51,53-54,56-63,65-67,69-72,74-75,77-82,84,86-99,101-999],c7-m5-24xl-[0-325,327-999],c7-m5d-24xl-[0-191,193-999],c7-m5dn-24xl-[0-3,5-97,99-999],c7-m5n-24xl-[0-24,26-999],c7-r5d-16xl-[0-3,5-999],c7-r5d-24xl-[1-16,18-999],c7-r5dn-24xl-[0-1,3-999],c7-t3-2xl-[0-8,10-970,973-999],c7-t3-l-[0-999],c7-x1-32xl-[0-6,8-999],c7-x1e-32xl-[0-999],c7-z1d-12xl-[0,2-5,7,9-10,12-999],rh7-c5-24xl-[0-999],rh7-c5d-24xl-[0-999],rh7-c5n-18xl-[0-999],rh7-m5-24xl-[0-999],rh7-m5d-24xl-[0-999],rh7-m5dn-24xl-[0-999],rh7-m5n-24xl-[0-999],rh7-r5d-16xl-[0-999],rh7-r5d-24xl-[0-999],rh7-r5dn-24xl-[0-999],rh7-t3-2xl-[0-999],rh7-t3-l-[0-999],rh7-x1-32xl-[0-999],rh7-x1e-32xl-[0-999],rh7-z1d-12xl-[0-999]
all          up   infinite      2  drain c7-t3-l-s-0,rh7-t3-l-s-0
all          up   infinite     46    mix c7-c5-24xl-[6,11-13,15,18,20,47,152,156,165-166,168,486],c7-c5d-24xl-1,c7-c5n-18xl-[3,15,27,45,47,52,55,64,68,73,76,83,85,100],c7-m5-24xl-326,c7-m5d-24xl-192,c7-m5dn-24xl-[4,98],c7-m5n-24xl-25,c7-r5d-16xl-4,c7-r5d-24xl-[0,17],c7-r5dn-24xl-2,c7-t3-2xl-[9,971-972],c7-x1-32xl-7,c7-z1d-12xl-[1,6,8,11]
all          up   infinite      1   idle al2-t3-l-s-0
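
If it matters: as I understand it, the ~ suffix on idle~ means Slurm considers those nodes powered down under its cloud/power-saving support, so they have to be resumed before jobs can start on them. I've been sanity-checking the power-save settings and the drain reasons with something like this (the grep pattern is just my own):

    scontrol show config | grep -iE 'suspend|resume|power'
    sinfo -R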

The job isn't requesting anything special, just 1 core and 1G of memory.
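
For what it's worth, a minimal submission along these lines is representative of what's queuing (the partition and wrapped command here are placeholders):

    sbatch -p interactive -n 1 --mem=1G --wrap 'sleep 60'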

Any thoughts on why the scheduler would just stop scheduling jobs? This cluster is running on AWS, and my intention is to keep enough nodes available that jobs never queue; until tonight that has been working.

I've tried restarting slurmctld with an increased logging level, but that hasn't gotten me any further.
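
In case it helps, the debug level can also be raised on the fly instead of via a restart; I believe something like the following works (the Power debug flag is my guess at what's relevant here, since Slurm's power-save code is what resumes idle~ nodes):

    scontrol setdebug debug3
    scontrol setdebugflags +Power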

I see the following messages in slurmctld.log:

[2020-03-29T21:55:58.951] debug:  sched: Running job scheduler
[2020-03-29T21:55:58.953] debug:  sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.
[2020-03-29T21:56:58.932] debug:  sched: Running job scheduler
[2020-03-29T21:56:58.934] debug:  sched: JobId=409999. State=PENDING. Reason=Priority, Priority=100013. Partition=interactive.
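
sdiag might also show whether the main and backfill scheduling cycles are actually running; I haven't dug through its output yet:

    sdiag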

The output of scontrol for this job is:

JobId=409999 JobName=sbatch_wrap.sh
   UserId=akshabal(67674) GroupId=domain_users(66049) MCS_label=N/A
   Priority=100013 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2020-03-29T19:34:09 EligibleTime=2020-03-29T19:34:09
   AccrueTime=2020-03-29T19:34:09
   StartTime=2020-03-29T21:51:27 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-03-29T21:50:58
   Partition=interactive AllocNode:Sid=a-2vaol6a8g9ca8.mla.annapurna.aws.a2z.com:7549
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=8 NumTasks=1 CPUs/Task=8 ReqB:S:C:T=0:0:*:*
   TRES=cpu=8,mem=32G,node=1,billing=8
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=8 MinMemoryNode=32G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/tools/slurm/bin/sbatch_wrap.sh jg source.tcl
   WorkDir=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf
   StdErr=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out
   StdIn=/dev/null
   StdOut=/proj/trench_work4/akshabal/wa_fixes_array_sequencer/verif/fv/sunda_tpb/tpb_state_buf/slurm-409999.out
   Power=
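
To see whether resumes are even being attempted for the idle~ nodes, something like this should show an individual node's state and reason (the node name is just one example pulled from the sinfo list above):

    scontrol show node c7-c5n-18xl-0 | grep -E 'State|Reason'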

How do I go about debugging this?
