Hi Edward,
The squeue command tells you about job status. You can get extra
information using format options (see the squeue man-page). I like to
set this environment variable for squeue:
export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"
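With that format in place you can, for example, list only the pending jobs
together with their pending reason (the %r format field prints the reason,
e.g. Priority or Resources):

  # pending jobs with priority, state and reason
  squeue --states=PENDING -o "%.18i %.9P %.8u %.9Q %.10T %r"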
When some jobs are pending with Reason=Priority, it means that other jobs
with a higher priority are waiting for the same resources (CPUs) to become
available; those higher-priority jobs will show Reason=Resources in the
squeue output.
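To see which pending jobs are at the front of the queue, you can sort them by
descending priority and - if the multifactor priority plugin is in use -
inspect the individual priority factors with sprio:

  # pending jobs, highest priority first
  squeue --states=PENDING --sort=-p -o "%.18i %.9Q %.8u %.10T %r"
  # long listing of priority factors (age, fairshare, etc.)
  sprio -l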
When you have idle nodes, yet jobs are pending, this probably means that
your Slurm partitions have been defined with inappropriate limits or
resources - it's hard to guess from here. Use "scontrol show partitions"
to display the partition settings.
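For example, using the partition and one of the idle node names from your
pestat listing below:

  # partition limits (MaxNodes, MaxTime, AllowQos, etc.)
  scontrol show partition batch
  # node state and allocation details
  scontrol show node pcomp13

Look for unexpected limits in the partition output, and check the State,
CPUAlloc and AllocMem fields in the node output.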
/Ole
On 7/9/19 2:37 AM, Edward Ned Harvey (slurm) wrote:
I have a cluster where I submit a bunch of jobs (600), but the cluster
only runs about 20 at a time. By using pestat, I can see there are a
bunch of systems with plenty of available CPU and memory.
Hostname  Partition  Node   Num_CPU  CPUload  Memsize  Freemem
                     State  Use/Tot             (MB)      (MB)
pcomp13   batch*     idle     0  72    8.19*   258207   202456
pcomp14   batch*     idle     0  72    0.00    258207   206558
pcomp16   batch*     idle     0  72    0.05    258207   230609
pcomp17   batch*     idle     0  72    8.51*   258207   184492
pcomp18   batch      mix     14  72    0.29*   258207   230575
pcomp19   batch*     idle     0  72   10.11*   258207   179604
pcomp20   batch*     idle     0  72    9.56*   258207   180961
pcomp21   batch*     idle     0  72    0.10    258207   227255
pcomp25   batch*     idle     0  72    0.07    258207   218035
pcomp26   batch*     idle     0  72    0.03    258207   226489
pcomp27   batch*     idle     0  72    0.25    258207   228580
pcomp28   batch*     idle     0  72    8.15*   258207   184306
pcomp29   batch      mix      2  72    0.01*   258207   226256
How can I tell why jobs aren't running? "scontrol show job 123456" shows
"JobState=PENDING Reason=Priority" which doesn't shed any light on the
situation for me. The pending jobs have requested 1 cpu each and 2G of
memory.
Should I just restart slurm daemons? Or is there some way for me to see
why these systems aren't running jobs?