Hi Edward,

The squeue command tells you about job status. You can get extra information using format options (see the squeue man-page). I like to set this environment variable for squeue:

export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"
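With that format set, squeue shows the job priority (%Q) and the pending reason (%R) directly. To look at just your own pending jobs, for example:

squeue -u $USER -t PENDING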

When jobs are pending with Reason=Priority, it means that other jobs with a higher priority are waiting for the same resources (CPUs) to become available; those higher-priority jobs will show Reason=Resources in the squeue output.
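To see which jobs are ahead in the queue and what makes up their priority, something like this should work (pending jobs sorted by descending priority, then the per-job priority factor breakdown):

squeue -t PENDING --sort=-p
sprio -l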

When you have idle nodes yet jobs remain pending, it usually means that your Slurm partitions are defined with limits or resources that don't match the jobs - it's hard to guess without more details. Use "scontrol show partitions" to display the partition settings.
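For example, to check the batch partition, one of the idle nodes, and (if you use them) any QOS limits:

scontrol show partition batch
scontrol show node pcomp14
sacctmgr show qos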

/Ole

On 7/9/19 2:37 AM, Edward Ned Harvey (slurm) wrote:
I have a cluster where I submit a bunch of jobs (600), but the cluster only runs about 20 at a time. Using pestat, I can see there are a bunch of systems with plenty of available CPU and memory.

Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem
                             State Use/Tot              (MB)     (MB)
  pcomp13          batch*     idle   0  72    8.19*   258207   202456
  pcomp14          batch*     idle   0  72    0.00    258207   206558
  pcomp16          batch*     idle   0  72    0.05    258207   230609
  pcomp17          batch*     idle   0  72    8.51*   258207   184492
  pcomp18           batch      mix  14  72    0.29*   258207   230575
  pcomp19          batch*     idle   0  72   10.11*   258207   179604
  pcomp20          batch*     idle   0  72    9.56*   258207   180961
  pcomp21          batch*     idle   0  72    0.10    258207   227255
  pcomp25          batch*     idle   0  72    0.07    258207   218035
  pcomp26          batch*     idle   0  72    0.03    258207   226489
  pcomp27          batch*     idle   0  72    0.25    258207   228580
  pcomp28          batch*     idle   0  72    8.15*   258207   184306
  pcomp29           batch      mix   2  72    0.01*   258207   226256

How can I tell why jobs aren't running? "scontrol show job 123456" shows "JobState=PENDING Reason=Priority", which doesn't shed any light on the situation for me. The pending jobs have each requested 1 CPU and 2G of memory.

Should I just restart the Slurm daemons? Or is there some way for me to see why these systems aren't running jobs?
