On 05/08/2018 09:49 AM, John Hearns wrote:
> Actually what IS bad is users not putting cluster resources to good use.
> You can often see jobs which are 'stalled', i.e. the nodes are reserved
> for the job, but the internal logic of the job has failed and the
> executables have not launched. Or maybe some user is running an
> interactive job and has wandered off for coffee/beer/an extended holiday.
> It is well worth scanning for stalled jobs and terminating them.

I agree. I monitor our cluster for jobs that do little or no useful
work with my small utility "pestat", available from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat
I run "pestat -F" many times every day to spot inefficient jobs.
If I want to list the user processes belonging to a job, I use "psjob
<jobid>". I notify users and possibly cancel their jobs using the
"notifybadjob <jobid>" script. These tools are available at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
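
For readers who want to see the general idea behind this kind of check,
here is a minimal sketch in Python. It is not how pestat works internally;
it only illustrates flagging nodes whose CPU load is far below their
number of allocated CPUs. It assumes input lines of the form
"<node> <alloc>/<idle>/<other>/<total> <load>", which is what
sinfo -N -h -o "%N %C %O" should print. The function names, the 0.5
threshold and the sample data are made up for illustration.

```python
def parse_node_line(line):
    """Split '<node> <alloc>/<idle>/<other>/<total> <load>' into fields."""
    node, cpus, load = line.split()
    alloc = int(cpus.split("/")[0])
    try:
        load_value = float(load)
    except ValueError:
        # Assumption: a non-numeric load means the load is unknown
        load_value = None
    return node, alloc, load_value


def find_underused(lines, min_ratio=0.5):
    """Return (node, allocated CPUs, load) for nodes that look stalled.

    A node is flagged when it has allocated CPUs but its load per
    allocated CPU is below min_ratio (a hypothetical threshold).
    """
    flagged = []
    for line in lines:
        if not line.strip():
            continue
        node, alloc, load = parse_node_line(line)
        if alloc == 0 or load is None:
            continue  # idle node or unknown load: nothing to judge
        if load / alloc < min_ratio:
            flagged.append((node, alloc, load))
    return flagged


if __name__ == "__main__":
    # Made-up sample data, not output from a real cluster
    sample = [
        "a001 16/0/0/16 0.03",
        "a002 16/0/0/16 15.80",
        "a003 0/16/0/16 0.00",
    ]
    for node, alloc, load in find_underused(sample):
        print(f"{node}: {alloc} CPUs allocated, load {load}")
```

A low load alone does not prove that a job is stalled (it may be waiting
on I/O or a GPU), so a check like this only gives candidates to inspect by
hand.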
/Ole