On 05/08/2018 09:49 AM, John Hearns wrote:
Actually, what IS bad is users not putting cluster resources to good use. You can often see jobs which are 'stalled', i.e. the nodes are reserved for the job, but the internal logic of the job has failed and the executables have not launched. Or maybe some user is running an interactive job and has wandered off for coffee/beer/an extended holiday. It is well worth scanning for stalled jobs and terminating them.

I agree. I monitor our cluster for jobs that do little or no useful work with my small utility "pestat", available from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat

I run "pestat -F" many times every day to spot inefficient jobs.
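
To illustrate the general idea of spotting such jobs, here is a minimal Python sketch. It is not pestat's actual code: it simply assumes that a node is suspicious when its CPU load is far below the number of CPUs allocated to jobs on it. The names NodeUsage and flag_underused, the sample node names, and the 0.5 threshold are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    name: str
    alloc_cpus: int   # CPUs allocated to jobs on this node
    cpu_load: float   # load average reported for this node


def flag_underused(nodes, min_ratio=0.5):
    """Return names of nodes whose load is well below their allocated CPUs.

    min_ratio is a hypothetical threshold: a node is flagged when
    cpu_load / alloc_cpus falls below it.
    """
    flagged = []
    for node in nodes:
        if node.alloc_cpus == 0:
            continue  # nothing is reserved here, so nothing is wasted
        if node.cpu_load / node.alloc_cpus < min_ratio:
            flagged.append(node.name)
    return flagged


if __name__ == "__main__":
    sample = [
        NodeUsage("node001", alloc_cpus=16, cpu_load=0.1),   # likely stalled
        NodeUsage("node002", alloc_cpus=16, cpu_load=15.8),  # busy as expected
        NodeUsage("node003", alloc_cpus=0, cpu_load=0.0),    # idle, unallocated
    ]
    print(flag_underused(sample))
```

In practice the per-node numbers would have to come from Slurm itself. Any node flagged this way still needs a look at the actual user processes before the job is judged to be stalled.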

If I want to list the user processes belonging to a job, I use "psjob <jobid>". I notify users and possibly cancel their jobs using the "notifybadjob <jobid>" script. These tools are available at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs

/Ole
