Re: [slurm-users] Areas for improvement on our site's cluster scheduling

Ole Holm Nielsen Tue, 08 May 2018 04:39:43 -0700

On 05/08/2018 09:49 AM, John Hearns wrote:

Actually what IS bad is users not putting cluster resources to good use. You can often see jobs which are 'stalled', i.e. the nodes are reserved for the job, but the internal logic of the job has failed and the executables have not launched. Or maybe some user is running an interactive job and has wandered off for coffee/beer/an extended holiday. It is well worth scanning for stalled jobs and terminating them.

I agree, and the way I monitor our cluster for jobs that do little or no useful work is through my small utility "pestat", available from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/pestat


I run "pestat -F" many times every day to spot inefficient jobs.
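For anyone who wants to see the general idea in code, here is a minimal sketch of one way to spot nodes that have CPUs allocated but show very little load. It is not how pestat works internally; it only parses the output of sinfo with the format string "%N %C %O" (node name, CPUs as allocated/idle/other/total, CPU load). The function names and the 0.5 threshold are assumptions chosen for illustration.

```python
import subprocess


def find_underused_nodes(sinfo_text, min_ratio=0.5):
    """Return (node, allocated_cpus, cpu_load) for nodes with a low load.

    A node is flagged when it has allocated CPUs but its CPU load is
    below min_ratio * allocated_cpus. The threshold is an assumption.
    """
    flagged = []
    seen = set()  # with -N, a node in several partitions is listed repeatedly
    for line in sinfo_text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] in seen:
            continue
        node, cpu_states, load_text = fields
        seen.add(node)
        try:
            allocated = int(cpu_states.split("/")[0])
            load = float(load_text)
        except ValueError:
            continue  # e.g. load reported as "N/A" for a down node
        if allocated > 0 and load < min_ratio * allocated:
            flagged.append((node, allocated, load))
    return flagged


def query_sinfo():
    """Run sinfo in node-oriented mode without a header line."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N %C %O"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for node, allocated, load in find_underused_nodes(query_sinfo()):
        print(f"{node}: {allocated} CPUs allocated, load {load:.2f}")
```

A low load is only a hint: jobs that are GPU-bound or waiting on I/O may look idle by this measure, so it is worth checking the job itself before contacting the user.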

If I want to list the user processes belonging to a job, I use "psjob <jobid>". I notify users and possibly cancel their jobs using the "notifybadjob <jobid>" script. These tools are available at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs


/Ole
