On 21-11-2018 19:41, Ryan Novosielski wrote:
Ole’s “pestat” script does allow you to get similar information, but I’m 
interested to see if indeed there’s a better answer. I’ve used his script for 
more or less the same reason, to see if the jobs are using the resources 
they’re allocated. It reports at the node level, though, so you then have to 
look closer. For example:

Print only nodes that are flagged by * (RED nodes)
Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                             State Use/Tot              (MB)     (MB)  JobId User ...
   gpu003            oarc     drng*  8  12   58.06*    64000    24507  82565618 yc567
...
  hal0027          kopp_1    alloc  28  28    8.64*   128000   115610  82591085 mes373 82595703 aek119

As you can see, both of the above are examples of jobs whose allocated CPU 
counts differ greatly from the actual CPU load: the first is using far more 
than allocated (though it is in a cgroup, so theoretically isolated from the 
other users on the machine), and the second asked for all 28 CPUs but is only 
“using” about 8 of them.
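
As a rough illustration of the comparison described above, here is a minimal Python sketch that flags nodes whose CPU load is far from the number of allocated CPUs. It is not part of pestat: the column order is assumed from the sample rows above, and the function names (parse_row, load_mismatch) and the 50% tolerance are hypothetical choices made only for this example.

```python
# Sketch only: column order is assumed from the sample pestat rows above,
# and the 50% tolerance is an arbitrary illustrative threshold.

def parse_row(line):
    """Split one pestat-style node row into a dict (assumed column order)."""
    fields = line.split()
    return {
        "host": fields[0],
        "partition": fields[1],
        "state": fields[2].rstrip("*"),          # drop the "flagged" marker
        "cpus_used": int(fields[3]),             # CPUs allocated to jobs
        "cpus_total": int(fields[4]),
        "cpuload": float(fields[5].rstrip("*")),
    }

def load_mismatch(row, tolerance=0.5):
    """Return 'over', 'under', or None by comparing load to allocated CPUs."""
    allocated = row["cpus_used"]
    if allocated == 0:
        # No CPUs allocated: any noticeable load counts as "over".
        return "over" if row["cpuload"] > tolerance else None
    ratio = row["cpuload"] / allocated
    if ratio > 1 + tolerance:
        return "over"
    if ratio < 1 - tolerance:
        return "under"
    return None

sample = [
    "gpu003 oarc drng* 8 12 58.06* 64000 24507 82565618 yc567",
    "hal0027 kopp_1 alloc 28 28 8.64* 128000 115610 82591085 mes373 82595703 aek119",
]

for line in sample:
    row = parse_row(line)
    print(row["host"], row["cpus_used"], row["cpuload"], load_mismatch(row))
```

With the two sample rows, gpu003 comes out as "over" (load far above its 8 allocated CPUs) and hal0027 as "under" (load well below its 28 allocated CPUs). This only works at the node level, so it still cannot say which job on a shared node is responsible.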

A possible solution is my "psjob" tool, which prints the ps process status on each node in a job's node list, excluding system processes. Usage: psjob <jobid>. It requires ClusterShell.

This allows a convenient way to get an overview of the process status of the job's tasks. Perhaps you could check whether this information is enough for you?

Download "psjob" (as well as other Slurm job tools) from my page:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs

Installation of the ClusterShell prerequisite is described in my Slurm Wiki pages at
https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell

/Ole
