On 21-11-2018 19:41, Ryan Novosielski wrote:
Ole’s “pestat” script does allow you to get similar information, but I’m
interested to see if there’s indeed a better answer. I’ve used his script for
more or less the same reason: to see whether jobs are using the resources
they’re allocated. It shows things at the node level though, and then you have
to look closer. For example:
Print only nodes that are flagged by * (RED nodes)
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                        State Use/Tot              (MB)     (MB)  JobId User ...
gpu003          oarc    drng*   8  12   58.06*    64000    24507  82565618 yc567
...
hal0027       kopp_1    alloc  28  28    8.64*   128000   115610  82591085 mes373 82595703 aek119
You can see that both of the above are examples of jobs whose allocated CPU
count is very different from the actual CPU load: the first one is using way
more than it was allocated (though it’s in a cgroup, so theoretically isolated
from the other users on the machine), and the second one asked for all 28 CPUs
but is only “using” ~8 of them.
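For a quick spot check of a single node, the numbers that pestat aggregates can
also be read straight from the standard Slurm commands. This is only a rough
sketch (the node name is just an example argument):

#!/usr/bin/env bash
# Rough sketch: compare a node's allocated CPUs with its actual load.
NODE=$1
if [ -z "$NODE" ]; then
    echo "Usage: $0 <nodename>" >&2
    exit 1
fi

# "scontrol show node" reports CPUAlloc, CPUTot, CPULoad, RealMemory and FreeMem
scontrol show node "$NODE" | grep -Eo '(CPUAlloc|CPUTot|CPULoad|RealMemory|FreeMem)=[^ ]+'

# Jobs currently running on that node, with their allocated CPU counts
squeue --nodelist="$NODE" --states=RUNNING -o "%.12i %.10u %.5C %.12M %j"

That is essentially what the gpu003 and hal0027 lines above summarize in one
row per node.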
I have a possible solution with my "psjob" tool, which prints a ps
process status listing on a job's node list but excludes system processes:
psjob <jobid>. It requires ClusterShell.
This gives you a convenient way to get an overview of the process status of
the job's tasks. Perhaps you could check whether this information is
enough for you?
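For reference, the basic idea is roughly the following. This is only a minimal
sketch assuming ClusterShell's clush is installed, not the actual psjob script,
which does more filtering and error checking:

#!/usr/bin/env bash
# Minimal sketch: run ps on every node of a Slurm job via ClusterShell,
# limited to the job owner's processes so that system daemons are left out.
JOBID=$1
if [ -z "$JOBID" ]; then
    echo "Usage: $0 <jobid>" >&2
    exit 1
fi

# Job owner and (compressed) node list from squeue
JOBUSER=$(squeue --noheader -j "$JOBID" -o "%u")
NODELIST=$(squeue --noheader -j "$JOBID" -o "%N")
if [ -z "$NODELIST" ]; then
    echo "Job $JOBID not found or not running" >&2
    exit 1
fi

# clush (from ClusterShell) understands Slurm's compressed node ranges;
# -b folds identical output from different nodes together.
clush -b -w "$NODELIST" "ps -o pid,state,pcpu,pmem,etime,args -u $JOBUSER"

For the hal0027 example above, "psjob 82591085" would then list mes373's
processes on that node.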
Download "psjob" (as well as other Slurm job tools) from my page:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs
Installation of the ClusterShell prerequisite is described in my Slurm
Wiki pages at
https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell
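For reference, on CentOS/RHEL with the EPEL repository enabled this typically
amounts to a single package install (or a pip install elsewhere):

# CentOS/RHEL with EPEL enabled
yum install clustershell

# or, on other systems, via pip
pip install ClusterShell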
/Ole