On 22/11/18 5:41 am, Ryan Novosielski wrote:
> As you can see, both of the above are examples of jobs whose allocated CPU counts are very different from the actual CPU load: the first is using far more than it was allocated (though they’re in a cgroup, so theoretically isolated from the other users on the machine), and the second asked for all 28 CPUs but is only “using” ~8 of them.
I've just had a quick play with pestat, and it reveals that Slurm 18.08.3 seems to have some odd ideas about the load on nodes. For instance, one of our KNL nodes that is offline is reported with a CPUload of 2.70, but I can see nothing running on it and its load average is around 0.1 (mostly top itself). Conversely, a Skylake node that's flat out, with a load average of 32 (all from compute-bound processes at 100% CPU), is reported with a CPUload of only 2.5. That CPUload value is just taken from the output of "sinfo", and I've confirmed myself that the numbers are off in that output.
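For anyone who wants to reproduce the comparison, here's a rough sketch of cross-checking sinfo's reported load against what the nodes themselves say. The node names, numbers, and the pdsh invocation are illustrative only; the two files stand in for real output of something like `sinfo -h -N -O "NodeList,CPUsLoad"` (the scheduler's view) and `pdsh -w knl001,sky042 cat /proc/loadavg` (the nodes' own view):

```shell
# Fake scheduler-side data standing in for: sinfo -h -N -O "NodeList,CPUsLoad"
cat > /tmp/sinfo_view <<'EOF'
knl001 2.70
sky042 2.50
EOF
# Fake node-side data standing in for the 1-minute field of /proc/loadavg
cat > /tmp/node_view <<'EOF'
knl001 0.10
sky042 32.00
EOF
# Print any node where the two views disagree by more than 1.0
out=$(awk 'NR==FNR { sinfo[$1] = $2; next }
           { d = sinfo[$1] - $2; if (d < 0) d = -d
             if (d > 1) printf "%s: sinfo says %.2f, node says %.2f\n", $1, sinfo[$1], $2 }' \
          /tmp/sinfo_view /tmp/node_view)
printf '%s\n' "$out"
```

With the sample data above it flags both nodes, matching the discrepancies described.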
> If you’re using cgroups, it would seem to me that there must also be a way to see the output of “top” for just a group, or at least something similar. systemd-cgtop does more or less that, but doesn’t seem to show exactly what you’d want here:
[...]
> ...CPU only being shown as an aggregate at the top level

If you run:

  systemd-cgtop -c

it will sort by CPU usage and be more useful! :-)

All the best,
Chris

-- 
 Chris Samuel  :  http://www.csamuel.org/ :  Melbourne, VIC
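P.S. For the curious, the arithmetic behind a per-cgroup CPU figure is simple enough to do by hand: under cgroup v1, if the cpuacct controller is enabled, a job's cumulative CPU time (in nanoseconds) appears under a path like /sys/fs/cgroup/cpuacct/slurm/uid_<UID>/job_<JOBID>/cpuacct.usage, and sampling it twice and dividing by the wall-clock interval gives "CPUs busy". The two sample values below are made up, not real readings:

```shell
# Stand-ins for two readings of cpuacct.usage, one second apart.
t0=1000000000            # first reading, ns of CPU time consumed so far
t1=15000000000           # reading taken one second later
interval_ns=1000000000   # 1 s of wall clock between samples
busy=$(( (t1 - t0) / interval_ns ))
echo "~${busy} CPUs busy over the interval"
```

Fourteen CPU-seconds consumed in one wall-clock second means roughly fourteen cores were busy, which is the kind of number you'd want to compare against a job's allocation.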