On 11/22/2018 12:10 AM, Christopher Samuel wrote:
I've just had a quick play with pestat and it reveals that Slurm
18.08.3 seems to have some odd ideas about load on nodes, for instance
one of our KNL nodes that is offline is reported with a CPULoad of
2.70, but I can see nothing running on it and the load average is
around 0.1 (which is mostly top).
Conversely a skylake node that's flat out with a load average of 32
(all from compute bound processes at 100% CPU) is reported with a
CPULoad of 2.5.
The CPULoad is just taken from the output of "sinfo", and I've confirmed
myself that the numbers are off in that output.
FYI: Here are the sinfo flags which I use in pestat:
# sinfo output: NODELIST PARTITION CPU CPU_LOAD MEMORY FREE_MEM STATE THREADS GRES
sinfo -N -o "%N %P %C %O %m %e %t %Z %G"
The CPU_LOAD output should originate from the slurmd daemon running on
each compute node. Chris' observations might indicate that slurmd
version 18.08.3 doesn't show the correct CPU_LOAD numbers. Our cluster
runs 17.11.12 and I don't see any such problems!
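
To check a node by hand, one could compare the value sinfo reports with the kernel's own 1-minute load average on that node. The following is only a rough sketch: the function name load_mismatch and the default tolerance of 1.0 are my own assumptions, not part of pestat or Slurm.

```shell
#!/bin/sh
# Hypothetical helper (not part of pestat or Slurm).
# Returns success (exit status 0) when the two load values differ by more
# than the tolerance, and failure (exit status 1) otherwise.
#   $1 = load reported by Slurm (e.g. the %O field from sinfo)
#   $2 = load seen by the kernel (e.g. first field of /proc/loadavg)
#   $3 = tolerance, assumed default 1.0
load_mismatch() {
    awk -v r="$1" -v a="$2" -v tol="${3:-1.0}" 'BEGIN {
        d = (r + 0) - (a + 0)        # force numeric; non-numeric input counts as 0
        if (d < 0) d = -d            # absolute difference
        exit ((d > tol + 0) ? 0 : 1)
    }'
}

# Possible usage on a compute node (assumes sinfo is in PATH there):
#   reported=$(sinfo -h -n "$(hostname -s)" -o "%O" | head -n1)
#   actual=$(cut -d" " -f1 /proc/loadavg)
#   load_mismatch "$reported" "$actual" && echo "CPU_LOAD mismatch: $reported vs $actual"
```

With Chris' numbers, load_mismatch 2.5 32 and load_mismatch 2.70 0.1 would both flag a mismatch. Note that slurmd does not necessarily report the load at every instant, so a small difference between the two values is to be expected.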
/Ole