Hi,
There is an official page that links to many third-party
solutions you can use:
https://slurm.schedmd.com/download.html
In my opinion, the best Slurm page for system administration is:
https://wiki.fysik.dtu.dk/niflheim/SLURM
On that page you can find many of the links and much of the information
you need. However, I don't think there is a generally accepted or official
solution for monitoring a cluster. That is probably because Slurm
is more of an HPC Lego set than a prebuilt toy.
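For the two questions quoted below, here is a minimal Python sketch rather than a standard tool. It assumes sinfo's %C output field (CPU counts as allocated/idle/other/total) and squeue's %r field (the reason a job is in its current state), and that the counts are printed as plain integers. The function names are made up for illustration, and "foobar" is the partition name from your message.

```python
import subprocess
from collections import Counter

CPU_FIELDS = ("allocated", "idle", "other", "total")


def parse_cpu_states(text):
    """Sum every 'allocated/idle/other/total' token found in sinfo %C output."""
    totals = dict.fromkeys(CPU_FIELDS, 0)
    for token in text.split():
        for key, value in zip(CPU_FIELDS, token.split("/")):
            totals[key] += int(value)
    return totals


def count_pending_reasons(text):
    """Tally squeue %r output, which prints one reason per line."""
    return Counter(line.strip() for line in text.splitlines() if line.strip())


def run(cmd):
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def partition_cpu_usage(partition):
    # -h: no header, -p: limit to one partition, -o "%C": CPU counts by state
    return parse_cpu_states(run(["sinfo", "-h", "-p", partition, "-o", "%C"]))


def pending_reasons(partition):
    # -t PD: pending jobs only, -o "%r": print just the reason column
    out = run(["squeue", "-h", "-p", partition, "-t", "PD", "-o", "%r"])
    return count_pending_reasons(out)


if __name__ == "__main__":
    print(partition_cpu_usage("foobar"))
    for reason, jobs in pending_reasons("foobar").most_common():
        print(reason, jobs)
```

The parsing is kept separate from the subprocess calls so it can be checked against saved command output on a machine without Slurm installed.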
Regards,
Ahmet M.
On 8.07.2019 at 22:33, Edward Ned Harvey (slurm) wrote:
I am an experienced sysadmin, new to being a slurm admin, and I'm
encountering some difficulty:
For a simple question such as "how many CPUs are currently
being used in the foobar partition," or "give me an overview of the
waiting jobs and the reasons they're waiting," I don't yet have
any good, easy way to get an answer. I can get the total
number of CPUs in a partition via "scontrol show partition foobar",
and I can get how many CPUs are being used on a particular node via
"scontrol show node somenode", and I can get a (not easily parsable)
list of nodes within a partition via "sinfo". So all the information
is available, but it is very difficult to access because it would require
some very nontrivial parsing.
I see projects like this: https://github.com/fasrc/slurm_showq and
https://github.com/fasrc/scalc which
seem to be created exactly for this purpose. They're trying to make
information in slurm more easily accessible.
So, is there a better way to manage a slurm cluster, are there better
tools, or better ways to use them? Any other suggestions for me from
experienced slurm admins? Like, a cheatsheet of common commands or
scripts like slurm_showq and scalc? Or is this just the normal state
of the world?