I am an experienced sysadmin but new to administering Slurm, and I'm running
into some difficulty:

For simple questions such as "how many CPUs are currently in use in the foobar
partition?" or "give me an overview of the waiting jobs and the reasons they're
waiting," I don't yet have a good, easy way to get an answer. I can get the
total number of CPUs in a partition via "scontrol show partition foobar", the
number of CPUs in use on a particular node via "scontrol show node somenode",
and a (not easily parsable) list of the nodes within a partition via "sinfo".
So all the information is available, but it is difficult to access because
getting at it requires nontrivial parsing.
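
To make the question concrete, here is a minimal Python sketch of the kind of
helper I have in mind. It rests on assumptions I have not verified across
Slurm versions: that sinfo's "%C" format field prints CPU counts as
allocated/idle/other/total in plain integers, and that squeue's "%r" field
prints the reason a job is in its current state. The function names
(parse_cpu_states, count_pending_reasons, run) are hypothetical, not part of
any Slurm tool.

```python
import subprocess
from collections import Counter

# Field order assumed for sinfo's "%C" output: allocated/idle/other/total.
CPU_FIELDS = ("allocated", "idle", "other", "total")


def parse_cpu_states(sinfo_output):
    """Sum CPU counts from the text printed by `sinfo -h -p PART -o %C`.

    Each non-empty line is assumed to look like "12/4/0/16". Lines are
    summed in case sinfo prints more than one row for the partition.
    """
    totals = dict.fromkeys(CPU_FIELDS, 0)
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line:
            continue
        for key, value in zip(CPU_FIELDS, line.split("/")):
            totals[key] += int(value)
    return totals


def count_pending_reasons(squeue_output):
    """Count jobs per reason from `squeue -h -t PENDING -o %r` output.

    One reason per line is assumed, e.g. "Priority" or "Resources".
    """
    reasons = (line.strip() for line in squeue_output.splitlines())
    return Counter(reason for reason in reasons if reason)


def run(cmd):
    """Run a command given as an argument list and return its stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

On a cluster this would be used along the lines of
parse_cpu_states(run(["sinfo", "-h", "-p", "foobar", "-o", "%C"])) and
count_pending_reasons(run(["squeue", "-h", "-t", "PENDING", "-o", "%r"])).
Keeping the parsing separate from the command invocation means the parsers
can be checked against captured output without a running cluster.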

I see projects such as https://github.com/fasrc/slurm_showq and
https://github.com/fasrc/scalc, which seem to have been created for exactly
this purpose: making information in Slurm more easily accessible.

So, is there a better way to manage a Slurm cluster? Are there better tools, or
better ways to use the existing ones? Do experienced Slurm admins have any
other suggestions for me, such as a cheatsheet of common commands, or scripts
like slurm_showq and scalc? Or is this just the normal state of the world?

