Hi there,
On 13/11/2017 at 18:18, Nicholas McCollum wrote:
Now that there is a slurm-users mailing list, I thought I would share
something with the community that I have been working on to see if anyone else
is interested in it. I have a lot of students on my cluster and I really
wanted a way to show my users how efficient their jobs are, or let them know
that they are wasting resources.
I created a few scripts that leverage Graphite and Whisper databases (RRD-like)
to gather metrics from Slurm jobs running in cgroups. The resolution for the
metrics is defined by the retention interval that you specify in Graphite. In
my case I can store 1-minute metrics for CPU usage and memory usage for the
entire lifetime of a job.
FWIW, some time ago at EDF we wrote a collectd[1] plugin that does
basically the same thing, i.e. it explores the cgroups to get CPU/memory
metrics out of the jobs' processes. The code is here:
https://github.com/collectd/collectd/pull/1198
With it, you get all of collectd's flexibility in terms of metrics
processing and backends (Graphite, RRD, InfluxDB, and so on).
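
To give a rough idea of what "exploring the cgroups" means, here is a
minimal Python sketch. It is not the plugin's code, only an illustration.
It assumes a cgroup v1 hierarchy mounted under /sys/fs/cgroup with Slurm
job directories laid out as slurm/uid_<uid>/job_<jobid> (this may differ
on your site), and the function names and the "slurm.jobs" metric prefix
are made up for the example. The output uses Graphite's plaintext format,
one "<path> <value> <timestamp>" line per metric.

```python
import re
import time
from pathlib import Path

# Assumption: cgroup v1 layout with Slurm job cgroups under
# <controller>/slurm/uid_<uid>/job_<jobid>; adjust for your site.
CGROUP_ROOT = Path("/sys/fs/cgroup")
JOB_DIR_RE = re.compile(r"^job_(\d+)$")


def read_int(path):
    """Return the integer stored in a cgroup file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def collect_job_metrics(root=CGROUP_ROOT):
    """Map job id -> {"cpu_ns": ..., "mem_bytes": ...} for jobs under root."""
    metrics = {}
    sources = (
        ("cpu_ns", "cpuacct", "cpuacct.usage"),          # cumulative CPU time (ns)
        ("mem_bytes", "memory", "memory.usage_in_bytes"),  # current memory usage
    )
    for name, controller, filename in sources:
        for job_dir in (Path(root) / controller / "slurm").glob("uid_*/job_*"):
            match = JOB_DIR_RE.match(job_dir.name)
            if not match:
                continue
            value = read_int(job_dir / filename)
            if value is not None:
                metrics.setdefault(match.group(1), {})[name] = value
    return metrics


def to_graphite_lines(metrics, prefix="slurm.jobs", timestamp=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return [
        f"{prefix}.{job_id}.{name} {value} {ts}"
        for job_id, values in sorted(metrics.items())
        for name, value in sorted(values.items())
    ]


if __name__ == "__main__":
    # Hypothetical metric prefix; print the lines instead of sending them.
    print("\n".join(to_graphite_lines(collect_job_metrics())))
```

Run periodically on each compute node, the printed lines could be sent to
a Graphite/carbon plaintext listener. Note that cpuacct.usage is a
cumulative counter, so CPU usage over an interval is the difference
between two samples.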
We also wrote a tiny web interface to visualize the metrics. One can
find out more by searching 'jobmetrics' in the following slides:
https://slurm.schedmd.com/SLUG16/EDF.pdf
NB: my intent is just to share, not to hijack the thread. My apologies
if it comes across the wrong way.
Best,
Rémi
[1] https://collectd.org/