Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:
> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it.  I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
>
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups.  The resolution for the
> metrics is defined by the retention interval that you specify in graphite.  In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.

FWIW, at EDF we wrote a collectd[1] plugin some time ago that does basically the same thing, i.e. it explores the cgroups to get CPU/memory metrics out of the jobs' processes. The code is here:

  https://github.com/collectd/collectd/pull/1198

Then, you gain all collectd flexibility in terms of metrics processing and backends (graphite, RRD, influxdb, and so on).
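
To give an idea of the approach, here is a minimal Python sketch of the cgroup exploration. It is not the code of the plugin above nor of Nicholas' scripts. It assumes a cgroup v1 layout where each job gets a job_<id> directory containing cpuacct.usage and memory.usage_in_bytes. The roots you pass in, the "slurm.jobs" metric prefix and all function names are made up for the illustration. The output lines follow the Graphite plaintext format ("path value timestamp"):

```python
import os
import re
import socket
import time

# Assumption: Slurm's cgroup plugin creates one "job_<id>" directory per job.
JOB_DIR = re.compile(r"^job_(\d+)$")


def find_job_dirs(root):
    """Yield (job_id, path) for every job_<id> directory below root."""
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            match = JOB_DIR.match(name)
            if match:
                yield int(match.group(1)), os.path.join(dirpath, name)


def read_counter(job_path, filename):
    """Return the integer stored in a cgroup file, or None if it is gone."""
    try:
        with open(os.path.join(job_path, filename)) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # The job may have ended between the directory walk and the read.
        return None


def graphite_lines(cpu_root, mem_root, prefix="slurm.jobs", now=None):
    """Build Graphite plaintext lines for CPU time and memory usage per job."""
    now = int(time.time()) if now is None else int(now)
    sources = (
        (cpu_root, "cpuacct.usage", "cpu_ns"),
        (mem_root, "memory.usage_in_bytes", "mem_bytes"),
    )
    lines = []
    for root, filename, metric in sources:
        for job_id, path in sorted(find_job_dirs(root)):
            value = read_counter(path, filename)
            if value is not None:
                lines.append(f"{prefix}.{job_id}.{metric} {value} {now}")
    return lines


def send_to_graphite(lines, host, port=2003):
    """Send the lines to a carbon plaintext listener (2003 is its usual port)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Run from cron or a small loop on each compute node, something like send_to_graphite(graphite_lines("/sys/fs/cgroup/cpuacct/slurm", "/sys/fs/cgroup/memory/slurm"), "graphite.example.org") would push one sample per job. The paths and the host name are placeholders, so check them against your own setup. cpuacct.usage is a cumulative counter in nanoseconds, so the CPU usage rate has to be derived on the Graphite side.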

We also wrote a tiny web interface to visualize the metrics. One can find out more by searching 'jobmetrics' in the following slides:

  https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to hijack the thread. Please forgive me if it comes across the wrong way.

Best,
Rémi

[1] https://collectd.org/
