Re: [slurm-users] Graphing job metrics

Rémi Palancher Tue, 14 Nov 2017 04:35:55 -0800

Hi there,

On 13/11/2017 at 18:18, Nicholas McCollum wrote:

> Now that there is a slurm-users mailing list, I thought I would share
> something with the community that I have been working on to see if anyone else
> is interested in it.  I have a lot of students on my cluster and I really
> wanted a way to show my users how efficient their jobs are, or let them know
> that they are wasting resources.
>
> I created a few scripts that leverage Graphite and whisper databases (RRD like)
> to gather metrics from Slurm jobs running in cgroups.  The resolution for the
> metrics is defined by the retention interval that you specify in graphite.  In
> my case I can store 1 minute metrics for CPU usage and Memory usage for the
> entire lifetime of a job.

FWIW, at EDF we wrote a collectd[1] plugin some time ago that does basically the same thing, i.e. it explores the cgroups to get CPU/memory metrics out of the jobs' processes. The code is here:


  https://github.com/collectd/collectd/pull/1198

Then you gain all of collectd's flexibility in terms of metrics processing and backends (Graphite, RRD, InfluxDB, and so on).
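
To illustrate the general idea of exploring cgroups for per-job metrics, here is a minimal Python sketch. It is not the plugin's code. It assumes a cgroup v1 hierarchy laid out as <root>/<controller>/slurm/uid_<uid>/job_<jobid>/, and the function names and the "slurm.jobs" metric prefix are made up for the example. It reads the cumulative CPU time and the current memory usage of each job, then formats them as Graphite plaintext lines ("<path> <value> <timestamp>").

```python
import glob
import os
import time


def read_int(path):
    """Read a single integer from a cgroup pseudo-file."""
    with open(path) as f:
        return int(f.read().strip())


def collect_job_metrics(root="/sys/fs/cgroup"):
    """Return {job_id: {"uid", "cpu_ns", "mem_bytes"}} for the jobs found.

    Assumption: cgroup v1 with the layout
    <root>/<controller>/slurm/uid_<uid>/job_<jobid>/.
    """
    jobs = {}
    pattern = os.path.join(root, "cpuacct", "slurm", "uid_*", "job_*")
    for cpu_dir in glob.glob(pattern):
        # Directory names carry the identifiers: .../uid_1000/job_42
        job_id = os.path.basename(cpu_dir)[len("job_"):]
        uid = os.path.basename(os.path.dirname(cpu_dir))[len("uid_"):]
        mem_dir = os.path.join(
            root, "memory", "slurm", "uid_" + uid, "job_" + job_id
        )
        try:
            jobs[job_id] = {
                "uid": uid,
                # Cumulative CPU time consumed by the job, in nanoseconds
                "cpu_ns": read_int(os.path.join(cpu_dir, "cpuacct.usage")),
                # Current memory usage of the job, in bytes
                "mem_bytes": read_int(
                    os.path.join(mem_dir, "memory.usage_in_bytes")
                ),
            }
        except (OSError, ValueError):
            # The job may have ended between the listing and the read
            continue
    return jobs


def to_graphite_lines(jobs, prefix="slurm.jobs", now=None):
    """Format metrics as Graphite plaintext lines: '<path> <value> <ts>'."""
    ts = int(time.time()) if now is None else int(now)
    lines = []
    for job_id in sorted(jobs):
        for name in ("cpu_ns", "mem_bytes"):
            lines.append(
                "%s.%s.%s %d %d" % (prefix, job_id, name, jobs[job_id][name], ts)
            )
    return lines
```

Each resulting line, terminated by a newline, could then be written to Graphite's plaintext listener (TCP port 2003 by default). Since cpuacct.usage is a cumulative counter, a CPU usage rate has to be derived from the difference between two samples.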

We also wrote a tiny web interface to visualize the metrics. You can find out more by searching for 'jobmetrics' in the following slides:


  https://slurm.schedmd.com/SLUG16/EDF.pdf

NB: my intent is just to share, not to hijack the thread. Please forgive me if it comes across the wrong way.


Best,
Rémi

[1] https://collectd.org/
