Hi Aravindh

For our small 3-node cluster I've hacked together a per-node Python script that 
collects current and peak CPU, memory, and scratch disk usage for all jobs 
running on the cluster and builds a fairly simple web page from it. It 
shouldn't be hard to make it store those data points over time, then shove them 
through an R script to plot the usage:

https://github.com/shawarden/simple-web?
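For anyone curious what such a collector might look like, here is a minimal sketch in the same spirit: it samples a process's RSS and cumulative CPU time from /proc at a fixed interval, tracking the peak alongside a time series. This is an illustrative stand-alone example, not the actual script above; it assumes Linux /proc and the function names are made up.

```python
#!/usr/bin/env python3
"""Minimal per-process usage sampler (assumes Linux /proc).

Illustrative sketch only: samples RSS and cumulative CPU ticks for a
PID at a fixed interval, keeping the peak RSS and a time series that
could later be dumped and plotted (e.g. via an R script)."""
import time


def sample(pid):
    """Return (rss_kb, cpu_ticks) for one process."""
    rss_kb = 0
    # VmRSS line in /proc/<pid>/status gives resident memory in kB.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                rss_kb = int(line.split()[1])
    # Fields 14 and 15 (1-based) of /proc/<pid>/stat are utime/stime
    # in clock ticks.  Naive split() breaks if the comm field contains
    # spaces; good enough for a sketch.
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().split()
    utime, stime = int(fields[13]), int(fields[14])
    return rss_kb, utime + stime


def monitor(pid, interval=1.0, samples=60):
    """Sample a process repeatedly; return (peak_rss_kb, history)."""
    peak_rss = 0
    history = []
    for _ in range(samples):
        rss, cpu = sample(pid)
        peak_rss = max(peak_rss, rss)
        history.append((time.time(), rss, cpu))
        time.sleep(interval)
    return peak_rss, history
```

A real version would walk all job PIDs (e.g. from the job's cgroup), add scratch disk usage, and append each sample to a file for later plotting.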

Cheers,
  Sam

________________________________
Sam Hawarden
Assistant Research Fellow
Pathology Department
Dunedin School of Medicine
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Aravindh 
Sampathkumar <aravi...@fastmail.com>
Sent: Monday, 10 December 2018 02:39
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] CPU & memory usage summary for a job

Hi All.

I was wondering if anybody has thought of, or hacked together, a way to record 
CPU and memory consumption of a job over its entire duration and produce a 
summary of the usage pattern within that job?
Not just the MaxRSS and CPU time that already get reported for every job.

I'm thinking more along the lines of a chart of CPU utilisation, memory usage, 
and disk usage on a per-second basis, or something like that.

Asking because some of my users have no clue about the resource consumption of 
their jobs and just blindly ask for far more resources as the "safe" option. It 
would be a nice way for users to learn simple things, like: they asked for 8 
cores, but their job ran on just 1 core the entire time because a library they 
used is single-core limited.
We use cgroups for process accounting and for limiting each job's CPU and 
memory usage. We also use QoS to limit resource reservations at the user level.

--
  Aravindh Sampathkumar
  aravi...@fastmail.com
