Hi Aravindh,

For our small 3-node cluster I've hacked together a per-node Python script that collects current and peak CPU, memory, and scratch disk usage for all jobs running on the cluster and builds a fairly simple web page from that data. It shouldn't be hard to make it store those data points over time, then shove them through an R script to plot the usage:
https://github.com/shawarden/simple-web

Cheers,
Sam

________________________________
Sam Hawarden
Assistant Research Fellow
Pathology Department
Dunedin School of Medicine
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Aravindh Sampathkumar <aravi...@fastmail.com>
Sent: Monday, 10 December 2018 02:39
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] CPU & memory usage summary for a job

Hi All.

I was wondering if anybody has thought of, or hacked together, a way to record the CPU and memory consumption of a job over its entire duration and summarise the usage pattern within that job? Not the MaxRSS and CPU time that are already reported for every job; I'm thinking more of a chart of CPU utilisation, memory usage, and disk usage on a per-second basis, or something like that.

I'm asking because some of my users have no clue about the resource consumption of their jobs, and just blindly ask for far more resources as the "safe" option. It would be a nice way for users to learn simple things, for example that they asked for 8 cores but their job ran on just 1 core the entire time because a library they used is limited to a single core.

We use cgroups for process accounting and for limiting each job's CPU and memory usage. We also use QOS to limit resource reservations at the user level.

--
Aravindh Sampathkumar
aravi...@fastmail.com
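For anyone curious, the collect-and-track step Sam describes (current plus peak values per metric, with a time series kept for later plotting in R) could be sketched roughly like this. This is a hypothetical illustration, not the actual simple-web implementation; all names here are made up, and a real collector would pull the per-job values from cgroup accounting files or /proc:

```python
import csv
import time

class UsageTracker:
    """Track current and peak values per metric for one job.

    Hypothetical sketch of the idea, not the simple-web code: each call to
    sample() updates the current value, bumps the peak if needed, and appends
    a (timestamp, metric, value) row for later plotting.
    """

    def __init__(self):
        self.current = {}   # metric name -> latest sampled value
        self.peak = {}      # metric name -> highest value seen so far
        self.history = []   # (timestamp, metric, value) rows

    def sample(self, metric, value, ts=None):
        ts = time.time() if ts is None else ts
        self.current[metric] = value
        self.peak[metric] = max(self.peak.get(metric, value), value)
        self.history.append((ts, metric, value))

    def dump_csv(self, path):
        # One row per sample; trivial to load in R with read.csv() and plot.
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "metric", "value"])
            writer.writerows(self.history)
```

A per-node loop (or cron job) would call sample() once per second per job, and the web page would render current/peak while dump_csv() feeds the R plotting step.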