Would job profiling with HDF5 work as well? https://slurm.schedmd.com/hdf5_profile_user_guide.html
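In case it helps, the rough setup (sketched from memory of that guide; the paths and the 30-second interval below are only placeholders, check the docs for your version) looks something like this:

    # slurm.conf -- enable the HDF5 profiling plugin and periodic sampling
    AcctGatherProfileType=acct_gather_profile/hdf5
    JobAcctGatherFrequency=30            # sampling interval in seconds (placeholder)

    # acct_gather.conf -- where the per-node .h5 files get written
    ProfileHDF5Dir=/shared/slurm/profile_data   # placeholder; must be writable on every node
    ProfileHDF5Default=None                     # or Task, to profile all jobs by default

    # per job: ask for task-level profiling
    sbatch --profile=task job.sh

    # afterwards: merge the per-node files into one HDF5 file you can plot from
    sh5util -j <jobid> -o job_<jobid>.h5

The merged file contains the per-task CPU, memory and I/O time series, so a short script can turn it into the per-second charts Aravindh is after.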
Jacob

On Sun, Dec 9, 2018 at 4:17 PM Sam Hawarden <sam.hawar...@otago.ac.nz> wrote:
> Hi Aravindh
>
> For our small 3 node cluster I've hacked together a per-node python script
> that collects current and peak cpu, memory and scratch disk usage data on
> all jobs running on the cluster and builds a fairly simple web-page based
> on it. It shouldn't be hard to make it store those data points over time,
> then shove them through an R script to plot the usage:
>
> https://github.com/shawarden/simple-web
>
> Cheers,
> Sam
>
> ------------------------------
> Sam Hawarden
> Assistant Research Fellow
> Pathology Department
> Dunedin School of Medicine
> ------------------------------
> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Aravindh Sampathkumar <aravi...@fastmail.com>
> *Sent:* Monday, 10 December 2018 02:39
> *To:* slurm-users@lists.schedmd.com
> *Subject:* [slurm-users] CPU & memory usage summary for a job
>
> Hi All.
>
> I was wondering if anybody has thought of or hacked around a way to record
> CPU and memory consumption of a job during its entire duration and give a
> summary of the usage pattern within that job?
> Not the MaxRSS and CPU Time that already gets reported for every job.
>
> I'm thinking more like a chart of CPU utilisation, memory usage, and disk
> usage on a per-second basis or something like that.
>
> Asking because some of my users have no clue about the resource
> consumption of their jobs, and just blindly ask for way more resources as
> a "safe" option. It would be a nice way for users to know simple things like
> - they asked for 8 cores, but their job ran on just 1 core the entire time
> because a library they used is single-core limited.
> We use Cgroups for process accounting and limiting jobs' cpu and memory
> usage. We also use QoS for limiting resource reservations at user level.
>
> --
> Aravindh Sampathkumar
> aravi...@fastmail.com
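If rolling your own along the lines of Sam's script, the core of a per-node sampler can be very small. Below is a bare sketch (not the simple-web code; it assumes cgroup v1 with Slurm's usual slurm/uid_*/job_* hierarchy, and the paths and interval are placeholders you would adjust for your distro and cgroup version):

    #!/usr/bin/env python3
    # Minimal per-node sampler: append one CSV row per running job per interval.
    # Columns: unix timestamp, job id, memory usage (bytes), cumulative CPU time (ns).
    import csv
    import glob
    import os
    import re
    import time

    INTERVAL = 5                          # seconds between samples (placeholder)
    OUTFILE = "/var/log/job_usage.csv"    # placeholder output path

    def read_int(path):
        # Return the integer contents of a cgroup file, or None if unreadable.
        try:
            with open(path) as f:
                return int(f.read().strip())
        except (OSError, ValueError):
            return None

    def sample():
        rows = []
        now = int(time.time())
        for memdir in glob.glob("/sys/fs/cgroup/memory/slurm/uid_*/job_*"):
            jobid = re.search(r"job_(\d+)", memdir).group(1)
            rss = read_int(os.path.join(memdir, "memory.usage_in_bytes"))
            cpudir = memdir.replace("/memory/", "/cpuacct/")
            cputime_ns = read_int(os.path.join(cpudir, "cpuacct.usage"))
            rows.append([now, jobid, rss, cputime_ns])
        return rows

    if __name__ == "__main__":
        while True:
            with open(OUTFILE, "a", newline="") as f:
                csv.writer(f).writerows(sample())
            time.sleep(INTERVAL)

Differencing successive CPU-time values gives per-interval utilisation, which an R or gnuplot script can then turn into the sort of per-job charts discussed above.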