Hello John,

I am also keen to follow your progress, as this is something we would find extremely useful.
Regards,
Jason

On Fri, Sep 8, 2023 at 4:47 AM John Snowdon <john.snow...@newcastle.ac.uk> wrote:

> I've been needing to do this as part of some analysis work we are
> undertaking to determine requirements for a replacement system.
>
> We don't have anything structured in place currently to analyse Slurm
> data; lots of Grafana system-level metrics, but nothing to look at trends
> in useful metrics like:
>
> - Size (and age) of jobs sitting in the pending state
> - Average runtime of jobs
> - Workload sizing information such as cores/job and memory/core, plotted
> so that we can understand how our users are utilising the service
> - Demand (and utilisation) of particular partitions
>
> I couldn't find anything that was exactly what we wanted, so I spent a
> couple of afternoons last week putting something together in Python to
> wrap around sacct / sinfo output.
>
> So far I've got reports for what is happening 'now', as well as summaries
> for the following periods:
>
> 24 hours
> 7 days
> 30 days
> 1 year
>
> Data is analysed based on jobs running/pending/completed/failed during
> time windows and summarised in terms of sample periods per day (a 24-hour
> report having the finest sampling resolution, at 6x 10-minute windows per
> hour), and the output of each sample period is stored as a persistent
> JSON object on the filesystem in case the same report is run again, or
> that period is included as part of a larger analysis window.
>
> I output to flat HTML files using the Jinja2 templating module and
> visualise data using the ubiquitous Highcharts and DataTables JavaScript
> libraries.
>
> In our case we're more interested in things like:
>
> - Min/Max/Median cores/job, plus the lowest value which would satisfy X%
> of all jobs
> - Min/Max/Median memory/core, plus the lowest value which would satisfy
> X% of all jobs
> - Min/Max/Median nodes/job, plus the lowest value which would satisfy X%
> of all jobs
> - Backlog of jobs waiting in the pending state
> - Percentage of jobs that 'fail' (end up in some state other than
> completed)
> - Scatter chart of cores/job against memory/core (i.e. what is the bulk
> of our user workload: parallel/serial, low memory/high memory?)
>
> i.e. data points which will be useful in our sizing decisions for a
> replacement platform, both in terms of hardware and of partition
> definitions.
>
> When it's at a point where it is usable, I'm sure that we can share the
> code. It's pretty much self-contained; the only dependencies are Slurm
> and Python 3 - no web components needed (unless you want to serve the
> generated reports to users, of course).
>
> John Snowdon
> Advanced Computing Consultant
>
> Newcastle University IT Service
> The Elizabeth Barraclough Building
> 91 Sandyford Road
> Newcastle upon Tyne
> NE1 8HW

--
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms
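For anyone who wants to experiment along the same lines before John's code is shared, a minimal sketch of wrapping sacct output in Python could look like the following. The field list and function shape are illustrative assumptions, not John's actual implementation; the sacct flags themselves are standard:

    import csv
    import subprocess

    def sacct_jobs(start, end):
        """Fetch job records for a time window as a list of dicts.

        --parsable2 gives pipe-delimited output with no trailing
        delimiter, which is much easier to handle than the default
        human-readable table. Field names are standard sacct fields.
        """
        fields = "JobID,State,AllocCPUS,ReqMem,NNodes,Elapsed,Partition"
        out = subprocess.run(
            ["sacct", "-a", "-X",        # all users, allocations only (no job steps)
             "-S", start, "-E", end,     # window bounds, e.g. "2023-09-07T00:00:00"
             "-o", fields, "--parsable2", "--noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return list(csv.DictReader(out.splitlines(),
                                   fieldnames=fields.split(","),
                                   delimiter="|"))

sinfo offers a similar -o/--format plus --noheader interface for the partition side, so the same parsing approach should carry over.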
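John's point about persisting each sample window as a JSON object, so the same window can be reused when a report is run again or folded into a larger analysis, might be sketched like this (the cache directory and key format are made up for the example):

    import json
    import os

    def load_or_compute_window(key, compute):
        """Return the summary for one sample window, reusing a cached
        JSON object if that window has been analysed before (e.g. by a
        24-hour report whose windows a 7-day report also covers)."""
        path = os.path.join("cache", f"{key}.json")   # key e.g. "2023-09-07T10:00"
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        summary = compute()                 # the expensive sacct query + aggregation
        os.makedirs("cache", exist_ok=True)
        with open(path, "w") as f:
            json.dump(summary, f)
        return summary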
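The "lowest value which would satisfy X% of all jobs" figures are effectively percentiles of the per-job samples. A worked sketch, using a hypothetical helper rather than anything from John's code:

    def lowest_satisfying(values, pct):
        """Smallest v such that at least pct% of the sample is <= v.

        e.g. lowest_satisfying(cores_per_job, 95) answers "how many
        cores per node would cover 95% of our historical jobs?"
        """
        ordered = sorted(values)
        k = -(-len(ordered) * pct // 100)   # ceil(n * pct / 100)
        return ordered[max(k - 1, 0)]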
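Rendering the flat HTML reports with Jinja2, as John describes, needs only a few lines; the template name and context variables below are assumptions for illustration:

    from jinja2 import Environment, FileSystemLoader

    # per-window summaries, e.g. produced by the caching layer above
    samples = [{"window": "2023-09-07T10:00", "pending": 42, "running": 310}]

    env = Environment(loader=FileSystemLoader("templates"), autoescape=True)
    template = env.get_template("report.html.j2")   # hypothetical template name
    html = template.render(title="Slurm utilisation - last 7 days",
                           samples=samples)
    with open("report-7d.html", "w", encoding="utf-8") as f:
        f.write(html)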