[slurm-users] sacct runtime performance varies on job status codes

John Snowdon Fri, 01 Sep 2023 01:01:47 -0700

Hi,

I am attempting to pull some historical information from our HPC system to 
analyse trends in how our users have used it over time.

As part of this I am using sacct to make a number of queries for different job 
states (running, pending, completed, 'other') over particular time periods 
(hourly, daily, etc).
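
A minimal sketch of how such per-window queries could be scripted. The 
function names hour_windows and count_jobs are hypothetical, and the 
timestamps are written in the YYYY-MM-DDTHH:MM:SS form that sacct documents. 
Nothing here is specific to our site beyond the sacct options already used in 
this post, plus -n (no header) and -P (parsable output).

```shell
# Print one "START END" pair per hour of the given day (YYYY-MM-DD).
hour_windows() {
    day="$1"
    h=0
    while [ "$h" -lt 24 ]; do
        hh=$(printf '%02d' "$h")
        printf '%sT%s:00:00 %sT%s:59:59\n' "$day" "$hh" "$day" "$hh"
        h=$((h + 1))
    done
}

# Count allocations in the given state(s) within one window.
# $1 = comma-separated state list, $2 = start, $3 = end
count_jobs() {
    sacct -X -a -n -P -o JobID -S "$2" -E "$3" --state="$1" | wc -l
}
```

For example, the following would print an hourly count of running jobs for 
one day:

    hour_windows 2023-09-01 | while read -r s e; do
        echo "$s $(count_jobs R "$s" "$e")"
    done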

I have noticed that most of my sacct queries return in well under a second 
(around 0.1 seconds in the examples below), regardless of the number of rows 
returned (anywhere from none to several thousand).

However, there are two job state codes that result in a huge delay of between 
30 seconds and well over 1 minute, irrespective of the number of rows 
returned.

Any job status code in the list of R,CD,CA,DL,F,NF,PR,RS,RV,OOM,TO returns 
quickly, but PD and S queries are inordinately slow. Examples:

# Jobs in running state:

$ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=R | wc -l
sacct: Jobs RUNNING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
281

real    0m0.095s
user    0m0.032s
sys     0m0.012s

# Jobs with an 'abnormal' state:

$ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=CA,DL,F,NF,PR,RS,RV,OOM,TO | wc -l
sacct: Jobs CANCELLED,DEADLINE,FAILED,NODE_FAIL,PREEMPTED,PENDING,RESIZING,PENDING,REVOKED,OUT_OF_MEMORY,TIMEOUT in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
132

real    0m0.088s
user    0m0.033s
sys     0m0.014s

# ... but jobs in pending or suspended state:

$ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=PD | wc -l
sacct: Jobs PENDING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
2000

real    0m45.712s
user    0m0.041s
sys     0m0.013s

$ time sacct -X -v -p -a -S 2023-09-0100:00:00 -E 2023-09-0100:59:59 --state=S | wc -l
sacct: Jobs SUSPENDED in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
1

real    1m20.490s
user    0m0.033s
sys     0m0.006s

Our sacct version reports:

$ sacct -V
slurm 20.11.8

The current performance makes it impractical to analyse the size of the tail 
of pending jobs, which is one of the criteria we want to use to understand 
whether we are coping with user submission demand. In the examples above the 
pending query (45.7 seconds) is several hundred times slower than querying 
which jobs were running over the same period (0.095 seconds).
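
The size of the slowdown can be checked directly from the "real" figures that 
time printed above. This is only a small arithmetic sketch; the helper names 
to_seconds and slowdown are hypothetical, and awk is assumed to be available.

```shell
# Convert a bash `time` value such as "1m20.490s" to seconds.
to_seconds() {
    printf '%s\n' "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of a slow wall time to a fast one, rounded to a whole number.
# $1 = slow time in seconds, $2 = fast time in seconds
slowdown() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.0f\n", slow / fast }'
}

# PD query versus R query, using the "real" values quoted in this post.
slowdown "$(to_seconds 0m45.712s)" "$(to_seconds 0m0.095s)"
# S query versus R query.
slowdown "$(to_seconds 1m20.490s)" "$(to_seconds 0m0.095s)"
```

With the timings in this post the ratios come out at roughly 481x for PD and 
847x for S.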

Some things which I've observed:

- Use of start/end or the default time window doesn't matter
- Size of time window set by start/end doesn't matter
- Querying a single state or a comma-separated list of states doesn't matter 
(any query that leaves out PD and S is fast)
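
The per-state comparison can be repeated with a loop along these lines. This 
is a rough sketch rather than the exact commands I ran: each_state and 
time_state are hypothetical helper names, and `date +%s` (assumed to be 
supported by the local date command) only gives whole-second resolution, 
which is enough to tell a sub-second query from a 30+ second one.

```shell
# All state codes mentioned in this post.
STATES="R,CD,CA,DL,F,NF,PR,RS,RV,OOM,TO,PD,S"

# Split a comma-separated state list into one code per line.
each_state() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Run one sacct query for a single state and report row count and wall time.
# $1 = state code, $2 = start, $3 = end
time_state() {
    t0=$(date +%s)
    rows=$(sacct -X -a -n -P -o JobID -S "$2" -E "$3" --state="$1" | wc -l | tr -d ' ')
    t1=$(date +%s)
    printf '%s rows=%s wall=%ss\n' "$1" "$rows" "$((t1 - t0))"
}
```

Usage, for the same one-hour window as the examples above:

    each_state "$STATES" | while read -r st; do
        time_state "$st" 2023-09-01T00:00:00 2023-09-01T00:59:59
    done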

Is this likely to be behaviour of the sacct client, or is there a fundamental 
difference in the database schema that would make queries for S and PD jobs 
slower by two to three orders of magnitude?

John Snowdon
Advanced Computing Consultant

Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne, 
NE1 8HW
