Good morning,

We currently have at least one billed account whose researchers can submit jobs 
to our normal queue with fairshare, but not for an academic research purpose, 
so we'd like to calculate their CPU hours accurately. Right now we use a script 
that queries the database with sacct and sums ElapsedRaw * AllocCPUS over all 
jobs. This seems limited, though, because requeueing creates what the sacct man 
page calls duplicates. By default, jobs are only requeued for events outside 
the user's control, like a NODE_FAIL, or by a manual scontrol requeue; users 
can also requeue jobs themselves, but that's not a feature we've seen our 
researchers use.

However, with the new scrontab feature, whenever a cron entry executes more 
than once, sacct reports the previous runs as "requeued", and they are only 
visible by looking up duplicates. I haven't seen any billed account use 
requeueing or scrontab yet, but it's clear to me that this could become 
significant once researchers start using scrontab more. Scrontab has existed 
since one of the 2020 releases, I believe, but we only enabled it this year, 
and we see it as much more powerful than the traditional Linux crontab.

What would be the best way to calculate ElapsedRaw * AllocCPUS more 
thoroughly, accounting for duplicates but optionally ignoring unintentional 
requeues such as those caused by a NODE_FAIL?

Here's the main loop of the simple bash script I have now:

while IFS='|' read -r end elapsed cpus; do
    # If a job crosses the month boundary, the entire bill
    # is attributed to the second month (based on End time).
    year_month="${end:0:7}"
    # Skip rows with non-numeric fields (e.g. still-running jobs).
    if [[ ! "$elapsed" =~ ^[0-9]+$ ]] || [[ ! "$cpus" =~ ^[0-9]+$ ]]; then
        continue
    fi
    core_seconds["$year_month"]=$(( core_seconds["$year_month"] + elapsed * cpus ))
done < <(sacct -a -A "$SLURM_ACCOUNT" \
               -S "$START_DATE" \
               -E "$END_DATE" \
               -o End,ElapsedRaw,AllocCPUS -X -P --noheader)

Our slurmdbd is configured to keep 6 months of data.
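One variant I've sketched (untested against a requeue-heavy account, so take it 
with a grain of salt) adds State to the output and passes -D so duplicate 
instances are included in the sum, while skipping any instance whose State 
contains NODE_FAIL. This assumes the failed instance actually shows NODE_FAIL 
in its State column, which I'd want to verify; sum_core_seconds is just my 
name for the filter:

```shell
# Sum ElapsedRaw * AllocCPUS per month from "End|ElapsedRaw|AllocCPUS|State"
# rows, counting requeued duplicates but skipping NODE_FAIL instances.
sum_core_seconds() {
    awk -F'|' '
        # Only rows with numeric elapsed/cpus and a non-NODE_FAIL state.
        $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && $4 !~ /NODE_FAIL/ {
            month = substr($1, 1, 7)   # bill month taken from the End time
            total[month] += $2 * $3
        }
        END { for (m in total) print m "|" total[m] }
    '
}

# Intended producer (same filters as before, plus State and -D):
# sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
#       -o End,ElapsedRaw,AllocCPUS,State -X -P -D --noheader | sum_core_seconds
```

Keeping the awk filter as a standalone function also makes it easy to test on 
canned sacct output before trusting it for billing.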

It makes sense to loop through the job IDs instead, using sacct's 
-D/--duplicates option on each to reveal the hidden duplicates in the 
REQUEUED state, but I'm interested in alternatives or anything I might be 
missing here.
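As a first step toward that, something like the following could tell me which 
jobs in a billing window actually have duplicates, and whether a NODE_FAIL was 
involved, before deciding how to bill them. This is only a sketch on assumed 
"JobID|State" output from a -D query; flag_requeued_jobs is a name I made up:

```shell
# Given "JobID|State" lines from a duplicates-enabled (-D) sacct query,
# print each JobID that appears more than once, tagged "node_fail" if any
# of its instances ended in NODE_FAIL, otherwise "other".
flag_requeued_jobs() {
    awk -F'|' '
        { count[$1]++; if ($2 ~ /NODE_FAIL/) nodefail[$1] = 1 }
        END {
            for (id in count)
                if (count[id] > 1)
                    print id "|" (nodefail[id] ? "node_fail" : "other")
        }
    '
}

# Intended producer:
# sacct -a -A "$SLURM_ACCOUNT" -S "$START_DATE" -E "$END_DATE" \
#       -o JobID,State -X -P -D --noheader | flag_requeued_jobs
```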

Thanks,

Joseph


--------------------------------------------------------------
Joseph F. Guzman - ITS (Advanced Research Computing)

Northern Arizona University

joseph.f.guz...@nau.edu
