Just following up on my own message in case someone else is trying to figure 
out RawUsage and Fair Share.

I ran some additional tests, this time running jobs for 10 min instead of 1 
min. The procedure was:

1. Set the fair-share usage calculation to run every minute in slurm.conf

PriorityCalcPeriod=1

2. Reset the RawUsage stat

sacctmgr modify account luchko_group set RawUsage=0

3. Check the RawUsage every second

while sleep 1; do date; sshare -ao Account,User,RawShares,NormShares,RawUsage ; 
done > watch.out

4. Run a 10 min job. The billing weight per CPU is 1, so the total RawUsage 
should be 60,000 and RawUsage should increase by 6,000 each minute. (The whole 
procedure is also sketched as a single script below.)

sbatch --account=luchko_group --wrap="sleep 600" -p cpu -n 100
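
In case it helps anyone reproduce this, here is the same procedure rolled into 
one script. This is only a sketch of what I actually ran by hand; the account 
name, partition, and CPU count are specific to our cluster, so adjust them for 
your site.

#!/bin/bash
# Sketch of the RawUsage test above. Assumes PriorityCalcPeriod=1 is already
# set in slurm.conf and the controller has been reconfigured.

ACCOUNT=luchko_group    # our account; change for your site
PARTITION=cpu
NCPUS=100               # billing weight is 1 per CPU
RUNTIME=600             # 10 min, so expect NCPUS * RUNTIME = 60,000 RawUsage

# Reset the accumulated usage for the account (-i skips the confirmation).
sacctmgr -i modify account "$ACCOUNT" set RawUsage=0

# Submit the test job.
sbatch --account="$ACCOUNT" --wrap="sleep $RUNTIME" -p "$PARTITION" -n "$NCPUS"

# Poll sshare once a second; RawUsage should grow by 6,000 each minute.
# Stop with Ctrl-C once the job has finished.
while sleep 1; do
    date
    sshare -ao Account,User,RawShares,NormShares,RawUsage
done > watch.out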

Scanning the output file, I can see that the RawUsage does update once every 
minute. Below are the updates. (I've removed irrelevant output.)

Tue Sep 24 10:14:24 AM PDT 2024
Account User RawShares NormShares RawUsage
-------------------- ---------- ---------- ----------- -----------
luchko_group tluchko 100 0.500000 0

Tue Sep 24 10:14:25 AM PDT 2024
luchko_group tluchko 100 0.500000 4099
Tue Sep 24 10:15:24 AM PDT 2024
luchko_group tluchko 100 0.500000 10099
Tue Sep 24 10:16:25 AM PDT 2024
luchko_group tluchko 100 0.500000 16099
Tue Sep 24 10:17:24 AM PDT 2024
luchko_group tluchko 100 0.500000 22098
Tue Sep 24 10:18:25 AM PDT 2024
luchko_group tluchko 100 0.500000 28097
Tue Sep 24 10:19:24 AM PDT 2024
luchko_group tluchko 100 0.500000 34096
Tue Sep 24 10:20:25 AM PDT 2024
luchko_group tluchko 100 0.500000 40094
Tue Sep 24 10:21:24 AM PDT 2024
luchko_group tluchko 100 0.500000 46093
Tue Sep 24 10:22:25 AM PDT 2024
luchko_group tluchko 100 0.500000 52091
Tue Sep 24 10:23:24 AM PDT 2024
luchko_group tluchko 100 0.500000 58089
Tue Sep 24 10:24:25 AM PDT 2024
luchko_group 2000 0.133324 58087
Tue Sep 24 10:25:25 AM PDT 2024
luchko_group tluchko 100 0.500000 58085

So the RawUsage does increase by the expected amount each minute, and it does 
decay (I have the half-life set to 14 days). However, the usage for the final 
partial minute of the job, which should be 1901, is never recorded. I suspect 
this is because the job has already finished by the time the next accounting 
update runs.

For typical jobs that run for hours or days, this is a negligible error, but it 
does explain the results I got when I ran a 1 min job.
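
For anyone who wants to sanity-check the numbers, this is the arithmetic I'm 
relying on. The per-period decay factor is just my reading of the multifactor 
priority documentation (2^(-PriorityCalcPeriod/PriorityDecayHalfLife)), so 
treat that line as an assumption:

awk 'BEGIN {
    ncpus = 100; billing = 1            # billing weight of 1 per CPU
    per_min = ncpus * billing * 60      # expected RawUsage added per minute (6000)
    total = ncpus * billing * 600       # expected total for the 10 min job (60000)
    half_life = 14 * 24 * 60            # PriorityDecayHalfLife=14-0, in minutes
    decay = 2 ^ (-1 / half_life)        # per-minute decay factor, about 0.999966
    printf "per minute: %d  total: %d  decay per minute: %.6f\n", per_min, total, decay
    # Worst case, one partial calculation period of usage is lost:
    printf "lost tail: %.1f%% of a 10 min job, %.3f%% of a 24 h job\n", 100 * per_min / total, 100 * per_min / (ncpus * billing * 86400)
}'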

TRESRunMins is still not updating, but at this point that is only an 
inconvenience.

Tyler


On Thursday, September 19th, 2024 at 8:47 PM, tluchko via slurm-users 
<slurm-users@lists.schedmd.com> wrote:

> Hello,
>
> I'm hoping someone can offer some suggestions.
>
> I went ahead and started the database from scratch and reinitialized it to 
> see if that would help and to try to understand how RawUsage is calculated. I ran 
> two jobs of
>
> sbatch --account=luchko_group --wrap="sleep 60" -p cpu -n 100
>
> With the partition defined as
>
> PriorityFlags=MAX_TRES
> PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 
> State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
>
> I expected each job to contribute 6000 to the RawUsage, but one job 
> contributed 3100 and the other 2800. And TRESRunMins stayed at 0 for all 
> categories.
>
> I'm at a loss as to what is going on.
>
> Thank you,
>
> Tyler
>
>
> On Tuesday, September 10th, 2024 at 9:03 PM, tluchko <tluc...@protonmail.com> 
> wrote:
>
>> Hello,
>>
>> We have a new cluster and I'm trying to set up fairshare accounting. I'm 
>> trying to track CPU, MEM and GPU. It seems that billing for individual jobs 
>> is correct, but billing isn't being accumulated (TRESRunMin is always 0).
>>
>> In my slurm.conf, I think the relevant lines are
>>
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageTRES=gres/gpu
>> PriorityFlags=MAX_TRES
>>
>> PartitionName=gpu Nodes=node[1-7] MaxCPUsPerNode=384 MaxTime=7-0:00:00 
>> State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
>> PartitionName=cpu Nodes=node[1-7] MaxCPUsPerNode=182 MaxTime=7-0:00:00 
>> State=UP TRESBillingWeights="CPU=1.0,MEM=0.125G,GRES/gpu=9.6"
>>
>> I currently have one recently finished job and one running job. sacct gives
>>
>> $ sacct --format=JobID,JobName,ReqTRES%50,AllocTRES%50,TRESUsageInAve%50,TRESUsageInMax%50
>> JobID JobName ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax
>> ------------ ---------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
>> 154 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
>> 154.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1 cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+ cpu=00:00:00,energy=0,fs/disk=2480503,mem=3M,page+
>> 155 interacti+ billing=9,cpu=1,gres/gpu=1,mem=1G,node=1 billing=9,cpu=2,gres/gpu=1,mem=2G,node=1
>> 155.interac+ interacti+ cpu=2,gres/gpu=1,mem=2G,node=1
>>
>> billing=9 seems correct to me, since I have 1 GPU allocated and the GPU has 
>> the largest billing weight (9.6). However, sshare doesn't show anything in 
>> TRESRunMins
>>
>> sshare --format=Account,User,RawShares,FairShare,RawUsage,EffectvUsage,TRESRunMins%110
>> Account User RawShares FairShare RawUsage EffectvUsage TRESRunMins
>> -------------------- ---------- ---------- ---------- ----------- ------------- --------------------------------------------------------------------------------------------------------------
>> root 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
>> abrol_group 2000 0 0.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
>> luchko_group 2000 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
>>  luchko_group tluchko 1 0.333333 21589714 1.000000 cpu=0,mem=0,energy=0,node=0,billing=0,fs/disk=0,vmem=0,pages=0,gres/gpu=0,gres/gpumem=0,gres/gpuutil=0
>>
>> Why is TRESRunMins all 0 for tluchko, but RawUsage is not? I have checked and 
>> slurmdbd is running.
>>
>> Thank you,
>>
>> Tyler
>>