Hi Paul,

Thank you for your reply. Good to know that in your case you get consistent results. I had done a similar analysis, starting with a user taken from the accounting records:
sacct -X -u rsantos --starttime=2020-01-01 --endtime=now \
      -o jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
1473  gpu  tsrp  2020-12-23T22:37:46  2020-12-23T23:31:22    00:53:36  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1488  gpu  tsrp  2020-12-23T23:35:58  2020-12-23T23:37:51    00:01:53  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
1499  gpu  tsrp  2020-12-23T23:39:19  2020-12-23T23:44:21    00:05:02  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
2066  gpu  tsrp  2020-12-24T01:32:32  2020-12-25T08:01:43  1-06:29:11  billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
2993  gpu  tsrp  2020-12-29T22:36:13  2020-12-29T22:38:03    00:01:50  billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1

To prove that this user is the only one in this account with GPU usage I also ran the query per account:

sacct -X -A tsrp -a --starttime=2020-01-01 --endtime=now \
      -o user,jobid,part,account,start,end,elapsed,alloctres%80 | grep "gres/gpu"
rsantos  1473  gpu  tsrp  2020-12-23T22:37:46  2020-12-23T23:31:22    00:53:36  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos  1488  gpu  tsrp  2020-12-23T23:35:58  2020-12-23T23:37:51    00:01:53  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos  1499  gpu  tsrp  2020-12-23T23:39:19  2020-12-23T23:44:21    00:05:02  billing=8,cpu=8,gres/gpu=1,mem=2G,node=1
rsantos  2066  gpu  tsrp  2020-12-24T01:32:32  2020-12-25T08:01:43  1-06:29:11  billing=2,cpu=2,energy=16514193,gres/gpu=1,mem=512M,node=1
rsantos  2993  gpu  tsrp  2020-12-29T22:36:13  2020-12-29T22:38:03    00:01:50  billing=8,cpu=8,energy=12032,gres/gpu=1,mem=2G,node=1

This adds up to 1891 GPU-minutes (see the short awk tally below). Querying the association I can confirm this value:

scontrol -o show assoc_mgr | grep ^QOS=tsrp | grep -oP '(?<=GrpTRESMins=).[^ ]*'
cpu=24000000(8769901),mem=N(8687005243),energy=N(0),node=N(201275),billing=N(8769901),fs/disk=N(0),vmem=N(0),pages=N(0),fs/lustre=N(0),gres/gpu=N(1891),ic/ofed=N(0)

If I now use sreport I get a totally different number:

sreport -t minutes -T gres/gpu -nP cluster AccountUtilizationByUser \
        start=2020-01-01 end=now account=tsrp format=login,used
|62
rsantos|62

I cannot understand why this happens, or whether there is a scenario in which these two answers can be reconciled. Just to rule out a Slurm version issue I even upgraded to 20.11.4, to no avail!

Hope someone else can jump in here and give me some pointers!
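For reference, this is roughly how I arrive at those 1891 GPU-minutes from the accounting records. It is only a sketch: it assumes the gres/gpu=<n> form of AllocTRES shown above and uses sacct's ElapsedRaw field (elapsed time in seconds):

sacct -X -nP -u rsantos --starttime=2020-01-01 --endtime=now \
      -o elapsedraw,alloctres%80 | grep "gres/gpu" |
awk -F'|' '{
  # default to one GPU, then pull the real count out of gres/gpu=<n>
  n = 1
  if (match($2, /gres\/gpu=[0-9]+/))
    n = substr($2, RSTART + 9, RLENGTH - 9)
  # ElapsedRaw is in seconds: weight by GPU count, convert to minutes
  total += $1 * n / 60
}
END { printf "%d GPU-minutes\n", total }'   # prints 1891 for the jobs above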
Best Regards,

MAO

> On 12 Mar 2021, at 19:25, Paul Raines <rai...@nmr.mgh.harvard.edu> wrote:
>
> Very new to SLURM and have not used sreport before so I decided to
> try your searches myself to see what they do.
>
> I am running 20.11.3 and it seems to match the data for me for a very
> simple case I tested that I could "eyeball".
>
> Looking just at the day 2021-03-09 for user mu40 on account lcn:
>
> # sreport -t minutes -T CPU -nP cluster \
>     AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
>     account=lcn format=login,used
> |40333
> cx88|33835
> mu40|6498
>
> # sreport -t minutes -T gres/gpu -nP cluster \
>     AccountUtilizationByUser start='2021-03-09' end='2021-03-10' \
>     account=lcn format=login,used
> |13070
> cx88|9646
> mu40|3425
>
> # sacct --user=mu40 --starttime=2021-03-09 --endtime=2021-03-10 \
>     --account=lcn -o jobid,start,end,elapsed,alloctres%80
>
> JobID         Start                End                  Elapsed     AllocTRES
> ------------  -------------------  -------------------  ----------  --------------------------------------------
> 190682        2021-03-05T16:25:55  2021-03-12T09:20:52  6-16:54:57  billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.batch  2021-03-05T16:25:55  2021-03-12T09:20:53  6-16:54:58  cpu=3,gres/gpu=2,mem=24G,node=1
> 190682.exte+  2021-03-05T16:25:55  2021-03-12T09:20:52  6-16:54:57  billing=10,cpu=3,gres/gpu=2,mem=24G,node=1
> 201123        2021-03-09T14:55:20  2021-03-09T14:55:23  00:00:03    billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.exte+  2021-03-09T14:55:20  2021-03-09T14:55:23  00:00:03    billing=9,cpu=4,gres/gpu=1,mem=96G,node=1
> 201123.0      2021-03-09T14:55:20  2021-03-09T14:55:23  00:00:03    cpu=4,gres/gpu=1,mem=96G,node=1
> 201124        2021-03-09T14:55:29  2021-03-10T08:13:07  17:17:38    billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.exte+  2021-03-09T14:55:29  2021-03-10T08:13:07  17:17:38    billing=18,cpu=4,gres/gpu=1,mem=512G,node=1
> 201124.0      2021-03-09T14:55:29  2021-03-10T08:13:07  17:17:38    cpu=4,gres/gpu=1,mem=512G,node=1
>
> So the first job used all 24 hours of that day, the 2nd just 3 seconds
> (so ignore it) and the third about 9 hours and 5 minutes.
>
> CPU = 24*60*3 + (9*60+5)*4 = 6500
>
> GPU = 24*60*2 + (9*60+5)*1 = 3425
>
> -- Paul Raines (http://help.nmr.mgh.harvard.edu)
>
> On Thu, 11 Mar 2021 11:03pm, Miguel Oliveira wrote:
>
>> Dear all,
>>
>> Hope you can help me!
>> In our facility we support the users via projects that have time
>> allocations. Given this we use a simple bank facility developed by us along
>> the ideas of the old code https://jcftang.github.io/slurm-bank/.
>> Our implementation differs because we have a QOS per project with a NoDecay
>> flag. The basic commands used are:
>> - scontrol show assoc_mgr to read the limits,
>> - sacctmgr modify qos to modify the limits and,
>> - sreport to read individual usage.
>> We have been using this for a while in production without any single issues
>> for CPU time allocations.
>>
>> Now we need to implement GPU time allocation as well for our new GPU
>> partition.
>> While the 2 first commands work fine to set or change the limits with
>> gres/gpu we seem to get values with sreport that do not add up.
>> In this case we use:
>>
>> - command='sreport -t minutes -T gres/gpu -nP cluster
>> AccountUtilizationByUser start='+date_start+' end='+date_end+'
>> account='+account+' format=login,used'
>>
>> We have confirmed via the accounting records that the total reported via
>> scontrol show assoc_mgr is correct while the value given by sreport is
>> totally off.
>> Did I misunderstand the sreport man page and the command above is reporting
>> something else or is this a bug?
>> We do something similar with "-T cpu", for the CPU part of the code, and the
>> numbers match up.
>> We are using slurm 20.02.0.
>>
>> Best Regards,
>>
>> MAO
>>
>> ---
>> Miguel Afonso Oliveira
>> Laboratório de Computação Avançada | Laboratory for Advanced Computing
>> Universidade de Coimbra | University of Coimbra
>> T: +351239410681
>> E: miguel.olive...@uc.pt
>> W: www.uc.pt/lca
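P.S. In case it is useful context for how our bank consumes these numbers: we give each project QOS a GPU-minutes budget with sacctmgr and can cross-check the accrued usage in the association record, roughly as sketched below. The 24000 figure is purely illustrative, and the last grep assumes the one-line assoc_mgr format shown earlier (the value in parentheses is the accrued usage, "N" marks a limit that is not set):

# give the project QOS a GPU-minutes budget (illustrative value)
sacctmgr -i modify qos where name=tsrp set GrpTRESMins=gres/gpu=24000

# read back the GPU minutes accrued against that budget
scontrol -o show assoc_mgr | grep ^QOS=tsrp | \
  grep -oP '(?<=GrpTRESMins=)[^ ]*' | \
  grep -oP 'gres/gpu=[^(,]*\(\K[0-9]+'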