Hi all,

It looks like we can use the API to avoid having to manually parse the '2=' value from the stats{tres_usage_in_max} field.

I've submitted a bug report and patch:

https://bugs.schedmd.com/show_bug.cgi?id=6004

The minimal changes needed are in the attached seff.patch.

Hope that helps,
Paddy

On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote:
> Hi Miguel,
>
> this is because SchedMD changed the stats field. There is no rss_max
> any more; cf. line 225 of seff.
> You need to evaluate the field stats{tres_usage_in_max} and take the value
> after '2='. That is the memory value in bytes rather than kbytes, so it
> additionally needs to be divided by 1024.
>
> Best
> Marcus
>
> On 11/08/2018 11:06 AM, Miguel A. Sánchez wrote:
> >
> > Hi, thanks for all your answers, and sorry for the delay in replying.
> > Yesterday I installed Slurm-18.08.3 on the controller machine to check
> > whether the seff command works with this latest release. The behavior
> > has improved, but I still receive an error message:
> >
> > # /usr/local/slurm-18.08.3/bin/seff 1694112
> > Use of uninitialized value $lmem in numeric lt (<) at
> > /usr/local/slurm-18.08.3/bin/seff line 130, <DATA> line 624.
> > Job ID: 1694112
> > Cluster: XXXXX
> > User/Group: XXXXX
> > State: COMPLETED (exit code 0)
> > Nodes: 1
> > Cores per node: 2
> > CPU Utilized: 01:39:33
> > CPU Efficiency: 4266.43% of 00:02:20 core-walltime
> > Job Wall-clock time: 00:01:10
> > Memory Utilized: 0.00 MB (estimated maximum)
> > Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
> > [root@hydra ~]#
> >
> > Because of this, every job reports 0.00 MB as memory utilized.
> >
> > With slurm-17.11.1 it works fine:
> >
> > # /usr/local/slurm-17.11.0/bin/seff 1694112
> > Job ID: 1694112
> > Cluster: XXXXX
> > User/Group: XXXXX
> > State: COMPLETED (exit code 0)
> > Nodes: 1
> > Cores per node: 2
> > CPU Utilized: 01:39:33
> > CPU Efficiency: 4266.43% of 00:02:20 core-walltime
> > Job Wall-clock time: 00:01:10
> > Memory Utilized: 2.44 GB
> > Memory Efficiency: 62.57% of 3.91 GB
> > [root@hydra bin]#
> >
> > Miguel A. Sánchez Gómez
> > System Administrator
> > Research Programme on Biomedical Informatics - GRIB (IMIM-UPF)
> >
> > Barcelona Biomedical Research Park (office 4.80)
> > Doctor Aiguader 88 | 08003 Barcelona (Spain)
> > Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550
> > e-mail:miguelangel.sanc...@upf.edu
> >
> > On 11/06/2018 06:30 PM, Mike Cammilleri wrote:
> > >
> > > Thanks for this. We'll try the workaround script. It is not
> > > mission-critical, but our users have gotten accustomed to seeing
> > > these metrics at the end of each run and it's nice to have. We are
> > > currently doing this in a test VM environment, so by the time we
> > > actually upgrade the cluster perhaps the fix will be available.
> > >
> > > Mike Cammilleri
> > >
> > > Systems Administrator
> > >
> > > Department of Statistics | UW-Madison
> > >
> > > 1300 University Ave | Room 1280
> > > 608-263-6673 | mi...@stat.wisc.edu
> > >
> > > ------------------------------------------------------------------------
> > > *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on
> > > behalf of Chris Samuel <ch...@csamuel.org>
> > > *Sent:* Tuesday, November 6, 2018 5:03 AM
> > > *To:* slurm-users@lists.schedmd.com
> > > *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1
> > > On 6/11/18 7:49 pm, Baker D.J. wrote:
> > >
> > > > The good news is that I am assured by SchedMD that the bug has
> > > > been fixed in v18.08.3.
> > >
> > > Looks like it's fixed in this commit.
> > >
> > > commit 3d85c8f9240542d9e6dfb727244e75e449430aac
> > > Author: Danny Auble <d...@schedmd.com>
> > > Date:   Wed Oct 24 14:10:12 2018 -0600
> > >
> > >     Handle symbol resolution errors in the 18.08 slurmdbd.
> > >
> > >     Caused by b1ff43429f6426c when moving the slurmdbd agent internals.
> > >
> > >     Bug 5882.
> > >
> > > > Having said that we will probably live with this issue
> > > > rather than disrupt users with another upgrade so soon.
> > >
> > > An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though,
> > > should it? We just flip a symlink and the users see the new binaries,
> > > libraries, etc. immediately; we can then restart daemons as and when we
> > > need to (in the right order of course: slurmdbd, slurmctld and then
> > > the slurmd's).
> > >
> > > All the best,
> > > Chris
> > > --
> > > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
> > >
> >
>
> --
> Marcus Wagner, Dipl.-Inf.
>
> IT Center
> Abteilung: Systeme und Betrieb
> RWTH Aachen University
> Seffenter Weg 23
> 52074 Aachen
> Tel: +49 241 80-24383
> Fax: +49 241 80-624383
> wag...@itc.rwth-aachen.de
> www.itc.rwth-aachen.de
>

--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/

--- a/contribs/seff/seff
+++ b/contribs/seff/seff
@@ -126,10 +126,13 @@
 for my $step (@{$job->{'steps'}}) {
     $tot_cpu_sec += $step->{'tot_cpu_sec'};
     $tot_cpu_usec += $step->{'tot_cpu_usec'};
-    my $lmem = $step->{'stats'}{'rss_max'};
-    if ($mem < $lmem) {
-        $mem = $lmem;
-        $ntasks = $step->{'ntasks'};
+    if (exists $step->{'stats'} && exists $step->{'stats'}{'tres_usage_in_max'}) {
+        my $lmem = Slurmdb::find_tres_count_in_string($step->{'stats'}{'tres_usage_in_max'}, TRES_MEM);
+
+        if ($mem < $lmem) {
+            $mem = $lmem;
+            $ntasks = $step->{'ntasks'};
+        }
     }
 }
 my $cput = $tot_cpu_sec + int(($tot_cpu_usec / 1000000) + 0.5);
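
For comparison, the manual parsing that Marcus describes (take the value after '2=' and divide by 1024) might look roughly like the sketch below. It is written in Python rather than seff's Perl purely as an illustration, and it assumes the field is a comma-separated list of id=count pairs with the memory entry under id 2, in bytes. The function names and the sample string are hypothetical and are not part of seff or the Slurm API.

```python
from typing import Optional

# Assumption taken from the thread: the memory value follows '2='.
TRES_MEM_ID = 2


def find_tres_count(tres_str: str, tres_id: int) -> Optional[int]:
    """Return the count for tres_id in an 'id=count,id=count' string, or None."""
    for pair in tres_str.split(","):
        key, sep, value = pair.partition("=")
        # Compare the whole key so that id 2 does not match '12=' or '20='.
        if sep and key.strip() == str(tres_id):
            return int(value)
    return None


def max_rss_kbytes(tres_usage_in_max: str) -> int:
    """Memory high-water mark in kbytes; 0 if the field has no memory entry."""
    mem_bytes = find_tres_count(tres_usage_in_max, TRES_MEM_ID)
    # The field holds bytes, while the old rss_max was in kbytes.
    return 0 if mem_bytes is None else mem_bytes // 1024


# Hypothetical sample value, not taken from a real job record.
sample = "1=0,2=2621440000,6=1048576"
print(max_rss_kbytes(sample))
```

Returning 0 when the memory entry is missing mirrors the guard in the patch, which skips steps that have no tres_usage_in_max instead of comparing against an undefined value.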