Hi Chris,

We're seeing it on 18.08.3, so I was hoping that it was fixed in 18.08.4 (recently upgraded from 17.02 to 18.08.3). Note that we're seeing it in regular jobs (haven't tested job arrays).
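For reference, a minimal way to see the mismatch on a regular job is a plain sacct query along these lines (standard format fields only; the job ID is a placeholder):

  # UserCPU should not exceed roughly ReqCPUS * Elapsed for a CPU-bound job
  sacct -j <jobid> --format=JobID,ReqCPUS,UserCPU,Elapsed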
I think it's cgroups-related; there's a similar bug here:

  https://bugs.schedmd.com/show_bug.cgi?id=6095

I was hoping that this note in the 18.08.4 NEWS might have been related:

  -- Fix jobacct_gather/cgroup to work correctly when more than one task is
     started on a node.

Thanks,
Paddy

On Fri, Jan 04, 2019 at 03:19:18PM +0000, Christopher Benjamin Coffey wrote:
> I'm surprised no one else is seeing this issue? I wonder if you have 18.08
> you can take a moment and run jobeff on a job in one of your users job
> arrays. I'm guessing jobeff will show the same issue as we are seeing. The
> issue is that usercpu is incorrect, and off by many orders of magnitude.
> 
> Best,
> Chris
> 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
> On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" <chris.cof...@nau.edu>
> wrote:
> 
> So this issue is occurring only with job arrays.
> 
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
> On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl
> Nelson" <slurm-users-boun...@lists.schedmd.com on behalf of
> chance-nel...@nau.edu> wrote:
> 
> Hi folks,
> 
> calling sacct with the usercpu flag enabled seems to provide cpu
> times far above expected values for job array indices. This is also reported
> by seff. For example, executing the following job script:
> ________________________________________________________
> 
> #!/bin/bash
> #SBATCH --job-name=array_test
> #SBATCH --workdir=/scratch/cbn35/bigdata
> #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
> #SBATCH --time=20:00
> #SBATCH --array=1-5
> #SBATCH -c2
> 
> srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
> 
> ________________________________________________________
> 
> ...results in the following stats:
> ________________________________________________________
> 
>        JobID  ReqCPUS    UserCPU  Timelimit    Elapsed
> ------------ -------- ---------- ---------- ----------
> 15730924_5          2   02:30:14   00:20:00   00:01:08
> 15730924_5.+        2  00:00.004              00:01:08
> 15730924_5.+        2   00:00:00              00:01:09
> 15730924_5.0        2   02:30:14              00:01:05
> 15730924_1          2   02:30:48   00:20:00   00:01:08
> 15730924_1.+        2  00:00.013              00:01:08
> 15730924_1.+        2   00:00:00              00:01:09
> 15730924_1.0        2   02:30:48              00:01:05
> 15730924_2          2   02:15:52   00:20:00   00:01:07
> 15730924_2.+        2  00:00.007              00:01:07
> 15730924_2.+        2   00:00:00              00:01:07
> 15730924_2.0        2   02:15:52              00:01:06
> 15730924_3          2   02:30:20   00:20:00   00:01:08
> 15730924_3.+        2  00:00.010              00:01:08
> 15730924_3.+        2   00:00:00              00:01:09
> 15730924_3.0        2   02:30:20              00:01:05
> 15730924_4          2   02:30:26   00:20:00   00:01:08
> 15730924_4.+        2  00:00.006              00:01:08
> 15730924_4.+        2   00:00:00              00:01:09
> 15730924_4.0        2   02:30:25              00:01:05
> 
> ________________________________________________________
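(A quick back-of-the-envelope check on the numbers above, shell arithmetic only: a 2-core task that ran for about 68 seconds can accrue at most around two minutes of CPU time, not two and a half hours.)

  # Upper bound for a CPU-bound job: ReqCPUS * Elapsed = 2 * 68 s
  echo $(( 2 * 68 ))                  # 136 s, i.e. about 00:02:16
  # Reported UserCPU for 15730924_5:  02:30:14
  echo $(( 2*3600 + 30*60 + 14 ))     # 9014 s, roughly 66x higher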
> This is also reported by seff, with several errors to boot:
> ________________________________________________________
> 
> Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
> Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
> Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
> Job ID: 15730924
> Array Job ID: 15730924_5
> Cluster: monsoon
> User/Group: cbn35/clusterstu
> State: COMPLETED (exit code 0)
> Nodes: 1
> Cores per node: 2
> CPU Utilized: 03:19:15
> CPU Efficiency: 8790.44% of 00:02:16 core-walltime
> Job Wall-clock time: 00:01:08
> Memory Utilized: 0.00 MB (estimated maximum)
> Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)
> 
> ________________________________________________________
> 
> As far as I can tell, I don't think a two core job with an elapsed
> time of around one minute would have a cpu time of two hours. Could this be a
> configuration issue, or is it a possible bug?
> 
> More info is available on request, and any help is appreciated!

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/