Hi Chris,

We're seeing it on 18.08.3, so I was hoping that it was fixed in 18.08.4
(recently upgraded from 17.02 to 18.08.3). Note that we're seeing it in
regular jobs (haven't tested job arrays).
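
For anyone wanting to check their own jobs, a sacct call along these lines
(the job ID is just a placeholder) should reproduce the same fields as in
Chance's output further down the thread:

  sacct -j <jobid> --format=JobID,ReqCPUS,UserCPU,Timelimit,Elapsed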

I think it's cgroups-related; there's a similar bug here:

https://bugs.schedmd.com/show_bug.cgi?id=6095

I was hoping that this note in the 18.08.4 NEWS might have been related:

-- Fix jobacct_gather/cgroup to work correctly when more than one task is
   started on a node.
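
For anyone comparing setups, something like this should show which accounting
and task-tracking plugins a site is running, in case the problem is limited
to the cgroup-based configurations:

  scontrol show config | grep -Ei 'JobAcctGatherType|ProctrackType|TaskPlugin'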

Thanks,
Paddy

On Fri, Jan 04, 2019 at 03:19:18PM +0000, Christopher Benjamin Coffey wrote:

> I'm surprised no one else is seeing this issue. If you're running 18.08,
> could you take a moment and run jobeff on a job in one of your users' job
> arrays? I'm guessing jobeff will show the same issue we are seeing: the
> UserCPU value is incorrect, off by many orders of magnitude.
> 
> Best,
> Chris
> 
>
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
>  
> 
> On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" <chris.cof...@nau.edu>
> wrote:
> 
>     So this issue is occurring only with job arrays.
>     
>
>     Christopher Coffey
>     High-Performance Computing
>     Northern Arizona University
>     928-523-1167
>      
>     
>     On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson"
>     <slurm-users-boun...@lists.schedmd.com on behalf of chance-nel...@nau.edu>
>     wrote:
>     
>         Hi folks,
>         
>         
>         Calling sacct with the UserCPU field enabled seems to report CPU
>         times far above expected values for job array indices. The same
>         values are reported by seff. For example, executing the following
>         job script:
>         ________________________________________________________
>         
>         
>         #!/bin/bash
>         #SBATCH --job-name=array_test                   
>         #SBATCH --workdir=/scratch/cbn35/bigdata          
>         #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
>         #SBATCH --time=20:00  
>         #SBATCH --array=1-5
>         #SBATCH -c2
>         
>         
>         srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
>         
>         
>         
>         ________________________________________________________
>         
>         
>         ...results in the following stats:
>         ________________________________________________________
>         
>         
>         
>                JobID  ReqCPUS    UserCPU  Timelimit    Elapsed 
>         ------------ -------- ---------- ---------- ---------- 
>         15730924_5          2   02:30:14   00:20:00   00:01:08 
>         15730924_5.+        2  00:00.004              00:01:08 
>         15730924_5.+        2   00:00:00              00:01:09 
>         15730924_5.0        2   02:30:14              00:01:05 
>         15730924_1          2   02:30:48   00:20:00   00:01:08 
>         15730924_1.+        2  00:00.013              00:01:08 
>         15730924_1.+        2   00:00:00              00:01:09 
>         15730924_1.0        2   02:30:48              00:01:05 
>         15730924_2          2   02:15:52   00:20:00   00:01:07 
>         15730924_2.+        2  00:00.007              00:01:07 
>         15730924_2.+        2   00:00:00              00:01:07 
>         15730924_2.0        2   02:15:52              00:01:06 
>         15730924_3          2   02:30:20   00:20:00   00:01:08 
>         15730924_3.+        2  00:00.010              00:01:08 
>         15730924_3.+        2   00:00:00              00:01:09 
>         15730924_3.0        2   02:30:20              00:01:05 
>         15730924_4          2   02:30:26   00:20:00   00:01:08 
>         15730924_4.+        2  00:00.006              00:01:08 
>         15730924_4.+        2   00:00:00              00:01:09 
>         15730924_4.0        2   02:30:25              00:01:05 
>         
>         
>         
>         ________________________________________________________
>         
>         
>         This is also reported by seff, with several errors to boot:
>         ________________________________________________________
>         
>         
>         
>         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
>         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
>         Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
>         Job ID: 15730924
>         Array Job ID: 15730924_5
>         Cluster: monsoon
>         User/Group: cbn35/clusterstu
>         State: COMPLETED (exit code 0)
>         Nodes: 1
>         Cores per node: 2
>         CPU Utilized: 03:19:15
>         CPU Efficiency: 8790.44% of 00:02:16 core-walltime
>         Job Wall-clock time: 00:01:08
>         Memory Utilized: 0.00 MB (estimated maximum)
>         Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)
>         
>         
>         
>         ________________________________________________________
>         
>         
>         
>         
>         
>         As far as I can tell, a two-core job with an elapsed time of around
>         one minute should not have a CPU time of over two hours. Could this
>         be a configuration issue, or is it a possible bug?
>         
>         
>         More info is available on request, and any help is appreciated!
>         
>         
>         
>         
>         
>     
>     
> 

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
