I'm surprised no one else is seeing this issue. If you are running 18.08, could 
you take a moment and run jobeff on a job in one of your users' job arrays? I'm 
guessing jobeff will show the same issue we are seeing: usercpu is incorrect, 
and off by nearly two orders of magnitude.


Christopher Coffey
High-Performance Computing
Northern Arizona University

On 12/21/18, 2:41 PM, "Christopher Benjamin Coffey" <chris.cof...@nau.edu> wrote:

    So this issue is occurring only with job arrays.
    On 12/21/18, 12:15 PM, "slurm-users on behalf of Chance Bryce Carl Nelson" <slurm-users-boun...@lists.schedmd.com on behalf of chance-nel...@nau.edu> wrote:
        Hi folks,
        Calling sacct with the UserCPU field selected seems to report CPU times
        far above the expected values for job array tasks. This is also
        reported by seff. For example, executing the following job script:
        #!/bin/bash
        #SBATCH --job-name=array_test
        #SBATCH --workdir=/scratch/cbn35/bigdata
        #SBATCH --output=/scratch/cbn35/bigdata/logs/job_%A_%a.log
        #SBATCH --time=20:00
        #SBATCH --array=1-5
        #SBATCH -c2
        srun stress -c 2 -m 1 --vm-bytes 500M --timeout 65s
        ...results in the following stats:
               JobID  ReqCPUS    UserCPU  Timelimit    Elapsed 
        ------------ -------- ---------- ---------- ---------- 
        15730924_5          2   02:30:14   00:20:00   00:01:08 
        15730924_5.+        2  00:00.004              00:01:08 
        15730924_5.+        2   00:00:00              00:01:09 
        15730924_5.0        2   02:30:14              00:01:05 
        15730924_1          2   02:30:48   00:20:00   00:01:08 
        15730924_1.+        2  00:00.013              00:01:08 
        15730924_1.+        2   00:00:00              00:01:09 
        15730924_1.0        2   02:30:48              00:01:05 
        15730924_2          2   02:15:52   00:20:00   00:01:07 
        15730924_2.+        2  00:00.007              00:01:07 
        15730924_2.+        2   00:00:00              00:01:07 
        15730924_2.0        2   02:15:52              00:01:06 
        15730924_3          2   02:30:20   00:20:00   00:01:08 
        15730924_3.+        2  00:00.010              00:01:08 
        15730924_3.+        2   00:00:00              00:01:09 
        15730924_3.0        2   02:30:20              00:01:05 
        15730924_4          2   02:30:26   00:20:00   00:01:08 
        15730924_4.+        2  00:00.006              00:01:08 
        15730924_4.+        2   00:00:00              00:01:09 
        15730924_4.0        2   02:30:25              00:01:05 
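A quick sanity check on the table above: user CPU time for a task cannot exceed ReqCPUS x Elapsed. The following is a minimal sketch, assuming the job-level rows are copied by hand from the sacct output; the helper names `to_seconds` and `overshoot` are hypothetical and not part of Slurm.

```python
def to_seconds(stamp: str) -> float:
    """Convert a sacct-style duration to seconds.

    Handles [DD-]HH:MM:SS and MM:SS.mmm, the forms seen in the table.
    """
    days = 0
    if "-" in stamp:
        day_part, stamp = stamp.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in stamp.split(":")]
    # Pad missing leading fields (e.g. MM:SS.mmm has no hours).
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# (JobID, ReqCPUS, UserCPU, Elapsed) copied from the job-level rows above.
ROWS = [
    ("15730924_5", 2, "02:30:14", "00:01:08"),
    ("15730924_1", 2, "02:30:48", "00:01:08"),
    ("15730924_2", 2, "02:15:52", "00:01:07"),
    ("15730924_3", 2, "02:30:20", "00:01:08"),
    ("15730924_4", 2, "02:30:26", "00:01:08"),
]


def overshoot(req_cpus: int, user_cpu: str, elapsed: str) -> float:
    """Ratio of reported UserCPU to the ceiling ReqCPUS * Elapsed."""
    return to_seconds(user_cpu) / (req_cpus * to_seconds(elapsed))


for job_id, cpus, user_cpu, elapsed in ROWS:
    ratio = overshoot(cpus, user_cpu, elapsed)
    flag = "SUSPECT" if ratio > 1 else "ok"
    print(f"{job_id}: UserCPU is {ratio:.1f}x the ceiling [{flag}]")
```

Any ratio above 1 is physically impossible for a correctly accounted job, so every array task listed here would be flagged.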
        This is also reported by seff, with several errors to boot:
        Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
        Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
        Use of uninitialized value $lmem in numeric lt (<) at /usr/bin/seff line 130, <DATA> line 624.
        Job ID: 15730924
        Array Job ID: 15730924_5
        Cluster: monsoon
        User/Group: cbn35/clusterstu
        State: COMPLETED (exit code 0)
        Nodes: 1
        Cores per node: 2
        CPU Utilized: 03:19:15
        CPU Efficiency: 8790.44% of 00:02:16 core-walltime
        Job Wall-clock time: 00:01:08
        Memory Utilized: 0.00 MB (estimated maximum)
        Memory Efficiency: 0.00% of 1.95 GB (1000.00 MB/core)
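The efficiency figure in the seff output can be reproduced by hand. This sketch assumes the percentage is CPU Utilized divided by (cores x wall-clock time); that formula is inferred from the printed numbers rather than taken from the seff source, and `hms_to_seconds` is a hypothetical helper.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


cores = 2                                    # "Cores per node: 2" on one node
wall = hms_to_seconds("00:01:08")            # Job Wall-clock time
cpu_utilized = hms_to_seconds("03:19:15")    # CPU Utilized

core_walltime = cores * wall                 # seconds of CPU actually available
efficiency = 100 * cpu_utilized / core_walltime

print(f"core-walltime: {core_walltime} s")
print(f"CPU efficiency: {efficiency:.2f}%")
```

Under that assumption the arithmetic matches the reported value, which suggests seff is computing correctly from an inflated CPU time in the accounting data rather than introducing the error itself.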
        As far as I can tell, a two-core job with an elapsed time of about one
        minute cannot accumulate two and a half hours of user CPU time; the
        ceiling is roughly 2 x 68 s = 136 s. Could this be a configuration
        issue, or is it a possible bug?
        More info is available on request, and any help is appreciated!
