On Sun, 2018-04-29 at 15:20 -0700, Simon Matthews wrote:
> What exactly is the ru_utime that qacct reports? I have a job that is
> killed, after it hits the wallclock time limit of 2 hours, but
> ru_utime is less than 1 second. We are using SoGE 8.1.8.

Hi Simon,

[sorry this is delayed - I thought I'd sent this days ago, but just
noticed it sitting in the drafts folder.  But that allows me to add a
bit based on someone else's post :-).]

I recently had to answer this question for a user on our cluster.
Here's my understanding of the output...

> 
> 
> qname        SmallTestcases
> hostname     h7-c6-64-1.sj.bps
> group        blue
> owner        build
> project      NONE
> department   defaultdepartment
> jobname      R-Clean-ExternalTC-Linux64-CentOS6-2018.2.46224
> jobnumber    3741169
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Fri Apr 27 23:37:38 2018
> start_time   Sat Apr 28 01:01:13 2018
> end_time     Sat Apr 28 03:01:21 2018
> granted_pe   NONE
> slots        1
> failed       100 : assumedly after job
> exit_status  152                  (CPU time limit exceeded)
> ru_wallclock 7208s

Wall clock is how much real world time passed while your job was
running.  This agrees with the start and end times above.

> ru_utime     0.063s

This is the userspace CPU time of the script/program that is controlled
directly by the execd, and of any threads it might have spawned.

> ru_stime     0.428s

This is the amount of CPU time that script/program spent on system
calls.  For most single process jobs, ru_utime and ru_stime will add up
to the total CPU time your job used.

But wait - there's more!  This is only for the main process and any
threads/processes it spawns using methods that are tightly integrated
into GridEngine.  See below.
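These two fields come straight from the kernel's getrusage() accounting,
which you can poke at yourself.  Here's a minimal Python sketch (not
anything SGE-specific - just the same counters the shepherd reads for
the job it controls):

```python
import resource
import time

# Burn a little userspace CPU so ru_utime is visibly non-zero.
t_end = time.process_time() + 0.2
while time.process_time() < t_end:
    pass

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"ru_utime: {usage.ru_utime:.3f}s")  # CPU time spent in userspace
print(f"ru_stime: {usage.ru_stime:.3f}s")  # CPU time spent in system calls
```

For a simple single-process job like this, ru_utime + ru_stime is the
total CPU time the process consumed.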

> ru_maxrss    5.398KB
> ru_ixrss     0.000B
> ru_ismrss    0.000B
> ru_idrss     0.000B
> ru_isrss     0.000B
> ru_minflt    22744
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   3296
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     765
> ru_nivcsw    155
> cpu          7200.360s

This is the total CPU time (both userspace and system) for all
processes in your job's process groups.  That includes the process
directly controlled by execd, plus any helper processes spawned by
methods that are not tightly integrated into SGE.

We experienced a similar confusion earlier this month with some Matlab
jobs, using Matlab's parallel processing toolkit.  It appears that if
the program controlled by execd spawns additional helper processes
using methods GridEngine doesn't understand, then the CPU time for
those helper processes is included here, but is not included in
ru_utime/ru_stime.

I'm not entirely clear on which process-spawning methods are not
tightly integrated.  But this explains why the total CPU time your job
used is not reflected in ru_utime/ru_stime.
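You can see the underlying mechanism with plain getrusage() semantics:
a parent's RUSAGE_SELF never includes its children's CPU time, only
RUSAGE_CHILDREN does (and only for children it has waited on).  A
hedged little demonstration, nothing SGE-specific:

```python
import resource
import subprocess
import sys

# Spawn a child process that burns roughly 0.3s of CPU, then wait on it.
subprocess.run(
    [sys.executable, "-c",
     "import time\n"
     "t = time.process_time() + 0.3\n"
     "while time.process_time() < t: pass"],
    check=True)

self_usage = resource.getrusage(resource.RUSAGE_SELF)
child_usage = resource.getrusage(resource.RUSAGE_CHILDREN)
print(f"parent ru_utime:   {self_usage.ru_utime:.3f}s")   # stays near zero
print(f"children ru_utime: {child_usage.ru_utime:.3f}s")  # ~0.3s shows up here
```

If helper processes detach so that nobody ever waits on them, their CPU
time never lands in anyone's ru_* counters at all - which is presumably
why SGE has to fall back on per-process-group accounting for the
overall "cpu" figure.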

As Mike Serkov suggested, you can monitor the job on the node where
it's running to see more about what it's actually up to.  Something is
taking up 7200 seconds of CPU time, which is why you're bumping into
the CPU time limit.

Also see this previous conversation:
    http://gridengine.org/pipermail/users/2014-January/007001.html
for some additional details.

                                        Cheers,
                                                Chris

> mem          606.323GBs
> io           0.005GB
> iow          0.000s
> maxvmem      129.176MB
> arid         undefined
> ar_sub_time  undefined
> category     -U execd -q SmallTestcases -l mem_free=1G
> 
> Simon
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
