On Sun, 2018-04-29 at 15:20 -0700, Simon Matthews wrote:
> What exactly is the ru_utime that qacct reports? I have a job that is
> killed, after it hits the wallclock time limit of 2 hours, but
> ru_utime is less than 1 second. We are using SoGE 8.1.8.
Hi Simon,

[Sorry this is delayed - I thought I'd sent this days ago, but just
noticed it sitting in the drafts folder. That does allow me to add a
bit based on someone else's post :-).]

I recently had to answer this question for a user on our cluster.
Here's my understanding of the output...

> qname         SmallTestcases
> hostname      h7-c6-64-1.sj.bps
> group         blue
> owner         build
> project       NONE
> department    defaultdepartment
> jobname       R-Clean-ExternalTC-Linux64-CentOS6-2018.2.46224
> jobnumber     3741169
> taskid        undefined
> account       sge
> priority      0
> qsub_time     Fri Apr 27 23:37:38 2018
> start_time    Sat Apr 28 01:01:13 2018
> end_time      Sat Apr 28 03:01:21 2018
> granted_pe    NONE
> slots         1
> failed        100 : assumedly after job
> exit_status   152 (CPU time limit exceeded)
> ru_wallclock  7208s

Wall clock is how much real-world time passed while your job was
running. This agrees with the start and end times above.

> ru_utime      0.063s

This is the userspace CPU time of the script/program that is
controlled directly by the execd, plus any threads it might have
spawned.

> ru_stime      0.428s

This is the amount of CPU time that script/program spent in system
calls. For most single-process jobs, ru_utime and ru_stime will add
up to the total CPU time your job used.

But wait - there's more! These two numbers cover only the main
process and any threads/processes it spawns using methods that are
tightly integrated into GridEngine. See below.

> ru_maxrss     5.398KB
> ru_ixrss      0.000B
> ru_ismrss     0.000B
> ru_idrss      0.000B
> ru_isrss      0.000B
> ru_minflt     22744
> ru_majflt     0
> ru_nswap      0
> ru_inblock    0
> ru_oublock    3296
> ru_msgsnd     0
> ru_msgrcv     0
> ru_nsignals   0
> ru_nvcsw      765
> ru_nivcsw     155
> cpu           7200.360s

This is the total CPU time (both userspace and system) for all
processes in your job's process groups. That includes the one
directly controlled by execd, plus any helper processes spawned by
methods that are not tightly integrated into SGE.
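If it helps to see the user/system split outside of SGE: the ru_utime
and ru_stime fields come straight from the kernel's getrusage()
accounting, which you can poke at with Python's resource module. A
minimal sketch (the work sizes are arbitrary, just enough to register
measurable time on a typical Linux box):

```python
import os
import resource

# Userspace work: pure computation accumulates in ru_utime.
total = sum(i * i for i in range(2_000_000))

# Kernel work: repeated write() syscalls accumulate in ru_stime.
fd = os.open(os.devnull, os.O_WRONLY)
for _ in range(50_000):
    os.write(fd, b"x" * 4096)
os.close(fd)

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"ru_utime = {usage.ru_utime:.3f}s (userspace CPU)")
print(f"ru_stime = {usage.ru_stime:.3f}s (system-call CPU)")
```

Run interactively you can watch the two counters grow independently,
which is exactly the distinction qacct is reporting.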
We experienced similar confusion earlier this month with some Matlab
jobs using Matlab's parallel processing toolkit. It appears that if
the program controlled by execd spawns additional helper processes
using methods GridEngine doesn't understand, then the CPU time for
those helper processes is included here, but is not included in
ru_utime/ru_stime. I'm not entirely clear on which process-spawning
methods are not tightly integrated, but this explains why the total
CPU time your job used does not show up in ru_utime/ru_stime.

As Mike Serkov suggested, you can monitor the job on the node where
it's running to see more about what it's actually up to. Something is
using 7200 seconds of CPU time, which is why you're bumping into the
CPU time limit.

Also see this previous conversation for some additional details:
http://gridengine.org/pipermail/users/2014-January/007001.html

Cheers,
Chris

> mem           606.323GBs
> io            0.005GB
> iow           0.000s
> maxvmem       129.176MB
> arid          undefined
> ar_sub_time   undefined
> category      -U execd -q SmallTestcases -l mem_free=1G
>
> Simon
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
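For anyone curious why a helper process can "escape" the
ru_utime/ru_stime totals while still showing up in the overall cpu
figure: getrusage() only rolls a child's CPU time into the parent's
accounting when the parent actually wait()s for it. A daemonized
(double-forked) worker is reparented away and never reaped by the
job script, so its time vanishes from the script's rusage, even
though a process-group-based monitor can still see it. A small
Linux-only sketch of that effect (the loop sizes are arbitrary):

```python
import os
import resource

def burn_cpu():
    # A bit of pure userspace work.
    t = 0
    for i in range(3_000_000):
        t += i * i

# Child that we wait() for: its CPU time rolls up into
# RUSAGE_CHILDREN when it is reaped.
pid = os.fork()
if pid == 0:
    burn_cpu()
    os._exit(0)
os.waitpid(pid, 0)

# Double-forked "daemon" helper: the middle process exits at once,
# the grandchild is reparented to init, and this process never
# reaps it - so its CPU time never reaches our rusage counters.
pid = os.fork()
if pid == 0:
    if os.fork() == 0:   # grandchild does the work...
        burn_cpu()
        os._exit(0)
    os._exit(0)          # ...middle process exits immediately
os.waitpid(pid, 0)       # reaps only the middle process

kids = resource.getrusage(resource.RUSAGE_CHILDREN)
# Reflects only the first (waited-for) child's work.
print(f"children ru_utime = {kids.ru_utime:.3f}s")
```

This is only an analogy for SGE's behaviour, not its actual
implementation, but it shows the general mechanism by which CPU time
can be spent inside a job without appearing in the wait()-based
ru_utime/ru_stime totals.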