Re: [gridengine users] Jobs killed unexpectedly

Reuti Fri, 17 Feb 2012 15:29:22 -0800

On 17.02.2012 at 21:11, Lane Schwartz wrote:

> A number of my jobs keep dying, and I'm having trouble tracking down
> what's going on. Any tips or help would be greatly appreciated.
> 
> The job is a perl script that launches a binary (called moses) using
> the perl "system()" call. The end of the log file is below. I know
> that the perl script is responsible for printing out the last two
> lines (starting with "Exit code: 137"), but I can't figure out who is
> responsible for printing out the first line (starting with "sh: line
> 1: 29188 Killed"). I know that it's not the perl script, and I'm
> reasonably sure that it's not the moses binary.
> 
> I suspect that maybe the grid engine is killing the job, but I don't
> know how to track down that hypothesis. Here's the log:
> 
> sh: line 1: 29188 Killed
> /free/lane/slm-merging-trunk/moses-cmd/src/moses -config
> /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
> -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571
> 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714
> 0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100
> -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
>> run1.out
> Exit code: 137
> The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
> 0.035714 0.035714 0.035714 0.035714 0.035714
> 
> 
> My understanding is that an exit code 137 indicates that the process
> received kill signal 9.


Yep. Is there anything in the messages file of the qmaster,
$SGE_ROOT/default/spool/qmaster/messages, or in the one of the node quad19?
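
For reference, a shell reports a child that died from a signal as 128 plus
the signal number, so 137 is 128 + 9 (SIGKILL). A minimal sketch of that
decoding follows; the function name decode_status is made up for
illustration and is not part of SGE or Perl:

```shell
#!/bin/sh
# decode_status is a hypothetical helper, not part of SGE or Perl.
# Shells report "killed by signal N" as exit status 128 + N.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

decode_status 137   # prints: killed by signal 9
```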

If a limit like the suspected 48-hour one were configured, it would show up in
the queue configuration:

$ qconf -sq all.q
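
To narrow that output down to the limits that could actually kill a job,
something like the following might help. It is only a sketch: the helper name
finite_limits is made up, and the attribute names are the s_*/h_* resource
limits described in queue_conf(5).

```shell
#!/bin/sh
# finite_limits is a hypothetical helper: it reads `qconf -sq <queue>` output
# on stdin and prints every s_*/h_* resource limit that is not INFINITY.
finite_limits() {
    awk '$1 ~ /^[sh]_(rt|cpu|fsize|data|stack|core|rss|vmem)$/ && $2 != "INFINITY"'
}

# Usage (needs a working SGE client environment):
#   qconf -sq all.q | finite_limits
```

If this prints nothing, no finite limit is set at the queue level.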

But then the complete process group would be killed, not only the called
binary. Since Perl receives the result of the killed binary (and keeps
running on its own), it can then exit with an exit code of 2 itself; from
SGE's point of view the job didn't fail and wasn't killed by SGE.
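
The effect can be reproduced without SGE. In the sketch below a shell
function stands in for the Perl script (run_wrapper is a made-up name): the
child is killed with signal 9, the wrapper sees 137, and the wrapper itself
ends with status 2, which is all the caller gets to see. The "sh: line 1:
... Killed" line in the log is most likely the report of the /bin/sh that
Perl's system() starts to handle the ">> run1.out" redirection.

```shell
#!/bin/sh
# Stand-in for the Perl wrapper: run a child that gets SIGKILLed,
# report its status, then finish with a status of our own choosing.
run_wrapper() {
    child=0
    sh -c 'kill -9 $$' || child=$?   # the child shell kills itself with signal 9
    echo "Exit code: $child"
    return 2                         # the wrapper's own status, like exit_status 2
}

status=0
run_wrapper || status=$?
echo "wrapper status: $status"
```

The calling shell may additionally print a "Killed" notice on stderr, depending
on which shell is used.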

Another point to check is /var/log/messages: was the moses process killed by
the oom-killer (out-of-memory), or did it fill up any disk in /scratch?
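
A possible way to look for this on the execution node is sketched below. The
helper name oom_hits is made up, and the search patterns are typical Linux
kernel messages that may differ between kernel versions, so treat them as an
assumption:

```shell
#!/bin/sh
# oom_hits is a hypothetical helper: print log lines that look like
# OOM-killer activity. The patterns are typical Linux kernel messages
# and may differ between kernel versions.
oom_hits() {
    grep -i -E 'oom-killer|out of memory|killed process' "${1:-/var/log/messages}"
}

# Usage on the execution node (reading the log usually needs root):
#   oom_hits /var/log/messages
#   df -h /scratch4        # is the scratch file system full?
```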

As the maxvmem is reported as 25 GB: did you set up/request the memory on the
node to avoid running out of memory, and does the node have enough of it?
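
To compare the 25 GB with what the node really has, a small helper like the
one below could be used. The name mem_total_gib is made up; the qsub line in
the comments assumes that h_vmem is set up as a requestable resource on the
cluster, and job.sh is a placeholder for the real job script.

```shell
#!/bin/sh
# mem_total_gib is a hypothetical helper: print the MemTotal value of a
# /proc/meminfo-style file in GiB (the kernel reports it in kB).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%.1f\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Usage on the execution node; compare the result with maxvmem (25.026G):
#   mem_total_gib
# Requesting memory at submission could look like this, ASSUMING h_vmem is
# requestable on your cluster (it then also acts as a hard limit):
#   qsub -l h_vmem=26G job.sh
```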

-- Reuti

> For what it's worth, the results of running qacct -j on the job after
> it died are listed below.
> 
> ==============================================================
> qname        all.q
> hostname     quad19.scream.lab
> group        scream
> owner        lane
> project      NONE
> department   defaultdepartment
> jobname      de-en.mert
> jobnumber    20337
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon Feb 13 14:08:54 2012
> start_time   Mon Feb 13 14:09:05 2012
> end_time     Wed Feb 15 14:54:52 2012
> granted_pe   NONE
> slots        1
> failed       0
> exit_status  2
> ru_wallclock 175547
> ru_utime     175460.360
> ru_stime     21.147
> ru_maxrss    23910412
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    6545996
> ru_majflt    7568
> ru_nswap     0
> ru_inblock   3067192
> ru_oublock   22064
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     9545
> ru_nivcsw    256918
> cpu          175481.507
> mem          2516411.448
> io           4.733
> iow          0.000
> maxvmem      25.026G
> arid         undefined
> 
> 
> I'm running under OGS GE2011.11. A colleague suggested that there may
> be some sort of configuration where the grid engine is killing the
> jobs after 48 hours or so. I know that I've successfully run jobs
> longer than that under my old SGE setup, but not yet under the new OGS
> setup.
> 
> As far as I can tell, all of my hard and soft limits are set to INFINITY.
> 
> Thanks,
> Lane
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users


