Hi all,

A number of my jobs keep dying, and I'm having trouble tracking down what's going on. Any tips or help would be greatly appreciated.

The job is a perl script that launches a binary (called moses) using the perl "system()" call. The end of the log file is below.

I know that the perl script is responsible for printing the last two lines (starting with "Exit code: 137"), but I can't figure out what is printing the first line (starting with "sh: line 1: 29188 Killed"). I know that it's not the perl script, and I'm reasonably sure that it's not the moses binary. I suspect that the grid engine may be killing the job, but I don't know how to test that hypothesis.

Here's the log:

sh: line 1: 29188 Killed /free/lane/slm-merging-trunk/moses-cmd/src/moses -config /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714 0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100 -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de > run1.out
Exit code: 137
The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714 0.035714 0.035714 0.035714 0.035714

My understanding is that an exit code of 137 indicates that the process received signal 9 (SIGKILL), since 137 = 128 + 9.

For what it's worth, the results of running qacct -j on the job after it died are listed below.

==============================================================
qname        all.q
hostname     quad19.scream.lab
group        scream
owner        lane
project      NONE
department   defaultdepartment
jobname      de-en.mert
jobnumber    20337
taskid       undefined
account      sge
priority     0
qsub_time    Mon Feb 13 14:08:54 2012
start_time   Mon Feb 13 14:09:05 2012
end_time     Wed Feb 15 14:54:52 2012
granted_pe   NONE
slots        1
failed       0
exit_status  2
ru_wallclock 175547
ru_utime     175460.360
ru_stime     21.147
ru_maxrss    23910412
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    6545996
ru_majflt    7568
ru_nswap     0
ru_inblock   3067192
ru_oublock   22064
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     9545
ru_nivcsw    256918
cpu          175481.507
mem          2516411.448
io           4.733
iow          0.000
maxvmem      25.026G
arid         undefined

I'm running under OGS GE2011.11.

A colleague suggested that there may be some sort of configuration where the grid engine is killing the jobs after 48 hours or so. I know that I've successfully run jobs longer than that under my old SGE setup, but not yet under the new OGS setup. As far as I can tell, all of my hard and soft limits are set to INFINITY.

Thanks,
Lane
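
P.S. In case it helps anyone check my reading of the numbers above, here is a minimal Python sketch of the arithmetic. It assumes the usual shell convention that a command terminated by signal N is reported as exit code 128 + N, and it assumes a Unix-like system for the signal name lookup. The function name decode_exit_code is just something I made up for illustration; it is not part of perl, moses, or the grid engine. The wall-clock value is the ru_wallclock figure from the qacct output.

```python
import signal


def decode_exit_code(code):
    """Split a shell-style exit code into (kind, value).

    Assumption: the shell reports a command killed by signal N
    as exit code 128 + N; anything else is a plain exit status.
    """
    if code > 128:
        return ("signal", code - 128)
    return ("exit", code)


# Exit code reported by the perl script in the log.
kind, value = decode_exit_code(137)
print(kind, value, signal.Signals(value).name)

# ru_wallclock from qacct is in seconds; convert to hours to
# compare against the suspected 48-hour cutoff.
wallclock_hours = 175547 / 3600
print(round(wallclock_hours, 2))
```

So the job ran for a bit under 49 hours before it was killed, which is at least in the neighborhood of my colleague's 48-hour theory, though by itself it doesn't tell me who sent the signal.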
