One of the moses developers pointed out to me that this looked like the OOM killer in action.
I looked in dmesg on the node:

Out of memory: kill process 26224 (bash) score 1277831 or a child
Killed process 26493 (moses) vsz:20220460kB, anon-rss:16643824kB, file-rss:404kb

That node has 32 GB of RAM installed. So I guess the OS must have
decided that moses was using too much memory and killed it.

Mystery solved.

Thanks,
Lane

On Tue, Feb 28, 2012 at 12:51 AM, Rayson Ho <[email protected]> wrote:
> Hi Lane,
>
> Did you find out what's wrong with the jobs?
>
> Rayson
>
>
> On Fri, Feb 17, 2012 at 3:46 PM, Rayson Ho <[email protected]> wrote:
>> Nothing related to job limits changed in GE2011.11.
>>
>> Most likely it is your shell limit (default login profile) getting
>> into the process environment of your jobs.
>>
>> You can easily debug this by adding ulimit -a in your job script.
>>
>> Rayson
>>
>>
>>
>> On Fri, Feb 17, 2012 at 3:11 PM, Lane Schwartz <[email protected]> wrote:
>>> Hi all,
>>>
>>> A number of my jobs keep dying, and I'm having trouble tracking down
>>> what's going on. Any tips or help would be greatly appreciated.
>>>
>>> The job is a perl script that launches a binary (called moses) using
>>> the perl "system()" call. The end of the log file is below. I know
>>> that the perl script is responsible for printing out the last two
>>> lines (starting with "Exit code: 137"), but I can't figure out who is
>>> responsible for printing out the first line (starting with "sh: line
>>> 1: 29188 Killed"). I know that it's not the perl script, and I'm
>>> reasonably sure that it's not the moses binary.
>>>
>>> I suspect that maybe the grid engine is killing the job, but I don't
>>> know how to track down that hypothesis.
>>> Here's the log:
>>>
>>> sh: line 1: 29188 Killed
>>> /free/lane/slm-merging-trunk/moses-cmd/src/moses -config
>>> /scratch4/lane/2011-12-15_europarl/config/de-en/filtered/filtered.ttable20.dist05.synlm50.ini
>>> -inputtype 0 -w -0.178571 -slm 0.178571 -lm 0.089286 -d 0.053571
>>> 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm 0.035714
>>> 0.035714 0.035714 0.035714 0.035714 -n-best-list run1.best100.out 100
>>> -input-file /scratch4/lane/2011-12-15_europarl/corpus/dev.tok.norm.de
>>>> run1.out
>>> Exit code: 137
>>> The decoder died. CONFIG WAS -w -0.178571 -slm 0.178571 -lm 0.089286
>>> -d 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 0.053571 -tm
>>> 0.035714 0.035714 0.035714 0.035714 0.035714
>>>
>>>
>>> My understanding is that an exit code 137 indicates that the process
>>> received kill signal 9.
>>>
>>>
>>> For what it's worth, the results of running qacct -j on the job after
>>> it died are listed below.
>>>
>>> ==============================================================
>>> qname        all.q
>>> hostname     quad19.scream.lab
>>> group        scream
>>> owner        lane
>>> project      NONE
>>> department   defaultdepartment
>>> jobname      de-en.mert
>>> jobnumber    20337
>>> taskid       undefined
>>> account      sge
>>> priority     0
>>> qsub_time    Mon Feb 13 14:08:54 2012
>>> start_time   Mon Feb 13 14:09:05 2012
>>> end_time     Wed Feb 15 14:54:52 2012
>>> granted_pe   NONE
>>> slots        1
>>> failed       0
>>> exit_status  2
>>> ru_wallclock 175547
>>> ru_utime     175460.360
>>> ru_stime     21.147
>>> ru_maxrss    23910412
>>> ru_ixrss     0
>>> ru_ismrss    0
>>> ru_idrss     0
>>> ru_isrss     0
>>> ru_minflt    6545996
>>> ru_majflt    7568
>>> ru_nswap     0
>>> ru_inblock   3067192
>>> ru_oublock   22064
>>> ru_msgsnd    0
>>> ru_msgrcv    0
>>> ru_nsignals  0
>>> ru_nvcsw     9545
>>> ru_nivcsw    256918
>>> cpu          175481.507
>>> mem          2516411.448
>>> io           4.733
>>> iow          0.000
>>> maxvmem      25.026G
>>> arid         undefined
>>>
>>>
>>> I'm running under OGS GE2011.11.
>>> A colleague suggested that there may
>>> be some sort of configuration where the grid engine is killing the
>>> jobs after 48 hours or so. I know that I've successfully run jobs
>>> longer than that under my old SGE setup, but not yet under the new OGS
>>> setup.
>>>
>>> As far as I can tell, all of my hard and soft limits are set to INFINITY.
>>>
>>> Thanks,
>>> Lane
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users

-- 
When a place gets crowded enough to require ID's, social collapse is not
far away. It is time to go elsewhere. The best thing about space travel
is that it made it possible to go elsewhere.
-- R.A. Heinlein, "Time Enough For Love"
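
For anyone who lands on this thread with the same symptom: the "Exit code: 137" reading in the original message follows the usual shell convention that a status above 128 means the command was terminated by signal (status - 128). Below is a minimal sketch of that arithmetic. The function name describe_status is made up for illustration; it is not part of grid engine or moses.

```shell
#!/bin/sh
# describe_status is a hypothetical helper, not a grid engine or moses tool.
# By shell convention, an exit status above 128 means the command was
# terminated by a signal whose number is (status - 128).
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        echo "killed by signal $sig"
    else
        echo "exited with status $status"
    fi
}

# 137 is the decoder's exit code from the log in this thread: 137 - 128 = 9 (SIGKILL).
describe_status 137
```

A SIGKILL by itself does not say who sent it. In this case the sender was the kernel's OOM killer, which is why the evidence turned up in dmesg on the execution node and not in the grid engine accounting.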
