We figured it out!

A specific user binary was not respecting the vf (virtual_free) memory
complex and decided to use all the RAM on whichever node it landed on. How
this generated a 137, and the explanation we'd been given for what a 137
meant, really threw us off, though!

Cheers.

--JC

On 25/01/13 3:59 AM, "Dave Love" <[email protected]> wrote:

>Jake Carroll <[email protected]> writes:
>
>> Hi.
>>
>> We've now shot the head node in the head (heh) and we're exploring
>> killing off/restarting each execd on the compute nodes.
>>
>> Do you recommend a kill -HUP on the process, or something more
>> aggressive? This will in theory "kill" currently executing jobs on each
>> compute host, we're assuming?
>
>I can't remember what this refers to, but the init scripts for SGE 8
>have a "restart" option which does softstop+start.
>
>> Also, we just caught another one in the act, on one of the nodes that
>> just threw the 137:
>>
>> [root@compute-0-6 ~]# tail -f
>> /opt/gridengine/default/spool/compute-0-6/messages
>> 01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf
>> complains: Job does not exist
>> 01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf
>> complains: Job does not exist
>> 01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf
>> complains: Job does not exist
>> 01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf
>> complains: Job does not exist
>> 01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job
>> 1371379.7545 without job report from ptf
>> 01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf
>> complains: Job does not exist
>> 01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster
>> "cluster.local": abort qmaster registration due to communication errors
>> 01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local)
>> took 20 seconds and returns TRY_AGAIN
>>
>> 01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error
>> (No route to host)
>
>You'd better address the network errors before anything else. As in the
>tracker, I don't know what causes the PTF errors, though.
>
>> What's most unusual about this is that these timestamps don't match up
>> with the error 137 we just saw.
>
>Look in the messages files for what does.
>
>> This example job was running for two days or so, then just became
>> unhappy today, then threw the 137:
>>
>> Job 1307803 (b5_set11_9) Complete
>> User       = someguy
>> Queue      = [email protected]
>> Host       = compute-0-6.local
>> Start Time = 01/14/2013 14:22:12
>> End Time   = 01/21/2013 12:23:02
>
>That's nearly a week, not two days.
>
>--
>Community Grid Engine: http://arc.liv.ac.uk/SGE/
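
The 137 itself is just shell exit-status arithmetic: 128 plus the signal
number, and signal 9 is SIGKILL, which is what the kernel OOM killer sends
when a process exhausts memory. A minimal demonstration at any POSIX shell:

    $ sh -c 'kill -KILL $$'   # child shell kills itself with SIGKILL (signal 9)
    Killed
    $ echo $?                 # parent shell reports 128 + 9
    137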
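On the root cause: virtual_free (vf) is a consumable that only steers
scheduling; nothing stops a binary from allocating past it. To have the job
stopped at the requested size instead of eating the node, pair it with
h_vmem, which execd enforces as a hard per-job limit. A sketch, assuming
both complexes are defined on your cluster (myjob.sh and 4G are
placeholders):

    $ qconf -sc | egrep 'virtual_free|h_vmem'   # confirm the complexes exist and are consumable
    $ qsub -l vf=4G,h_vmem=4G myjob.sh          # vf for scheduling, h_vmem for enforcement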
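On Dave's restart point: the soft path never needs kill -HUP and leaves
running jobs alone, since they stay attached to their shepherds. The paths
below are assumptions based on the /opt/gridengine spool path in the logs
above, and <cluster_name> is a placeholder; adjust for your $SGE_ROOT and
cell:

    # per host, from the qmaster; running jobs are untouched
    $ qconf -ke compute-0-6
    $ ssh compute-0-6 '/opt/gridengine/default/common/sgeexecd start'

    # or, where the SGE 8 init script is installed, in one step:
    $ ssh compute-0-6 '/etc/init.d/sgeexecd.<cluster_name> restart'   # softstop + start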
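Separately, the 20-second gethostbyname() and "No route to host" lines are
resolver and network problems, not Grid Engine ones, and are worth chasing
before trusting anything else in those spool logs. getent walks the same
NSS lookup path the daemons use:

    $ time getent hosts cluster.local   # ~20 s before an answer suggests a dead nameserver in /etc/resolv.conf
    $ ping -c 1 cluster.local           # "No route to host" points at routing/firewall rather than DNS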
