On 24.01.2013, at 21:20, Jake Carroll wrote:

> We figured it out!
>
> A specific user binary was not respecting the vf memory complex and
> decided to use all the RAM on whichever nodes it landed on!
So it was killed by the oom-killer? Was this a hard limit (h_vmem) or only a
consumable complex?

-- Reuti

> How this generated a 137, and the explanation we were given of what a 137
> meant, really threw us off, however!
>
> Cheers.
>
> --JC
>
> On 25/01/13 3:59 AM, "Dave Love" <[email protected]> wrote:
>
>> Jake Carroll <[email protected]> writes:
>>
>>> Hi.
>>>
>>> We've now shot the head node in the head (heh) and we're exploring
>>> killing off/restarting each execd on the compute nodes.
>>>
>>> Do you recommend a kill -HUP on the process, or something more
>>> aggressive? This will in theory "kill" currently executing jobs on
>>> each compute host, we're assuming?
>>
>> I can't remember what this refers to, but the init scripts for SGE 8
>> have a "restart" option which does softstop+start.
>>
>>> Also, we just caught another one in the act, on one of the nodes that
>>> just threw the 137:
>>>
>>> [root@compute-0-6 ~]# tail -f /opt/gridengine/default/spool/compute-0-6/messages
>>> 01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>>> 01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>>> 01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>>> 01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>>> 01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job 1371379.7545 without job report from ptf
>>> 01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>>> 01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster "cluster.local": abort qmaster registration due to communication errors
>>> 01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took 20 seconds and returns TRY_AGAIN
>>>
>>> 01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error (No route to host)
>>
>> You'd better address the network errors before anything else. As in the
>> tracker, I don't know what causes the PTF errors, though.
>>
>>> What's most unusual about this is that these time stamps don't match
>>> up with the error 137 we just saw.
>>
>> Look in the messages files for what does.
>>
>>> This example job was running for two days or so, then just became
>>> unhappy today, then threw the 137:
>>>
>>> Job 1307803 (b5_set11_9) Complete
>>> User       = someguy
>>> Queue      = [email protected]
>>> Host       = compute-0-6.local
>>> Start Time = 01/14/2013 14:22:12
>>> End Time   = 01/21/2013 12:23:02
>>
>> That's nearly a week, not two days.
>>
>> --
>> Community Grid Engine: http://arc.liv.ac.uk/SGE/

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
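
A note on the exit status that caused the confusion above: 137 is 128 + 9,
i.e. the shell's way of reporting that the job's process was terminated by
signal 9 (SIGKILL), which is exactly what the kernel's oom-killer delivers.
It is easy to reproduce:

    $ sh -c 'kill -KILL $$'; echo $?
    Killed
    137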
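
The distinction behind Reuti's question: at most sites virtual_free (vf) is
only used by the scheduler to decide where a job may run; nothing enforces it
on the execution host, so a binary that ignores its own request can still
exhaust the node's RAM and be shot by the oom-killer, giving the 137. A
requested h_vmem, by contrast, is also set as a hard resource limit on the
job's processes, so the job fails on its own before it can take the node
down. A minimal sketch of the two requests (job.sh and the 4G figure are
placeholders; the exact behaviour depends on the site's complex
configuration, see qconf -sc):

    # scheduling-only request: the scheduler accounts for 4G when placing
    # the job, but the job can overrun it at run time
    qsub -l vf=4G job.sh

    # hard limit: allocations beyond 4G fail and the job dies,
    # instead of the node
    qsub -l h_vmem=4G job.sh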
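
And on the earlier execd question: a kill -HUP is not what you want there.
The "softstop" argument of the SGE init scripts stops sge_execd while leaving
running jobs (and their shepherds) alone, and "restart" is, as Dave says,
softstop followed by start. A sketch, assuming a default installation; the
script's suffix varies with the cluster name:

    # on each compute node ("<cluster>" is a placeholder):
    /etc/init.d/sgeexecd.<cluster> softstop   # execd exits, jobs keep running
    /etc/init.d/sgeexecd.<cluster> start      # new execd re-registers the jobs

    # or centrally from the qmaster; without the 'j' flag the jobs survive:
    qconf -ke compute-0-6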

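As for the gethostbyname TRY_AGAIN and "No route to host" entries in the
messages file: those point at name-resolution and basic connectivity problems
between the node and the qmaster, and are worth ruling out before chasing the
PTF warnings. Two quick checks from the affected node (6444 is the default
qmaster port; substitute yours):

    getent hosts cluster.local     # should answer promptly, same address every time
    nc -z -w 5 cluster.local 6444 && echo qmaster reachable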