On 24.01.2013 at 21:20, Jake Carroll wrote:

> We figured it out!
> 
> 
> A specific user binary was not respecting the vf (virtual_free) memory
> complex and decided to use all the RAM on whatever node it landed on!

So it was killed by the oom-killer?
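
Exit status 137 decodes as 128 + 9, i.e. the job died on SIGKILL, which
is exactly what the oom-killer sends. A quick sketch for checking on the
execution host (assuming shell access there):

    # signals show up in the exit status as 128 + signal number
    $ kill -l $((137 - 128))
    KILL

    # the oom-killer records its victims in the kernel ring buffer
    $ dmesg | grep -i -e 'out of memory' -e 'killed process'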

Was this a hard limit (h_vmem) or only a consumable complex?
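
The distinction matters here: a plain consumable such as vf
(virtual_free) only steers scheduling and the node itself enforces
nothing, whereas h_vmem requested as a hard limit is applied via
setrlimit() on the execution host, so a runaway binary is killed instead
of eating the node's RAM. A minimal sketch, assuming the stock complex
list:

    # make h_vmem consumable so the scheduler also books it per host
    # (edit its line in the output of qconf -mc):
    h_vmem   h_vmem   MEMORY   <=   YES   YES   0   0

    # then request it per job; this limit is enforced, not just advisory
    $ qsub -l h_vmem=4G job.sh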

-- Reuti


> How this generated a 137, and the explanation we were given for what a
> 137 means, really threw us off, however!
> 
> Cheers.
> 
> --JC
> 
> On 25/01/13 3:59 AM, "Dave Love" <[email protected]> wrote:
> 
>> Jake Carroll <[email protected]> writes:
>> 
>>> Hi.
>>> 
>>> We've now shot the head node in the head (heh) and we're exploring
>>> killing off/restarting each execd on the compute nodes.
>>> 
>>> Do you recommend a kill -HUP on the process, or something more
>>> aggressive? This will in theory "kill" currently executing jobs on
>>> each compute host, we're assuming?
>> 
>> I can't remember what this refers to, but the init scripts for SGE 8
>> have a "restart" option which does softstop+start.
>> 
>>> Also, we just caught another one in the act, on one of the nodes that
>>> just threw the 137:
>>> 
>>> [root@compute-0-6 ~]# tail -f
>>> /opt/gridengine/default/spool/compute-0-6/messages
>>> 01/17/2013 08:03:15|  main|compute-0-6|W|reaping job "1371379" ptf
>>> complains: Job does not exist
>>> 01/17/2013 09:22:33|  main|compute-0-6|W|reaping job "1371379" ptf
>>> complains: Job does not exist
>>> 01/17/2013 09:24:55|  main|compute-0-6|W|reaping job "1371379" ptf
>>> complains: Job does not exist
>>> 01/17/2013 09:34:12|  main|compute-0-6|W|reaping job "1371379" ptf
>>> complains: Job does not exist
>>> 01/17/2013 10:06:45|  main|compute-0-6|E|removing unreferenced job
>>> 1371379.7545 without job report from ptf
>>> 01/17/2013 10:09:25|  main|compute-0-6|W|reaping job "1371379" ptf
>>> complains: Job does not exist
>>> 01/18/2013 17:10:52|  main|compute-0-6|W|can't register at qmaster
>>> "cluster.local": abort qmaster registration due to communication errors
>>> 01/18/2013 17:16:42|  main|compute-0-6|W|gethostbyname(cluster.local)
>>> took 20 seconds and returns TRY_AGAIN
>>> 
>>> 01/18/2013 17:25:37|  main|compute-0-6|E|commlib error: got select error
>>> (No route to host)
>> 
>> You'd better address the network errors before anything else.  As in the
>> tracker, I don't know what causes the PTF errors, though.
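
The TRY_AGAIN and the 20-second gethostbyname() delay point at flaky
name resolution on the node, so that is worth testing directly. A
sketch, using the qmaster name from the log:

    # does the node resolve the qmaster promptly and consistently?
    $ time getent hosts cluster.local
    $ ping -c 3 cluster.local

    # and which resolvers is it actually consulting?
    $ grep -v '^#' /etc/resolv.conf /etc/nsswitch.conf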
>> 
>>> What's most unusual about this is that these timestamps don't match up
>>> with the exit status 137 we just saw.
>> 
>> Look in the messages files for what does.
>> 
>>> This example job was running for two days or so, then just became
>>> unhappy today, then threw the 137:
>>> 
>>> Job 1307803 (b5_set11_9) Complete
>>> User             = someguy
>>> Queue            = [email protected]
>>> Host             = compute-0-6.local
>>> Start Time       = 01/14/2013 14:22:12
>>> End Time         = 01/21/2013 12:23:02
>> 
>> That's nearly a week, not two days.
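
A sketch of the arithmetic with GNU date (epoch seconds, integer days):

    $ start=$(date -d '01/14/2013 14:22:12' +%s)
    $ end=$(date -d '01/21/2013 12:23:02' +%s)
    $ echo $(( (end - start) / 86400 )) days
    6 days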
>> 
>> -- 
>> Community Grid Engine:  http://arc.liv.ac.uk/SGE/
> 
> 

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
