We figured it out!

A specific user binary was not respecting the vf memory complex and decided
to use all the RAM on whichever nodes it landed on!

How this generated a 137, and the explanation we were given for what a 137
meant, really threw us off, however!
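
For anyone finding this thread later, here is a minimal sketch of how a 137
can be read, assuming it is an ordinary shell-style exit status (128 plus a
signal number). Under that assumption 137 is signal 9, SIGKILL, which would
fit a process being killed after eating all the memory on a node. The
function name is made up for illustration, and nothing in the logs quoted
below confirms who sent the signal.

```python
import signal


def decode_exit_status(status):
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the common convention that a status above 128 means
    "terminated by signal (status - 128)".
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = "unknown"
        return ("signal", signum, name)
    # Otherwise treat it as a plain exit code from the program itself.
    return ("exit", status, None)


if __name__ == "__main__":
    print(decode_exit_status(137))
```

On a Linux host this should report signal 9 for a status of 137.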

Cheers.

--JC

On 25/01/13 3:59 AM, "Dave Love" <[email protected]> wrote:

>Jake Carroll <[email protected]> writes:
>
>> Hi.
>>
>> We've now shot the head node in the head (heh) and we're exploring
>> killing off/restarting each execd on the compute nodes.
>>
>> Do you recommend a kill -HUP on the process, or something more
>> aggressive? This will in theory "kill" currently executing jobs on
>> each compute host, we're assuming?
>
>I can't remember what this refers to, but the init scripts for SGE 8
>have a "restart" option which does softstop+start.
>
>> Also, we just caught another one in the act, on one of the nodes that
>> just threw the 137:
>>
>> [root@compute-0-6 ~]# tail -f /opt/gridengine/default/spool/compute-0-6/messages
>> 01/17/2013 08:03:15|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>> 01/17/2013 09:22:33|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>> 01/17/2013 09:24:55|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>> 01/17/2013 09:34:12|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>> 01/17/2013 10:06:45|  main|compute-0-6|E|removing unreferenced job 1371379.7545 without job report from ptf
>> 01/17/2013 10:09:25|  main|compute-0-6|W|reaping job "1371379" ptf complains: Job does not exist
>> 01/18/2013 17:10:52|  main|compute-0-6|W|can't register at qmaster "cluster.local": abort qmaster registration due to communication errors
>> 01/18/2013 17:16:42|  main|compute-0-6|W|gethostbyname(cluster.local) took 20 seconds and returns TRY_AGAIN
>>
>> 01/18/2013 17:25:37|  main|compute-0-6|E|commlib error: got select error (No route to host)
>
>You'd better address the network errors before anything else.  As in the
>tracker, I don't know what causes the PTF errors, though.
>
>> What's most unusual, about this, is that these time stamps don't match
>> up with the error 137 we just saw.
>
>Look in the messages files for what does.
>
>> This example job was running for two days or so, then just became
>> unhappy today, then threw the 137:
>>
>> Job 1307803 (b5_set11_9) Complete
>> User             = someguy
>> Queue            = [email protected]
>> Host             = compute-0-6.local
>> Start Time       = 01/14/2013 14:22:12
>> End Time         = 01/21/2013 12:23:02
>
>That's nearly a week, not two days.
>
>-- 
>Community Grid Engine:  http://arc.liv.ac.uk/SGE/


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
