On 13.03.2012 at 15:08, Lars van der bijl wrote:

> On 13 March 2012 13:55, Reuti <[email protected]> wrote:
>> On 13.03.2012 at 12:46, Lars van der bijl wrote:
>> 
>>> On 13 March 2012 12:32, Reuti <[email protected]> wrote:
>>>> On 13.03.2012 at 12:03, Lars van der bijl wrote:
>>>> 
>>>>> On 13 March 2012 11:18, Reuti <[email protected]> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> On 13.03.2012 at 10:59, Lars van der bijl wrote:
>>>>>> 
>>>>>>> Hey everyone,
>>>>>>> 
>>>>>>> We're having the following problem.
>>>>>>> 
>>>>>>> Randomly, on some tasks, we start getting "CPU time limit exceeded". We
>>>>>> 
>>>>>> Do you see that in SGE's messages file on the execution host, or 
>>>>>> where do you get the statement?
>>>>>> 
>>>>> 
>>>>> we get this in our stderr output.
>>>> 
>>>> Then I would say it's not a limit set by SGE. Can you set a time limit in 
>>>> the application itself?
>>> 
>>> Not that I am aware of. The application is rendering an image and has
>>> nothing set up to kill it after a given time.
>>> We do have a limit on memory.
>>> 
>>> 
>>>> 
>>>> 
>>>>>>> don't specify a time limit. We do specify h_vmem.
>>>>>>> This only happens on some tasks and not others, even between identical
>>>>>>> tasks from a batch on the same machine.
>>>>>> 
>>>>>> It could be a limit set in the queue definition (h_cpu), or one 
>>>>>> requested for some particular jobs (-l h_cpu=...).
>>>>>> 
>>>>>> The time for an SGE limit is usually mentioned in the messages file. Is 
>>>>>> it always the same time?
>>>>>> 
>>>>> 
>>>>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>>>>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>>>>> general rescheduling on application error because: 03/13/2012 05:41:23
>>>>> [0:10105]: exit_status of job start = 100
>>>> 
>>>> So, the job was rescheduled (do you know why?), but the restart failed and 
>>>> put the job in error status (because of exit code 100). Do you see this?
>>> 
>>> To force SGE to error out or retry, we check the exit status of the
>>> task in the prolog. If it is anything other than 0 and it has retries
>>> left, the prolog exits with 99; otherwise it exits with 100.
>>> We always have tasks that depend on the output, and we don't want them
>>> to start.
>>> 
>>> could a SIGXCPU
>> 
>> Yes, SIGXCPU will generate this error message.
> 
> I've put a trap in our run script to catch SIGXCPU and SIGTERM and make
> it exit with 100. We were getting jobs that were killed without good
> cause, and their dependencies started up anyway.
> That's where the 100 comes from then, I guess.
> 
> Still no idea what could cause the SIGXCPU. Could it be sent because of
> mem_free or s_vmem?

Yes, it's even sent for s_vmem as a warning (man queue_conf). Did you set 
s_vmem in addition to h_vmem?
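
For illustration, catching the warning in a run script could look roughly
like the sketch below. It assumes a POSIX shell on Linux; run_guarded and the
log messages are made-up names for this example, not something SGE provides,
and the exit code 100 is simply the one from your description. The command is
started in the background so that `wait` is interrupted as soon as the signal
arrives and the trap can run right away:

```shell
# run_guarded CMD [ARGS...]  (hypothetical helper, not part of SGE)
# Runs CMD and turns a SIGXCPU or SIGTERM sent to the wrapper into
# exit code 100, logging which signal it was to stderr.
run_guarded() {
    (
        # Assumption: SIGXCPU here is a soft-limit warning (s_cpu or s_vmem).
        trap 'echo "caught SIGXCPU (soft limit warning?)" >&2; exit 100' XCPU
        trap 'echo "caught SIGTERM" >&2; exit 100' TERM

        # Run in the background: a trapped signal interrupts `wait`,
        # whereas a foreground command would delay the trap until it ends.
        "$@" &
        wait $!
    )
}

# Example call; ./render_frame.sh is a placeholder for the real command:
# run_guarded ./render_frame.sh
```

The line on stderr then shows whether it was SIGXCPU or SIGTERM, which should
help to tell a soft limit warning apart from an ordinary kill.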

-- Reuti


>> 
>> -- Reuti
>> 
>> 
>>> or a SIGTERM cause this?
>>> 
>>> 
>>>> 
>>>> Can you elaborate on what is going on there in detail? Is it 
>>>> supposed to fail if it's just rescheduled without cleaning up any 
>>>> former files or so?
>>>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>>> Unless [0:10105] is the limit, I'm not sure.
>>>>> 
>>>>> 
>>>>> 
>>>>>> -- Reuti
>>>> 
>> 


