On 13.03.2012 at 12:46, Lars van der bijl wrote:

> On 13 March 2012 12:32, Reuti <[email protected]> wrote:
>> On 13.03.2012 at 12:03, Lars van der bijl wrote:
>> 
>>> On 13 March 2012 11:18, Reuti <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> On 13.03.2012 at 10:59, Lars van der bijl wrote:
>>>> 
>>>>> Hey everyone,
>>>>> 
>>>>> We're having the following problem.
>>>>> 
>>>>> Randomly, on some tasks, we start getting "CPU time limit exceeded". We
>>>> 
>>>> Do you see that in the SGE messages file on the execution host, or where
>>>> do you get the statement?
>>>> 
>>> 
>>> We get this in our stderr output.
>> 
>> Then I would say it's not a limit set by SGE. Can you set up any time limit in
>> the application itself?
> 
> Not that I am aware of. The application is rendering an image and has
> nothing set up to kill it after a time limit.
> We do have a limit on memory.
> 
> 
>> 
>> 
>>>>> don't specify a time limit. We do specify h_vmem.
>>>>> This only happens on some tasks and not others, even between identical tasks
>>>>> from a batch on the same machine.
>>>> 
>>>> It could be a limit set in the queue definition (h_cpu), or one specified for
>>>> some particular jobs (-l h_cpu=...).
>>>> 
>>>> The time for an SGE limit is usually mentioned in the messages file. Is it 
>>>> always the same time?
>>>> 
>>> 
>>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>>> general rescheduling on application error because: 03/13/2012 05:41:23
>>> [0:10105]: exit_status of job start = 100
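
To check whether the same job shows up repeatedly at similar times, the messages file can be filtered by job ID. A minimal Python sketch, assuming the pipe-separated layout of the lines quoted above; the field names and both helper functions are my own labels for illustration, not anything SGE defines:

```python
from typing import NamedTuple, Optional


class MessageLine(NamedTuple):
    # Field names are assumptions based on the quoted lines, not SGE terms.
    timestamp: str
    component: str
    host: str
    level: str
    text: str


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split one messages-file line into its five pipe-separated fields."""
    # Split on the first four pipes only, so a "|" in the text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # continuation line or unknown format
    return MessageLine(*parts)


def lines_for_job(lines, job_id: str):
    """Return the parsed entries whose text mentions 'job <job_id>'."""
    parsed = (parse_message_line(line) for line in lines)
    return [m for m in parsed if m is not None and f"job {job_id}" in m.text]


if __name__ == "__main__":
    sample = ["03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121"]
    for entry in lines_for_job(sample, "61607.121"):
        print(entry.timestamp, entry.level, entry.text)
```

Comparing the timestamps of such entries against the job start time would show whether the failure always happens after the same amount of time.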
>> 
>> So, the job was rescheduled (do you know why?), but the restart failed and 
>> put the job in error status (because of exit code 100). Do you see this?
> 
> To force SGE to error out or retry, we check the exit status of the
> task in the prolog. If it is anything other than 0 and it has retries left, it
> will exit 99 from the prolog; otherwise it exits with 100.
> We always have tasks dependent on the output, and we don't want them to start.
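
The decision described above could be sketched in Python roughly as follows. This is only an illustration of the stated rule (99 to reschedule, 100 to put the job in error state); the function name and its two parameters are hypothetical, and how the prolog actually obtains the task's status and retry count is not shown in this thread:

```python
import sys

RESCHEDULE = 99   # SGE reschedules the job
ERROR_STATE = 100  # SGE puts the job into error state


def prolog_exit_code(task_status: int, retries_left: int) -> int:
    """Map the checked task status to the prolog's exit code.

    Both arguments are hypothetical inputs; the real prolog would have
    to look them up from wherever the site stores them.
    """
    if task_status == 0:
        return 0  # nothing went wrong, let the job start normally
    if retries_left > 0:
        return RESCHEDULE
    return ERROR_STATE


if __name__ == "__main__":
    # Example: a failed task with no retries left ends in error state.
    sys.exit(prolog_exit_code(task_status=1, retries_left=0))
```

With this rule, an exit status of 100 in the messages file would simply mean that the check found a failed task with no retries remaining.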
> 
> could a SIGXCPU

Yes, SIGXCPU will generate this error message.
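
To see this outside of SGE: on Linux, a process that exceeds its soft CPU-time resource limit (RLIMIT_CPU) is sent SIGXCPU, and a shell reports a process killed that way as "CPU time limit exceeded". A minimal Python sketch that reproduces it with a one-second limit; the function name is just for illustration, and it assumes a Linux host whose hard CPU limit is not lower than the value requested:

```python
import resource
import signal
import subprocess
import sys


def run_with_cpu_limit(seconds: int) -> int:
    """Run a busy loop in a child with a soft CPU limit; return its returncode."""

    def limit_cpu():
        # Runs in the child before exec. Only the soft limit is lowered;
        # the hard limit is left as it was.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        resource.setrlimit(resource.RLIMIT_CPU, (seconds, hard))

    child = subprocess.run(
        [sys.executable, "-c", "while True: pass"],
        preexec_fn=limit_cpu,
    )
    # subprocess reports death by signal N as returncode -N.
    return child.returncode


if __name__ == "__main__":
    code = run_with_cpu_limit(1)
    if code == -signal.SIGXCPU:
        print("child was terminated by SIGXCPU")
    else:
        print("child ended with returncode", code)
```

If the rendering jobs die the same way, something is setting a CPU-time limit for them, whether through the queue, the job request, or the environment the job is started in.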

-- Reuti


> or a SIGTERM cause this?
> 
> 
>> 
>> Can you elaborate on what is going on there in detail - is it
>> supposed to fail if it's just rescheduled without cleaning up any former files
>> or so?
>> 
>> -- Reuti
>> 
>> 
>>> Unless [0:10105] is the limit, I'm not sure.
>>> 
>>> 
>>> 
>>>> -- Reuti
>> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
