On 13 March 2012 at 12:46, Lars van der bijl wrote:

> On 13 March 2012 12:32, Reuti <[email protected]> wrote:
>> On 13 March 2012 at 12:03, Lars van der bijl wrote:
>>
>>> On 13 March 2012 11:18, Reuti <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> On 13 March 2012 at 10:59, Lars van der bijl wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> We're having the following problem.
>>>>>
>>>>> randomly on some tasks we start getting "CPU time limit exceeded". we
>>>>
>>>> Do you see that in the messages file of SGE on the execution host, or
>>>> where do you get the statement?
>>>>
>>>
>>> we get this in our stderr output.
>>
>> Then I would say it's not a limit set by SGE. Can you set up any time limit
>> in the application itself?
>
> not that I am aware of. the application is rendering an image and has
> nothing set up to kill it on time.
> we do have a limit on memory.
>
>>
>>>>> don't specify a time limit. we do specify h_vmem.
>>>>> this only happens on some tasks and not others, even between the same
>>>>> tasks from a batch on the same machine.
>>>>
>>>> It could be a limit set in the queue definition (h_cpu), or specified for
>>>> some particular jobs (-l h_cpu=...).
>>>>
>>>> The time for an SGE limit is usually mentioned in the messages file. Is it
>>>> always the same time?
>>>>
>>>
>>> 03/13/2012 05:41:24|worker|nano|W|rescheduling job 61607.121
>>> 03/13/2012 05:41:24|worker|nano|W|job 61607.131 failed on host louie
>>> general rescheduling on application error because: 03/13/2012 05:41:23
>>> [0:10105]: exit_status of job start = 100
>>
>> So, the job was rescheduled (do you know why?), but the restart failed and
>> put the job in error status (because of exit code 100). Do you see this?
>
> to force sge to error out or retry we check the exit status of the
> task in the prolog. if it is anything other than 0 and it has retries left,
> it will exit 99 from the prolog. otherwise it exits with 100.
> we always have tasks dependent on the output and we don't want them to start.
>
> could a SIGXCPU

Yes, SIGXCPU will generate this error message.

-- Reuti

> or a SIGTERM cause this?
>
>>
>> Can you elaborate in some way on what is going on there in detail - is it
>> supposed to fail if it's just rescheduled without cleaning any former files
>> or so?
>>
>> -- Reuti
>>
>>> unless [0:10105] is the limit i'm not sure.
>>>
>>>> -- Reuti
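
One way to confirm the SIGXCPU explanation is to look at the CPU-time resource limit the task actually inherits, and to reproduce the kill in isolation. The following is only a minimal sketch, assuming Linux and Python 3; it is not taken from the thread, and the helper names `describe_cpu_limit` and `run_with_cpu_limit` are hypothetical. The kernel sends SIGXCPU when a process exceeds its soft RLIMIT_CPU value, and a shell typically reports a child killed that way as "CPU time limit exceeded".

```python
import resource
import signal
import subprocess
import sys


def describe_cpu_limit():
    """Return the soft/hard CPU-time limits of the current process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value}s"

    return f"soft={fmt(soft)} hard={fmt(hard)}"


def run_with_cpu_limit(seconds):
    """Run a busy loop in a child with a soft CPU limit; return its return code.

    A negative return code -N means the child was killed by signal N.
    """
    def set_limit():
        # Runs in the child just before exec: lower only the soft limit,
        # never asking for more than the existing hard limit allows.
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        soft = seconds if hard == resource.RLIM_INFINITY else min(seconds, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))

    child = subprocess.run(
        [sys.executable, "-c", "while True: pass"],
        preexec_fn=set_limit,
    )
    return child.returncode


if __name__ == "__main__":
    print("limits seen by this process:", describe_cpu_limit())
    code = run_with_cpu_limit(1)
    if code == -signal.SIGXCPU:
        print("child was killed by SIGXCPU after exceeding its CPU limit")
    else:
        print("child ended with return code", code)
```

Calling `describe_cpu_limit()` at the start of an affected task (or running `ulimit -t` in the job script) would show whether a finite CPU limit is in effect on that host; if it reports "unlimited", the message is probably coming from somewhere else.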
