Hi Alex,

That's the correct behavior (for SSTATE_OPEN_OUTPUT). Otherwise, if this failure put the queue into the error state, a user could easily DoS the cluster by pointing the input or output file at a path that the user can't open.

Rayson

On Tue, Sep 4, 2012 at 2:50 PM, Alex Chekholko <[email protected]> wrote:
> Hi,
>
> I have a cluster with Rayson's OGE from Oct 2011.
>
> I see an unusual issue: our queue instances don't error out when a user's
> job fails.
>
> We have an underlying issue with the filesystem, and sometimes the compute
> nodes lose filesystem access. A job gets dispatched, errors out with
>
> failed 26 : opening input/output file
>
> and then lots of other jobs go to that same node and error out before the
> filesystem comes back.
>
> IIRC, the queue should switch to error state when the first job errors out.
> But this isn't happening here. Is there some setting I can check?
>
> I see the documentation says "A job enters the error state when Grid Engine
> tried to execute a job in a queue, but it failed for a reason that is
> considered specific to the job. A queue enters the error state when Grid
> Engine tried to execute a job in a queue, but it failed for a reason that is
> considered specific to the queue." per
> http://arc.liv.ac.uk/SGE/howto/troubleshooting.html
>
> We also have a load sensor that checks for the presence of this filesystem,
> but the load sensor only updates every few minutes, while the filesystem
> tends to disappear for only about 60s.
>
> Regards,
> --
> Alex Chekholko [email protected]
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
