Hi Reuti,

The jobs stay in the queue forever and never get processed.  There are no 
messages in the spool directory for these jobs.

Thomas
________________________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Tuesday, December 20, 2016 4:25 PM
To: Thomas Beaudry
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Hi,

On 20.12.2016 at 22:20, Thomas Beaudry wrote:

> Hi,
>
> I've run into a problem recently where users' jobs are stuck in the 'r' state.
> It doesn't always happen, but it's happening often enough to be a persistent
> error. My guess is that it is I/O related (the jobs are accessing an NFS 4.1
> share off of a Windows 2012 file server).  I really don't know how to debug
> this since I'm not getting any useful info from qstat -j <jobid> and the
> /var/log/* logs don't seem to give me any clues - or maybe I'm missing
> something.
>
> I would be very grateful if anyone has any suggestions as to where I can
> start to debug this issue.  My cluster is unusable because of this error.

You mean the job exited already and is not removed from `qstat`? Usually there 
is a delay of some minutes for parallel jobs.

What does the messages file in the spool directory of the nodes say? Unless 
the spool directory is local, it's in $SGE_ROOT/default/spool/nodeXY/messages
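
To check this for a single job, something like the following could be used. 
It is only a sketch: the helper names (messages_path, lines_mentioning) and 
the /opt/sge fallback are made up for illustration, and it assumes the 
default cell and non-local spooling as in the path above. With local spooling 
the directory would instead be whatever execd_spool_dir is set to in the 
cluster configuration (qconf -sconf).

```python
import os
import re
from pathlib import Path


def messages_path(node, sge_root=None, cell="default"):
    """Build <sge_root>/<cell>/spool/<node>/messages (non-local spooling)."""
    # "/opt/sge" is only a placeholder fallback, not a known default.
    root = sge_root or os.environ.get("SGE_ROOT", "/opt/sge")
    return Path(root) / cell / "spool" / node / "messages"


def lines_mentioning(path, job_id, limit=50):
    """Return up to `limit` of the most recent lines that contain job_id."""
    # Match the id as a whole number so that 123 does not also match 1234.
    pattern = re.compile(r"(?<!\d)" + re.escape(str(job_id)) + r"(?!\d)")
    text = Path(path).read_text(errors="replace")
    hits = [line for line in text.splitlines() if pattern.search(line)]
    return hits[-limit:] if limit > 0 else []
```

For example, lines_mentioning(messages_path("nodeXY"), 12345) would return the 
most recent lines for a hypothetical job 12345 on nodeXY. An empty list would 
mean the execd on that node logged nothing about the job.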

-- Reuti


> Thanks,
> Thomas
> _______________________________________________
> SGE-discuss mailing list
> SGE-discuss@liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss

