Hi Reuti,

The jobs stay in the queue forever - and don't get processed. There are no messages in the spool directory for these jobs.

Thomas

________________________________________
From: Reuti <re...@staff.uni-marburg.de>
Sent: Tuesday, December 20, 2016 4:25 PM
To: Thomas Beaudry
Cc: sge-discuss@liv.ac.uk
Subject: Re: [SGE-discuss] jobs stuck in 'r' state

Hi,

On 20.12.2016 at 22:20, Thomas Beaudry wrote:

> Hi,
>
> I've run into a problem recently where users' jobs are stuck in the 'r' state.
> It doesn't always happen, but it's happening often enough to be a persistent
> error. My guess is that it is I/O related (the jobs are accessing an NFS 4.1
> share on a Windows 2012 file server). I really don't know how to debug
> this, since I'm not getting any useful info from qstat -j <jobid> and the
> /var/log/* logs don't seem to give me any clues - or maybe I'm missing
> something.
>
> I would be very grateful if anyone has any suggestions as to where I can
> start to debug this issue. My cluster is unusable because of this error.

You mean the job exited already and is not removed from `qstat`? Usually there is a delay of some minutes for parallel jobs.

What does the messages file in the spool directory of the nodes say? Unless the spool directory is local to the node, it's in $SGE_ROOT/default/spool/nodeXY/messages

-- Reuti

> Thanks,
> Thomas
> _______________________________________________
> SGE-discuss mailing list
> SGE-discuss@liv.ac.uk
> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
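
Reuti's suggestion to look at the execution host's messages file could be scripted roughly as below. This is only a minimal sketch, assuming the shared-spool layout named in the thread ($SGE_ROOT/default/spool/<host>/messages). The helper names `messages_path` and `lines_mentioning`, the fallback root `/opt/sge`, the host `node01`, and the job id `12345` are all hypothetical placeholders, not part of SGE. The filter is a plain substring match, so it may also pick up unrelated lines that happen to contain the same digits.

```python
import os
from pathlib import Path


def messages_path(host, sge_root=None, cell="default"):
    """Build the shared-spool messages path for one execution host.

    Assumes the layout $SGE_ROOT/<cell>/spool/<host>/messages from the thread;
    a node with a local spool directory keeps the file elsewhere.
    """
    # "/opt/sge" is a hypothetical fallback used only when SGE_ROOT is unset.
    root = sge_root or os.environ.get("SGE_ROOT", "/opt/sge")
    return Path(root) / cell / "spool" / host / "messages"


def lines_mentioning(path, needle, limit=20):
    """Return up to `limit` of the most recent lines in `path` containing `needle`."""
    path = Path(path)
    if not path.exists():
        # Missing file: nothing to report (wrong host name or local spooling).
        return []
    with path.open(errors="replace") as fh:
        hits = [line.rstrip("\n") for line in fh if needle in line]
    return hits[-limit:]


if __name__ == "__main__":
    # Placeholder host and job id; substitute the node and job that are stuck.
    for line in lines_mentioning(messages_path("node01"), "12345"):
        print(line)
```

An empty result is consistent with what Thomas reports (no messages for the affected jobs), but it can also mean the node spools locally and the file lives at a different path.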