Hi,

I've recently run into a problem where users' jobs get stuck in the 'r' state.
It doesn't always happen, but it happens often enough to be a persistent problem.
My guess is that it is I/O related (the jobs access an NFS 4.1 share on
a Windows 2012 file server).  I really don't know how to debug this, since
I'm not getting any useful info from qstat -j <jobid> and the /var/log/* logs
don't seem to give me any clues, or maybe I'm missing something.


I would be very grateful if anyone has any suggestions as to where I can start
debugging this issue.  My cluster is unusable because of this error.


Thanks,

Thomas
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
