Hi,
I've recently run into a problem where users' jobs get stuck in the 'r' state. It doesn't always happen, but it happens often enough to be a persistent problem. My guess is that it is I/O related (the jobs are accessing an NFS 4.1 share on a Windows 2012 file server).

I really don't know how to debug this, since I'm not getting any useful information from qstat -j <jobid> and the /var/log/* logs don't seem to give me any clues, or maybe I'm missing something. I would be very grateful if anyone has suggestions as to where I can start debugging this issue. My cluster is unusable because of this error.

Thanks,
Thomas
_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss