I have a problem with gridengine that disappeared for a long time but has come back.
Some users' jobs fail persistently without producing any output files.

This is a sample of the messages file on a node in E state. Spooling is local, in /var/spool/sge/cn385, which is owned by the sge admin account.

02/27/2012 12:50:32|execd|cn385|I|sending job start mail to user "xxx"|mailer "/bin/mail"|"Job 28182 (run.sge) Started"
02/27/2012 12:50:32|execd|cn385|E|shepherd of job 28182.1 exited with exit status = 11
02/27/2012 12:50:32|execd|cn385|I|sending admin mail mail to user " admin "|mailer "/bin/mail"|"GE 6.1u6: Job 28182 failed"
02/27/2012 12:51:06|execd|cn385|I|sending job abortion/end mail to user "xxx"|mailer "/bin/mail"|"Job 28182 (run.sge) Aborted"

This is the corresponding error message in qmaster/messages:

02/27/2012 12:51:06|qmaster|ham4|W|job 28182.1 failed on host cn385 general before job because: 02/27/2012 12:50:32 [17813:30041]: unable to find job file "/var/spool/sge/cn385/job_scripts/28182"
02/27/2012 12:51:06|qmaster|ham4|W|rescheduling job 28182.1
02/27/2012 12:51:06|qmaster|ham4|E|queue blades.q marked QERROR as result of job 28182's failure at host cn385

It complains that it cannot find the job file, although there is sufficient space on the node's local disk.

The version is the 6.1u6 binaries on CentOS 6. 6.2u5 was tried in the past, but it had problems with array jobs crashing.

I would be grateful for any advice.

Henk
