We're having a problem with submit scripts not being transferred to exec nodes and jobs being stuck in the [t]ransitioning state.
The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7. We are using classic spooling.

On the compute nodes, the spool directory /var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/ exists, is owned by user 'sge' (running the execd), is writeable, and has space.

There is successful communication between the qmaster and execd hosts:

  - qping works in both directions
  - jobs submitted as binaries (-b y) run correctly
  - directives from the master to the execd (for example, to delete jobs) work

If I read the qmaster debug logs correctly, it looks like the qmaster isn't able to send the submit script to the compute node:

     1 worker001 debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
     2 worker001 debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
     3 worker001 debiting 1.000000 of jobs on queue all.q for 1 slots
     4 worker001 debiting 1.000000 of slots on queue all.q for 1 slots
     5 worker001 user doesn't match
     6 worker001 user doesn't match
     7 worker001 queue doesn't match
     8 worker001 queue doesn't match
     9 worker001 user doesn't match
    10 worker001 user doesn't match
    11 worker001 spooling job 9899430.1 <null>
    12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
    13 worker001 retval = 0
    14 worker001 spooling job 9899430.1 <null>
    15 worker001 Making dir "jobs/00/0989/9430"
    16 worker001 retval = 0
    17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
    18 worker001 successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
    19 worker001 NO TICKET DELIVERY

We don't see corresponding log messages on the client.

What mechanism is used by SGE to transfer submit scripts (something specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

What are the system-level requirements for successfully sending the submit scripts (for example: same UID for sge across the cluster, same UID<->username mapping for the user submitting the job across the cluster, etc.)?

Any troubleshooting suggestions?
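The spool directory checks described above (exists, owned by 'sge', writeable, has space) could be scripted roughly as follows so they can be repeated on every exec host. This is only a sketch: the function name check_spool_dir, the default owner, and the free-space threshold are illustrative assumptions and not part of SGE.

```python
import os
import pwd
import shutil


def check_spool_dir(path, owner="sge", min_free_bytes=1024 * 1024):
    """Return a list of problems found with an execd spool directory.

    An empty list means every check passed. The owner name and the
    free-space threshold are illustrative defaults, not SGE settings.
    """
    if not os.path.isdir(path):
        return ["%s does not exist or is not a directory" % path]

    problems = []

    # Resolve the owning UID to a username on this host.
    st = os.stat(path)
    try:
        actual_owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The UID has no passwd entry on this host; report the raw number.
        actual_owner = str(st.st_uid)
    if actual_owner != owner:
        problems.append("%s is owned by %s, expected %s" % (path, actual_owner, owner))

    # os.access tests the real UID of the calling process, so this is
    # only meaningful when the script is run as the execd user.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append("%s is not writeable by the current user" % path)

    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append("%s has only %d bytes free" % (path, free))

    return problems


if __name__ == "__main__":
    import sys

    for problem in check_spool_dir(sys.argv[1]):
        print(problem)
```

Run as the 'sge' user with the expanded spool path as the only argument; no output means all four checks passed on that host.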
Thanks,
Mark