[gridengine users] jobs stuck in transitioning state

bergman Fri, 27 Sep 2019 13:24:13 -0700

We're having a problem with submit scripts not being transferred to exec
nodes and jobs being stuck in the [t]ransitioning state.


The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS 7.

We are using classic spooling. On the compute nodes, the spool directory
        /var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
exists, is owned by user 'sge' (running the execd), is writeable, and
has space.
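
For anyone wanting to repeat those spool directory checks, here is a
rough sketch of how they could be scripted. The function name
check_spool_dir is made up for illustration and is not part of Grid
Engine; it assumes GNU stat (as on CentOS 7) and should be run as the
user that runs the execd, since the writability test applies to the
calling user.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): checks that a spool
# directory exists, has the expected owner, is writable by the calling
# user, and reports the free space on its filesystem.
check_spool_dir() {
    dir=$1      # spool directory to check
    owner=$2    # user expected to own it, e.g. sge

    [ -d "$dir" ] || { echo "missing: $dir"; return 1; }

    # stat -c is GNU coreutils syntax; %U prints the owning user name
    actual=$(stat -c %U "$dir")
    [ "$actual" = "$owner" ] || {
        echo "owner is $actual, expected $owner"
        return 1
    }

    # -w tests writability for the user running this function
    [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }

    # Available space in KB on the filesystem holding the directory
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    echo "ok: $dir owner=$actual free_kb=$free_kb"
}

# Example, run as the execd user on a compute node:
# check_spool_dir "/var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME" sge
```

A non-zero return status means one of the conditions failed, and the
message says which one.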


There is successful communication between the qmaster and execd hosts:

        qping works in both directions

        jobs submitted as binaries (-b y) run correctly

        directives from the master to the execd (for example, to delete
        jobs) work

If I read the qmaster debug logs correctly, it looks like the qmaster
isn't able to send the submit script to the compute node:

     1      worker001     debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
     2      worker001     debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
     3      worker001     debiting 1.000000 of jobs on queue all.q for 1 slots
     4      worker001     debiting 1.000000 of slots on queue all.q for 1 slots
     5      worker001     user doesn't match
     6      worker001     user doesn't match
     7      worker001     queue doesn't match
     8      worker001     queue doesn't match
     9      worker001     user doesn't match
    10      worker001     user doesn't match
    11      worker001     spooling job 9899430.1 <null>
    12      worker001     Making dir "jobs/00/0989/9430/1-4096/1"
    13      worker001     retval = 0
    14      worker001     spooling job 9899430.1 <null>
    15      worker001     Making dir "jobs/00/0989/9430"
    16      worker001     retval = 0
    17      worker001     TRIGGER JOB RESEND 9899430/1 in 300 seconds
    18      worker001     successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
    19      worker001     NO TICKET DELIVERY


We don't see corresponding log messages on the client.


What mechanism is used by SGE to transfer submit scripts (something
specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

What are the system-level requirements for successfully sending the
submit scripts (for example: same UID for sge across the cluster, same
UID<->username for the user submitting the job across the cluster, etc.)?

Any troubleshooting suggestions?

Thanks,

Mark

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
