In the message dated: Fri, 27 Sep 2019 23:32:43 +0200,
The pithy ruminations from Reuti on
[Re: [gridengine users] jobs stuck in transitioning state] were:
=> Hi,
=>
=> On 27.09.2019 at 22:21, berg...@merctech.com wrote:
=>
=> > We're having a problem with submit scripts not being transferred to exec
=> > nodes and jobs being stuck in the [t]ransitioning state.
=>
=> Did this issue start out of the blue?
Not spontaneously. We had a working 8.1.6 cluster. We shut everything down
for a storage update, a switch from NIS to LDAP, and a change of DNS server,
and upgraded SGE to 8.1.9 at the same time. We brought everything up, worked
out all sorts of little things, then began having the problem with jobs
getting stuck. We reverted to 8.1.6, but the problem still exists.

=>
=> Is the execd running as sge or initially as root? It must be run as root
=> to be able to switch to any user but switches to the admin user:

The execd and qmaster both start as root & then become the effective user
'sge'.

=>
=> > There is successful communication between the qmaster and execd hosts:
=> >
=> >     qping works in both directions
=> >
=> >     jobs submitted as binaries (-b y) run correctly
=> >
=> >     directives from the master to the execd (for example, to delete
=> >     jobs) work
=> >
=> > If I read the qmaster debug logs correctly, it looks like the qmaster
=> > isn't able to send the submit script to the compute node:
=> >
=> >     11 worker001 spooling job 9899430.1 <null>
=> >     12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
=> >     13 worker001 retval = 0
=> >     14 worker001 spooling job 9899430.1 <null>
=> >     15 worker001 Making dir "jobs/00/0989/9430"
=> >     16 worker001 retval = 0
=> >     17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
=> >     18 worker001 successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
=> >     19 worker001 NO TICKET DELIVERY
=> >
=> >
=> > We don't see corresponding log messages on the client.
=> >
=> >
=> > What mechanism is used by SGE to transfer submit scripts (something
=> > specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?
=>
=> It uses its own protocol. No SSH inside the cluster is necessary.

That's what I thought...and there's no mechanism to change the file
transfer method.
=>
=>
=> > What are the system-level requirements for successfully sending the
=> > submit scripts (for example: same UID for sge across the cluster, same
=> > UID<->username for the user submitting the job across the cluster, etc)?

Are there any other requirements you can think of?

Thanks,

Mark

=>
=> Yes.
=>
=> -- Reuti
=> _______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
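
The requirement raised above (the same UID for sge and for the submitting user on every host) can be checked mechanically after a NIS to LDAP migration. What follows is only a minimal sketch, not something taken from the thread: it assumes the passwd-format lines have already been collected from each host (for example from `getent passwd <user>`), and the host names, user names and UIDs in `sample` are invented for illustration. Matching UIDs may well not be the only requirement.

```python
from collections import defaultdict


def parse_passwd_line(line):
    """Return (username, uid) from one passwd-format line."""
    fields = line.strip().split(":")
    if len(fields) < 3:
        raise ValueError(f"not a passwd entry: {line!r}")
    return fields[0], int(fields[2])


def find_uid_mismatches(entries_by_host):
    """Return {user: {uid: [hosts]}} for users whose UID differs between hosts."""
    seen = defaultdict(lambda: defaultdict(list))
    for host, lines in entries_by_host.items():
        for line in lines:
            user, uid = parse_passwd_line(line)
            seen[user][uid].append(host)
    # Keep only users that resolve to more than one distinct UID.
    return {
        user: {uid: sorted(hosts) for uid, hosts in uids.items()}
        for user, uids in seen.items()
        if len(uids) > 1
    }


# Hypothetical sample data: hosts, users and UIDs are made up for this sketch.
sample = {
    "qmaster": [
        "sge:x:490:490:Grid Engine admin:/opt/sge:/bin/bash",
        "alice:x:1001:1001::/home/alice:/bin/bash",
    ],
    "node001": [
        "sge:x:490:490:Grid Engine admin:/opt/sge:/bin/bash",
        "alice:x:2001:2001::/home/alice:/bin/bash",
    ],
}

mismatches = find_uid_mismatches(sample)
for user, uids in mismatches.items():
    print(user, uids)
```

In this made-up data only alice is reported, because sge resolves to the same UID on both hosts. An empty result would rule out only this one class of problem.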