Hi,

On 27.09.2019 at 22:21, berg...@merctech.com wrote:
> We're having a problem with submit scripts not being transferred to exec
> nodes and jobs being stuck in the [t]ransitioning state.

Did this issue start out of the blue?

> The issue is present with SoGE 8.1.6 and 8.1.9, under CentOS7.

Are these separate clusters, are you running both versions in one and the same cluster, or did you just try both versions on one cluster?

> We are using classic spooling. On the compute nodes, the spool directory
> /var/tmp/gridengine/$SGE_VER/default/spool/$HOSTNAME/
> exists, is owned by user 'sge' (running the execd), is writeable, and
> has space.

Is the execd running as sge, or is it initially started as root? It must be started as root to be able to switch to any user, but it then switches to the admin user:

$ ps -e f -o user,ruser,group,rgroup,command
…
sgeadmin root gridware root /usr/sge/bin/lx24-em64t/sge_execd
root     root root     root  \_ /bin/sh /usr/sge/cluster/tmpspace.sh
sgeadmin root gridware root  \_ sge_shepherd-311391 -bg

> There is successful communication between the qmaster and execd hosts:
>
> qping works in both directions
>
> jobs submitted as binaries (-b y) run correctly
>
> directives from the master to the execd (for example, to delete jobs)
> work
>
> If I read the qmaster debug logs correctly, it looks like the qmaster isn't
> able to send the submit script to the compute node:
>
> 1 worker001 debiting 8589934592.000000 of h_vmem on host 2115fmn001.foobar.local for 1 slots
> 2 worker001 debiting 4000000000.000000 of tmpfree on host 2115fmn001.foobar.local for 1 slots
> 3 worker001 debiting 1.000000 of jobs on queue all.q for 1 slots
> 4 worker001 debiting 1.000000 of slots on queue all.q for 1 slots
> 5 worker001 user doesn't match
> 6 worker001 user doesn't match
> 7 worker001 queue doesn't match
> 8 worker001 queue doesn't match
> 9 worker001 user doesn't match
> 10 worker001 user doesn't match
> 11 worker001 spooling job 9899430.1 <null>
> 12 worker001 Making dir "jobs/00/0989/9430/1-4096/1"
> 13 worker001 retval = 0
> 14 worker001 spooling job 9899430.1 <null>
> 15 worker001 Making dir "jobs/00/0989/9430"
> 16 worker001 retval = 0
> 17 worker001 TRIGGER JOB RESEND 9899430/1 in 300 seconds
> 18 worker001 successfully handed off job "9899430" to queue "all.q@2115fmn001.foobar.local"
> 19 worker001 NO TICKET DELIVERY
>
> We don't see corresponding log messages on the client.
>
> What mechanism is used by SGE to transfer submit scripts (something
> specific to GDI over the $SGE_EXECD_PORT, ssh, scp, something else)?

It uses its own protocol. No SSH inside the cluster is necessary.

> What are the system-level requirements for successfully sending the
> submit scripts (for example: same UID for sge across the cluster, same
> UID<->username for the user submitting the job across the cluster, etc)?

Yes.

-- Reuti

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
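
P.S. One way to check the UID requirement discussed above is to collect passwd-format lines (for example the output of `getent passwd sge <submitting-user>`) from the qmaster and from each exec host, then compare them. The following is only a minimal sketch in Python, not something Grid Engine ships: the function name `find_uid_mismatches`, the host names and the sample accounts are all made up for illustration, and how the lines are gathered from each host is left open.

```python
from collections import defaultdict


def find_uid_mismatches(passwd_by_host):
    """Report accounts whose numeric UID differs between hosts.

    passwd_by_host maps a host name to text in passwd(5) format
    (name:password:UID:GID:...), e.g. collected with `getent passwd`.
    Returns {username: {uid: [hosts]}} for inconsistent accounts only.
    """
    uids = defaultdict(lambda: defaultdict(list))
    for host, text in passwd_by_host.items():
        for line in text.splitlines():
            fields = line.strip().split(":")
            # Skip blank, commented or malformed lines.
            if len(fields) < 3 or not fields[2].isdigit():
                continue
            uids[fields[0]][int(fields[2])].append(host)
    return {
        name: {uid: sorted(hosts) for uid, hosts in by_uid.items()}
        for name, by_uid in uids.items()
        if len(by_uid) > 1  # more than one distinct UID seen for this name
    }


if __name__ == "__main__":
    # Hypothetical sample data; replace with lines gathered from real hosts.
    sample = {
        "qmaster": "sge:x:400:400::/home/sge:/bin/bash\n"
                   "alice:x:1001:1001::/home/alice:/bin/bash\n",
        "node1": "sge:x:400:400::/home/sge:/bin/bash\n"
                 "alice:x:1002:1002::/home/alice:/bin/bash\n",
    }
    for name, by_uid in find_uid_mismatches(sample).items():
        print(name, by_uid)
```

Note that this only flags accounts that exist with different UIDs on different hosts. An account that is missing entirely on one host is not reported and would need a separate check.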