This is a trivial recommendation, but did you check the qmaster's messages log?
If so, can you find the section relevant to your job submission and send it to
this mailing list?
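
In case it helps, here is a rough sketch of how the relevant lines could be
pulled out of the log. It assumes the usual pipe-delimited messages format
(date|thread|host|level|text); job_log_lines is just a name made up for this
example. On a default install the file is typically
$SGE_ROOT/$SGE_CELL/spool/qmaster/messages, but your spool directory may
differ.

```shell
#!/bin/sh
# Sketch only: print the lines of an SGE "messages" log whose message text
# mentions a given job ID. job_log_lines is a hypothetical helper name.
# Assumption: lines look like  date time|thread|host|level|message text
job_log_lines() {
    log_file=$1
    job_id=$2
    awk -F'|' -v id="$job_id" '
        {
            # Rebuild the message text in case it contains "|" itself.
            msg = $5
            for (i = 6; i <= NF; i++) msg = msg "|" $i
            # Match the ID only as a whole number, and only in the message
            # text, so that dates such as 01/15/2019 do not match job 15.
            if (msg ~ ("(^|[^0-9])" id "([^0-9]|$)")) print
        }' "$log_file"
}

# Example call (path is the usual default, adjust to your setup):
#   job_log_lines "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" 15
```

The same function can be pointed at the execd messages file on the compute
node, which is often where prolog failures show up.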


Also, does your prolog script keep any logs that might be helpful?
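
If it does not log much yet, one option is to have the prolog write a
timestamped line before each step, so that the last line written shows where
it stopped. The following is only a sketch: trace_step and PROLOG_TRACE_DIR
are names invented for this example, and it assumes JOB_ID is set in the
prolog's environment, as it normally is for the job itself.

```shell
#!/bin/sh
# Sketch only: append a timestamped line to a per-job trace file.
# trace_step and PROLOG_TRACE_DIR are hypothetical names, not SGE settings.
# Assumption: JOB_ID is available in the prolog environment.
trace_step() {
    trace_file="${PROLOG_TRACE_DIR:-/tmp}/prolog.${JOB_ID:-unknown}.trace"
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$trace_file"
}

# Example use inside a prolog script:
#   trace_step "prolog started on $(hostname)"
#   ... set up the local disk quota ...
#   trace_step "quota configured"
```

Comparing the trace from a working qsub job with the one from a qrsh session
should show which step is not being reached.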


These logs might help diagnose where the job execution is getting stuck.  From 
there, you can dig deeper into the problem.


Cheers,

Iyad


________________________________
From: Derrick Lin <klin...@gmail.com>
Sent: January 8, 2019 7:14 PM
To: SGE Mailing List
Subject: [gridengine users] qrsh session failed to execute prolog script?

Hi guys,

I just brought up a new SGE cluster, but somehow the qrsh session does not work:

tester@login-gpu:~$ qrsh
^Cerror: error while waiting for builtin IJS connection: "got select timeout"

After I hit Enter, the session just hung there forever instead of bringing me
to a compute node. I had to press Ctrl+C to terminate it, which gave the above
error.

I noticed that SGE did send my qrsh request to a compute node, as I could tell
from qstat:

---------------------------------------------------------------------------------
short.q@zeta-4-15.local        BIP   0/1/80         0.01     lx-amd64
     15 0.55500 QRLOGIN    tester       r    01/09/2019 10:47:13     1

We have a prolog script configured globally. The script deals with local disk
quotas and writes all of its output to a per-job log file. So I went to that
compute node and checked, and found that a log file had been created but was
empty.

So my thinking so far is that my qrsh session is stuck because the prolog
script did not run to completion.

qsub jobs are working fine.

Any ideas would be appreciated.

Cheers,
Derrick
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
