Hi,

> On 15.11.2016 at 15:14, Manfred Selz <manfred.s...@diasemi.com> wrote:
> 
> Hi,
>  
> similar issues have been reported a long time ago, but I haven’t seen a 
> recent solution to this.
>  
> In one of our company’s SGE 6.2.u5 clusters, qrsh/qlogin jobs fail on 
> selected hosts with messages like this:
>  
> $  qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
> ...
> Your job 1756874 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while 
> waiting on socket fd 4
>  
> Your interactive job 1756874 has been successfully scheduled.
> timeout (5 s) expired while waiting on socket fd 4  

Did you enable any firewall in the cluster to block certain ports on the nodes?
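
One way to test for this is sketched below. It rests on an assumption that the original report does not confirm: that the qrsh client listens on a port on the submit host and the exec host connects back to it, so a blocked port would show up as the "waiting on socket" timeouts. The helper names `open_listener` and `can_connect` are made up for this illustration and are not part of SGE.

```python
import socket


def open_listener(bind_addr: str = "0.0.0.0") -> socket.socket:
    """Listen on an OS-chosen ephemeral TCP port (hypothetical stand-in
    for the socket a qrsh client would wait on)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

To use it, call `open_listener()` on the submit host and read the port from `getsockname()[1]`, then call `can_connect(submit_host, port)` on a failing exec host such as casrvodc-17 and on a working one. If only the failing host gets `False`, packet filtering between the two hosts is a likely cause.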

-- Reuti


> This goes on for some time, and the jobs can even be seen briefly via qstat. 
> However, the jobs never really start; they switch to the “dr” state and are 
> finally gone (after a minute or so).
> The exec host’s messages file has lines like this:
>  
> 11/15/2016 05:59:50|  main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 
> signal: KILL
>  
> The main messages file has this:
>  
> 11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 
> for deletion
> 11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 
> 1756876.1
> 11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host 
> casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died 
> through signal KILL (9)
>  
> Until a few days ago, qrsh worked on all hosts in the cluster. It suddenly 
> stopped working on most (but not all!) of them, without any deliberate 
> change in the SGE or host configuration (for instance, “uptime” confirms 
> that the hosts have not been rebooted recently). Otherwise, the hosts in the 
> cluster are all of the same type (hardware), kernel version, etc., with no 
> significant difference that I have been able to identify so far.
>  
> On the same hosts, “qsub -now y” also fails.
>  
> I have verified proper sge_execd operation and host identification with 
> “qping”, “gethostbyaddr”, and “gethostbyname”, and this all looks fine.
>  
> Currently I am quite puzzled. I’d appreciate any input anyone may have on 
> how to debug or resolve this further.
>  
> Best regards,
> Manfred
>  
> 
> 
> 
> Dialog Semiconductor GmbH
> Neue Str. 95
> D-73230 Kirchheim
> Managing Directors: Dr. Jalal Bagherli, Carsten Dahl
> Chairman of the Supervisory Board: Rich Beyer
> Commercial register: Amtsgericht Stuttgart: HRB 231181
> UST-ID-Nr. DE 811121668
> 
>  
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

