Hi Reuti,

Thank you for your quick response.
No, I am not aware of any firewall changes in the cluster, but to make 
sure I will double-check with our IT department.
Normally, when a firewall blocks something, I can see it in the "dmesg" 
messages, and there is nothing there related to the SGE ports we use.
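
Since "dmesg" only shows dropped packets when the firewall is configured to
log them, a direct TCP probe can complement that check. Below is a minimal
sketch, assuming Python 3 is available on the hosts. The helper name
tcp_reachable is hypothetical, and port 6445 in the usage comment is only the
common sge_execd default, not necessarily the port used in this cluster.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Connection refused, timeouts, and name-resolution failures all raise
    subclasses of OSError, so each of them is reported as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage (host taken from the report above, port is an assumption):
#   tcp_reachable("casrvodc-17", 6445)
```

As far as I understand, qrsh also opens a dynamically chosen port on the
submit host and the execution host connects back to it, so it is worth running
the same probe from an affected execution host towards a test listener on the
submit host as well.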

Regards,
Manfred


-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, 15 November 2016 15:31
To: Manfred Selz
Cc: users@gridengine.org
Subject: Re: [gridengine users] issue with qrsh "waiting on socket fd 4" in SGE 6.2u5

Hi,

> On 15.11.2016 at 15:14, Manfred Selz <manfred.s...@diasemi.com> wrote:
>
> Hi,
>
> Similar issues were reported a long time ago, but I haven’t seen a 
> recent solution to this.
>
> In one of our company’s SGE 6.2u5 clusters, qrsh/qlogin jobs fail on 
> selected hosts with messages like this:
>
> $  qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose ...
> Your job 1756874 ("QRLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired
> while waiting on socket fd 4
>
> Your interactive job 1756874 has been successfully scheduled.
> timeout (5 s) expired while waiting on socket fd 4

Did you enable any firewall in the cluster to block certain ports on the nodes?

-- Reuti


> This goes on for some time; the jobs can even be seen briefly via qstat. 
> However, the jobs never really start, switch themselves to the “dr” state, and 
> are finally gone (after a minute or so).
> The exec host’s messages file has lines like this:
>
> 11/15/2016 05:59:50|  main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1
> signal: KILL
>
> The main messages file has this:
>
> 11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job
> 1756876 for deletion
> 11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate
> job 1756876.1
> 11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host
> casrvodc-17.diasemi.com assumedly after job because: job 1756876.1
> died through signal KILL (9)
>
> Until a few days ago, qrsh worked on all hosts in the cluster. It suddenly 
> stopped working for most (but not all!) of them, without any deliberate change 
> in the SGE config or host config (for instance, “uptime” confirms that the 
> hosts have not been rebooted recently). Otherwise, the hosts in the cluster 
> are all of the same type (hardware, kernel version, etc.), with no significant 
> difference that I have been able to identify so far.
>
> On the same hosts, “qsub -now y” also fails.
>
> I have verified proper sge_execd operation and host identification with 
> “qping”, “gethostbyaddr”, and “gethostbyname”, and it all looks fine.
>
> Currently I am quite puzzled. I’d appreciate any input somebody may have on 
> how to debug this further or resolve it.
>
> Best regards,
> Manfred
>
>
>
>
> Dialog Semiconductor GmbH
> Neue Str. 95
> D-73230 Kirchheim
> Managing Directors: Dr. Jalal Bagherli, Carsten Dahl
> Chairman of the Supervisory Board: Rich Beyer
> Commercial register: Amtsgericht Stuttgart: HRB 231181
> UST-ID-Nr. DE 811121668
>
> Legal Disclaimer: This e-mail communication (and any attachment/s) is 
> confidential and contains proprietary information, some or all of which may 
> be legally privileged. It is intended solely for the use of the individual or 
> entity to which it is addressed. Access to this email by anyone else is 
> unauthorized. If you are not the intended recipient, any disclosure, copying, 
> distribution or any action taken or omitted to be taken in reliance on it, is 
> prohibited and may be unlawful.
>
>
> Please consider the environment before printing this e-mail
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



