Hi,

A solution for this issue has been found.

It turned out that the firewall settings on some hosts accidentally blocked 
the high ports (which SGE uses for ssh connections) in one direction. 
Debugging was a little difficult, but because the jobs appeared briefly in the 
exec host's "active_jobs" directory, enough information could finally be 
collected (by comparing the "qrsh_control_port" settings there with the denied 
ports in a host's "dmesg" report).
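
For anyone who wants to repeat that comparison, here is a minimal Python 
sketch. It rests on assumptions: that the job's file in "active_jobs" contains 
a line of the form "qrsh_control_port=<port>", and that the firewall logs 
denied packets in the iptables LOG format with a "DPT=<port>" field. The 
function names are made up for illustration and are not part of SGE.

```python
import re


def control_port(config_text):
    """Return the port from a 'qrsh_control_port=<port>' line, or None.

    Assumption: the job's file under active_jobs uses this key=value form.
    """
    match = re.search(r"^qrsh_control_port=(\d+)\s*$", config_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def denied_ports(dmesg_text):
    """Collect the destination ports (DPT=<port>) found in firewall log lines.

    Assumption: denied packets are logged in the iptables LOG format.
    """
    return {int(port) for port in re.findall(r"\bDPT=(\d+)\b", dmesg_text)}


def port_was_denied(config_text, dmesg_text):
    """True if the job's control port shows up among the denied ports."""
    port = control_port(config_text)
    return port is not None and port in denied_ports(dmesg_text)
```

Both functions take plain text, so the file contents and the "dmesg" output 
can be copied off the exec host while the job is still listed and compared 
afterwards.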

As suggested some time ago, it would be helpful if SGE's high port usage 
could be configured or restricted to a specific range, but this does not seem 
to be possible in SGE (at least not in the old open-source SGE 6.2u5).

Thanks to all who replied.

Regards,
Manfred

From: Manfred Selz
Sent: Tuesday, 15 November 2016 15:14
To: 'users@gridengine.org'
Subject: [gridengine users] issue with qrsh "waiting on socket fd 4" in SGE 6.2u5

Hi,

Similar issues were reported a long time ago, but I haven't seen a recent 
solution to this.

In one of our company's SGE 6.2u5 clusters, qrsh/qlogin jobs fail on selected 
hosts with messages like this:

$  qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
...
Your job 1756874 ("QRLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4

Your interactive job 1756874 has been successfully scheduled.
timeout (5 s) expired while waiting on socket fd 4

This goes on for some time, and the jobs can even be seen briefly via qstat. 
However, the jobs never really start; they switch to the "dr" state and are 
finally gone (after a minute or so).
The exec host's messages file has lines like this:

11/15/2016 05:59:50|  main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 signal: KILL

The main messages file has this:

11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 for deletion
11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 1756876.1
11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died through signal KILL (9)

Until a few days ago, qrsh worked on all hosts in the cluster. It then 
suddenly stopped working on most (but not all!) of them, without any 
deliberate change to the SGE config or host config (for instance, "uptime" 
confirms that the hosts have not been rebooted recently). Otherwise, the hosts 
in the cluster are all of the same type (hardware), kernel version, etc., with 
no significant difference that I have been able to identify yet.

On the same hosts, "qsub -now y" also fails.

I have verified proper sge execd operation and host identification with 
"qping", "gethostbyaddr", and "gethostbyname", and this all looks fine.

Currently I am quite puzzled. I'd appreciate any input somebody may have on 
how to debug this further or resolve it.
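
In case it helps anyone reproduce the host identification check mentioned 
above, a rough equivalent of the forward and reverse lookup can be sketched in 
Python. This uses Python's standard "socket" module rather than the SGE 
utilities named above, so it is only an approximation; the function name and 
the short-name comparison are my own simplifications.

```python
import socket


def resolves_consistently(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return True if hostname -> address -> name leads back to the same host.

    Only the short name (the part before the first dot) is compared, which is
    a simplification. The resolver functions can be replaced for testing.
    """
    try:
        address = forward(hostname)
        name, aliases, _addresses = reverse(address)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError
        return False
    short = hostname.split(".")[0].lower()
    candidates = [name] + list(aliases)
    return any(c.split(".")[0].lower() == short for c in candidates)
```

Running this on the qmaster for each exec host, and on each exec host for the 
qmaster, covers both directions of the lookup.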

Best regards,
Manfred




