Hi,

> On 15.11.2016 at 15:14, Manfred Selz <manfred.s...@diasemi.com> wrote:
>
> Hi,
>
> similar issues have been reported a long time ago, but I haven’t seen a
> recent solution to this.
>
> In one of our company’s SGE 6.2.u5 clusters, qrsh/qlogin jobs fail on
> selected hosts with messages like this:
>
> $ qrsh -l rhel=6,login=1,hostname=casrvodc-17 -verbose
> ...
> Your job 1756874 ("QRLOGIN") has been submitted
>
> waiting for interactive job to be scheduled ...timeout (3 s) expired while waiting on socket fd 4
>
> Your interactive job 1756874 has been successfully scheduled.
> timeout (5 s) expired while waiting on socket fd 4

Did you enable any firewall in the cluster to block certain ports on the nodes?

-- Reuti

> This goes on for some time; the jobs can even be seen briefly via qstat.
> However, the jobs never really kick in, switch themselves to the “dr” state,
> and are finally gone (after a minute or so).
> The exec host’s messages file has lines like this:
>
> 11/15/2016 05:59:50| main|casrvodc-17|I|SIGNAL jid: 1756876 jatask: 1 signal: KILL
>
> The main messages file has this:
>
> 11/15/2016 05:59:50|worker|casrvodc-01|I|mselz has registered the job 1756876 for deletion
> 11/15/2016 05:59:51|worker|casrvodc-01|I|removing trigger to terminate job 1756876.1
> 11/15/2016 05:59:51|worker|casrvodc-01|W|job 1756876.1 failed on host casrvodc-17.diasemi.com assumedly after job because: job 1756876.1 died through signal KILL (9)
>
> Until a few days ago, qrsh used to work on all hosts in the cluster, and this
> suddenly stopped for most (but not all!) of them, without a deliberate change
> in SGE config or host config (for instance, “uptime” confirms that the hosts
> have not been recently rebooted). Otherwise, the hosts in the cluster are all
> of the same type (hardware), kernel version, etc., with no significant
> difference that I have been able to identify yet.
>
> For the same hosts, “qsub -now y” also fails.
>
> I have verified proper sge execd operation and host identification with
> “qping”, “gethostbyaddr”, and “gethostbyname”, and this all looks fine.
>
> Currently I am quite puzzled. I’d appreciate any input somebody may have on
> how to further debug or resolve this.
>
> Best regards,
> Manfred
>
> Dialog Semiconductor GmbH
> Neue Str. 95
> D-73230 Kirchheim
> Managing Directors: Dr. Jalal Bagherli, Carsten Dahl
> Chairman of the Supervisory Board: Rich Beyer
> Commercial register: Amtsgericht Stuttgart: HRB 231181
> UST-ID-Nr. DE 811121668

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
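
One way to follow up on the firewall question is to test plain TCP reachability between the hosts involved. The following is only a minimal sketch, not part of SGE: the helper names `can_connect` and `check_ports` are made up for illustration, and the port you pass in is an assumption that has to match your own setup (for example the value of `$SGE_EXECD_PORT` or the `sge_execd` entry in `/etc/services`).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests reachability and
    knows nothing about the SGE protocol itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port in `ports` to whether it accepted a TCP connection."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, `check_ports("casrvodc-17", [6445])` run from the submit host would report whether that port is reachable on the failing node, assuming 6445 is the execd port in this cluster. A port dropped by a firewall typically shows up as a timeout rather than an immediate refusal. The check covers one direction only, so it may be worth running it from the exec host towards the submit host as well.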