Re: [OMPI users] btl_tcp_endpoint errors

Jeff Squyres Wed, 4 Apr 2007 15:28:42 -0400

On Apr 3, 2007, at 1:22 PM, Heywood, Todd wrote:

ssh: connect to host blade45 port 22: No route to host

[blade1:05832] ERROR: A daemon on node blade45 failed to start as expected.

[blade1:05832] ERROR: There may be more information available from
[blade1:05832] ERROR: the remote shell (see above).
[blade1:05832] ERROR: The daemon exited unexpectedly with status 1.
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
../../../../orte/mca/pls/base/pls_base_orted_cmds.c at line 188
[blade1:05832] [0,0,0] ORTE_ERROR_LOG: Timeout in file
../../../../../orte/mca/pls/rsh/pls_rsh_module.c at line 1187

I can understand this arising from an ssh bottleneck, with a timeout. So, a question to the OMPI folks: could the "no route to host" (113) error in btl_tcp_endpoint.c:572 also result from a timeout?

I think it *could*, but it's really an OS-level question. OMPI is simply reporting what errno is giving us back from a failed TCP connect() API call.
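To illustrate what "reporting errno from a failed connect()" means, here is a minimal Python sketch. It is not OMPI code: the function names try_connect and describe_connect_error are made up for this example, and the address in the usage note is a placeholder. On Linux, errno 113 is EHOSTUNREACH ("No route to host"); the numeric value can differ on other platforms.

```python
import errno
import os
import socket


def describe_connect_error(err: int) -> str:
    """Format an errno value the way a caller might log a failed connect()."""
    # errno.errorcode maps the number back to its symbolic name.
    name = errno.errorcode.get(err, "UNKNOWN")
    return f"connect() failed: {os.strerror(err)} ({name}, errno={err})"


def try_connect(ip: str, port: int, timeout_s: float = 5.0) -> int:
    """Attempt a TCP connect to a numeric IPv4 address.

    Returns 0 on success, otherwise the error code from the OS.
    The application does not decide which code comes back; the
    kernel's network stack does.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex() returns the error code instead of raising
        # for errors reported by the underlying connect() call.
        return sock.connect_ex((ip, port))
```

For example, `print(describe_connect_error(try_connect("192.0.2.1", 22)))` would print whatever error the OS reports for that placeholder address. Whether a given network condition surfaces as EHOSTUNREACH, ETIMEDOUT, or something else is up to the OS.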

The timeout shown in the error message above is really an ORTE timeout, meaning that we waited for a daemon to start that didn't, so we timed out and gave up. It's on the "to do" list to recognize that an ssh failed (or that any of the other starters failed -- SLURM/srun failures behave similarly to ssh failures right now) faster than a timeout. That probably won't happen until at least the 1.3 timeframe, however.
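The difference between "wait for the full timeout" and "notice the failed launcher right away" can be sketched generically as follows. This is only an illustration of the idea, not how ORTE does or will implement it; wait_for_launcher and its parameters are hypothetical names chosen for this example.

```python
import subprocess
import sys
import time


def wait_for_launcher(cmd, timeout_s=30.0, poll_s=0.05):
    """Start cmd and watch it until it exits or the timeout expires.

    Returns ("exited", status) as soon as the process exits, so a
    launcher that fails immediately is reported immediately, or
    ("timeout", None) if it is still running at the deadline.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = proc.poll()  # None while the process is still running
        if status is not None:
            return ("exited", status)
        time.sleep(poll_s)
    # Deadline reached: give up on the launcher and clean it up.
    proc.kill()
    proc.wait()
    return ("timeout", None)


if __name__ == "__main__":
    # Stand-in for a launcher that fails right away with status 1.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(wait_for_launcher(failing, timeout_s=10.0))
```

Polling the launcher's exit status is just one simple way to detect an early failure; a real implementation would also need to distinguish a launcher that exits normally after the daemon has started.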


--
Jeff Squyres
Cisco Systems
