Jeff Squyres wrote:
> On Dec 9, 2009, at 3:47 AM, Constantinos Makassikis wrote:
>> Sometimes, when running Open MPI jobs, the application hangs. Looking at
>> the output, I get the following error message:
>>
>> [ic17][[34562,1],74][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)
>>
>> I would expect Open MPI to eventually quit with an error in such situations.
>> Is the observed behaviour (i.e., hanging) the intended one?
> That does seem weird. I would think that we should abort rather than hang.
> But then again, the code is fairly hairy there -- there are many corner cases.
>> If so, what would be the reason(s) behind choosing hanging over
>> stopping?
> It *looks* like the code is supposed to retry the connection here, but perhaps
> something is not operating correctly (or perhaps it *is* trying to reconnect
> and the network is failing to reconnect for some reason...?).
I don't really know whether it is trying to reconnect. What is certain is
that the last time it happened, the destination node could indeed not be
reached (i.e., I could neither ssh to it nor did it respond to ping).
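For reference, errno 113 is EHOSTUNREACH on Linux, so the readv() in the
TCP BTL is failing in roughly the pattern below. This is only a minimal
sketch of my own to show what the message means (recv_fragment is a
made-up name, not the actual btl_tcp_frag_recv code):

    /* Illustration only: a receive path hitting "No route to host". */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <errno.h>
    #include <string.h>
    #include <stdio.h>

    static ssize_t recv_fragment(int sd, struct iovec *iov, int cnt)
    {
        ssize_t n = readv(sd, iov, cnt);
        if (n < 0) {
            if (errno == EHOSTUNREACH) {      /* errno 113 on Linux */
                fprintf(stderr, "readv failed: %s (%d)\n",
                        strerror(errno), errno);
                return -1;  /* peer unreachable: abort or retry? */
            }
            /* EINTR/EAGAIN would normally just be retried */
        }
        return n;
    }

The open question is what the caller does with that -1: your reading above
suggests it should retry the connection, whereas what I observe is a hang.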
> How often does this happen? Is it in the middle of the application run, or at the very beginning?
It did not happen very often: only after long and intensive usage of the
nodes. As for where in the application's execution it happens, I couldn't
tell. Maybe it would be a good idea to modify the source code so that I
keep track of the progress, along the lines of the sketch below.
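Something like this hypothetical helper (the names are mine), called at
key points in the application, so that when a job hangs, the last line
printed by each rank shows roughly where it got stuck:

    #include <mpi.h>
    #include <stdio.h>

    /* Print a progress marker tagged with the caller's rank. */
    static void progress_mark(const char *phase)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "[rank %d] reached: %s\n", rank, phase);
        fflush(stderr);   /* flush so the output survives a hang */
    }

    /* e.g. progress_mark("after exchange, iteration 12"); */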
> Do you have any other network issues where connections get dropped, etc.? Do
> you have any firewalls running on your cluster machines?
To my knowledge, there haven't been any other network issues.
There are no firewalls.
I don't know if the current information is sufficient to answer with
certainty. I am going to try to gather more info whenever it occurs
again. On that note, are there any options I could use in Open MPI to
collect more information? Are there any other things I should pay
attention to?
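For instance, browsing ompi_info output, I see the BTLs have verbosity
parameters; would running with something like

    mpirun --mca btl_base_verbose 100 ./my_app

produce useful diagnostics here? (The 100 level and the application name
are just placeholders; ompi_info --param btl tcp should list what is
actually available.)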
Thanks for your help,
--
Constantinos