My apologies for the tardy response; I've been stuck in meetings. I'm glad to
hear that you are making progress tracking this down. FWIW: the error
message you received indicates that the socket connection to that node was
unexpectedly reset while the application was running. So it sounds like there
is something flaky in the Ethernet.

One thing I've found that can cause that problem is two nodes having the
same IP address. This causes periodic random resets of the connections. So
you might want to just do an IP scan to ensure that all the addresses are
unique.
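
In case it helps, here is a rough Python sketch of the duplicate check itself.
It assumes you have already collected (IP, MAC) pairs from whatever scan you
run (for example from ARP replies); the function name and the sample addresses
are made up for illustration and are not part of any tool. An IP that shows up
with more than one MAC address is the suspect.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs. How they are
    gathered (ARP scan, switch tables, ...) is outside this sketch.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so "AA:BB" and "aa:bb" count as the same adapter.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data: 10.0.0.1 answers from two different adapters.
    sample = [
        ("10.0.0.1", "aa:bb:cc:00:00:01"),
        ("10.0.0.2", "aa:bb:cc:00:00:02"),
        ("10.0.0.1", "aa:bb:cc:00:00:03"),
    ]
    for ip, macs in find_duplicate_ips(sample).items():
        print(f"{ip} is claimed by: {', '.join(macs)}")
```

An empty result only means no conflict was visible in the data you fed in, so
it is worth scanning while all nodes are powered on.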

Let us know if we can be of help.
Ralph


On Tue, Apr 12, 2016 at 7:22 AM, Stefan Friedel <
stefan.frie...@iwr.uni-heidelberg.de> wrote:

> On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
>
>> -thanks for your support!- nope, no core, just the "orte has lost"...
>>
> Dear list - the problem is _not_ related to openmpi. I compiled mvapich2
> and I get communication errors, too. Probably this is a hardware problem.
> Sorry for the noise - I will report back on the real reason for the
> "orte has lost..." message.
>
> MfG/Sincerely
> Stefan Friedel
> --
> IWR * 4.317 * INF205 * 69120 Heidelberg
> T +49 6221 5414404 * F +49 6221 5414427
> stefan.frie...@iwr.uni-heidelberg.de
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/04/28927.php
>
