2010/11/11 Jeff Squyres <jsquy...@cisco.com>:
> On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
>
>> No, unfortunately specification of interfaces is a little more 
>> complicated...  eth0/1/2 is not common for both machines.
>
> Can you define "common"?  Do you mean that eth0 on one machine is on a 
> different network than eth0 on the other machine?
>
> Is there any way that you can make them the same?  It would certainly make 
> things easier.

Yes, they are on different networks and unfortunately we are not
allowed to play with this.

>
>> I've tried to play with (oob/btl)_tcp_if_include, but actually... I don't 
>> know exactly how.
>
> See my other mail:
>
>    http://www.open-mpi.org/community/lists/users/2010/11/14737.php
>
>> Anyway, do you have any ideas how to further debug the communication problem?
>
> The connect() is not getting through somehow.  Sadly, we don't have enough 
> debug messages to show exactly what is going wrong when these kinds of things 
> happen; I have a half-finished branch that has much better debug/error 
> messages, but I've never had the time to finish it (indeed, I think there's a 
> bug in that development branch right now, otherwise I'd recommend giving it a 
> whirl).  :-\

Analyzing the strace output of both processes shows that on both sides
the call to 'poll' after connect/accept succeeds. As I understand it,
they even exchange some information, which is always 8 bytes, such as
D\227\0\1\0\0\0\0. One of them sends this information and the other
receives it. But after receiving it, the receiver does:

----
recv(8, "\5g\0\1\0\0\0\0", 8, 0)        = 8
fcntl64(8, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
getpeername(8, {sa_family=AF_INET, sin_port=htons(57885), sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
close(8)
----

In a working scenario (on other machines), after these bytes are
received they are sent back, and then the proper communication proceeds
(my 'hello' message is sent).

The above address 10.0.0.2 is eth2 on the host machine, which indeed
should be used in this communication.
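
To compare the 8-byte payloads from different traces, it may help to
convert the strace escapes back into raw bytes. The sketch below does
that and then splits each payload into two 32-bit fields in network byte
order. That split is only an assumption about the layout; nothing in
this thread confirms what the fields mean. The helper names are made up
for illustration.

```python
import struct


def strace_bytes(text):
    """Turn a strace-style escaped string (octal escapes) into raw bytes."""
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def decode_pair(text):
    """ASSUMPTION: the payload is two big-endian unsigned 32-bit integers."""
    return struct.unpack("!II", strace_bytes(text))


# The two payloads quoted in this message, copied as strace printed them.
for payload in (r"\5g\0\1\0\0\0\0", r"D\227\0\1\0\0\0\0"):
    first, second = decode_pair(payload)
    raw = strace_bytes(payload)
    print(f"{raw.hex()}  first=0x{first:08x}  second={second}")
```

Read this way, the two payloads differ only in their first two bytes.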

While experimenting with the network interfaces, it turned out that when
we bring down one of the aliases (eth2:0), it starts working. How can we
make mpirun not use this alias while it is up? We tried setting
(oob/btl)_tcp_if_exclude to eth2:0, but it doesn't seem to help.
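
For reference, here is a minimal sketch of how such an exclusion list is
usually assembled into an mpirun command line. The parameter names
btl_tcp_if_exclude and oob_tcp_if_exclude are the ones discussed in this
thread; the process count and the ./hello program are placeholders.
Whether an alias name such as eth2:0 is matched at all is exactly the
open question, so this shows only the form of the invocation, not a fix.

```python
import shlex


def mpirun_command(excluded, program, nprocs):
    """Build an mpirun argv that excludes the given interfaces for TCP."""
    exclude = ",".join(excluded)
    return [
        "mpirun",
        "--mca", "btl_tcp_if_exclude", exclude,
        "--mca", "oob_tcp_if_exclude", exclude,
        "-np", str(nprocs),
        program,  # placeholder program name
    ]


# "lo" is listed explicitly because a user-supplied exclude list replaces
# the default one, which normally covers the loopback interface.
cmd = mpirun_command(["lo", "eth2:0"], "./hello", 2)
print(shlex.join(cmd))
```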

Regards,
Grzegorz


>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
