We just discovered this ticket, which might describe the same problem that we have:
https://svn.open-mpi.org/trac/ompi/ticket/1505

It seems unresolved... do you have a workaround for it? I've seen the
"-mca opal_net_private_ipv4" parameter, but I don't know exactly how to
use it; my experiments with it had no visible effect.

I'll be very grateful for your help,
Krzysztof

2010/11/17 Grzegorz Maj <ma...@wp.pl>

> 2010/11/11 Jeff Squyres <jsquy...@cisco.com>:
> > On Nov 11, 2010, at 3:23 PM, Krzysztof Zarzycki wrote:
> >
> >> No, unfortunately specification of interfaces is a little more
> >> complicated... eth0/1/2 is not common for both machines.
> >
> > Can you define "common"? Do you mean that eth0 on one machine is on a
> > different network than eth0 on the other machine?
> >
> > Is there any way that you can make them the same? It would certainly
> > make things easier.
>
> Yes, they are on different networks and unfortunately we are not
> allowed to play with this.
>
> >
> >> I've tried to play with (oob/btl)_tcp_if_include, but actually... I
> >> don't know exactly how.
> >
> > See my other mail:
> >
> > http://www.open-mpi.org/community/lists/users/2010/11/14737.php
> >
> >> Anyway, do you have any ideas how to further debug the communication
> >> problem?
> >
> > The connect() is not getting through somehow. Sadly, we don't have
> > enough debug messages to show exactly what is going wrong when these
> > kinds of things happen; I have a half-finished branch that has much
> > better debug/error messages, but I've never had the time to finish it
> > (indeed, I think there's a bug in that development branch right now,
> > otherwise I'd recommend giving it a whirl). :-\
>
> Analyzing the strace of both processes shows that on both sides the
> call to 'poll' after connect/accept succeeds. As I understand it, they
> even exchange some information, which is always 8 bytes, like
> D\227\0\1\0\0\0\0. One of them sends this information and the other
> receives it.
> But after receiving, it does:
>
> ----
> recv(8, "\5g\0\1\0\0\0\0", 8, 0) = 8
> fcntl64(8, F_GETFL) = 0x2 (flags O_RDWR)
> fcntl64(8, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> getpeername(8, {sa_family=AF_INET, sin_port=htons(57885),
> sin_addr=inet_addr("10.0.0.2")}, [16]) = 0
> close(8)
> ----
>
> In a working scenario (on other machines), after receiving, these
> bytes are resent and then the proper communication proceeds (my
> 'hello' message is sent).
>
> The above address 10.0.0.2 is eth2 on the host machine, which indeed
> should be used in this communication.
>
> While playing with network interfaces it turned out that when we bring
> down one of the aliases (eth2:0), it starts working. How should we
> force mpirun not to use this alias when it's up? We tried
> (oob/btl)_tcp_if_exclude with eth2:0 specified, but it doesn't
> seem to help.
>
> Regards,
> Grzegorz
>
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
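
P.S. In case it helps anyone reproduce the symptom outside of Open MPI, here is a minimal Python sketch of the syscall pattern from the strace quoted above. It is not Open MPI code. The names (run_handshake, acceptor, recv_exact), the use of the loopback address, and the assumption that it is the receiving side that resends the 8 bytes in the working case are all ours. With echo=False the acceptor receives the 8 bytes, calls getpeername() and closes, as in the failing trace; with echo=True it sends the bytes back first.

```python
import socket
import threading

# The 8 bytes seen in the failing strace: "\5g\0\1\0\0\0\0"
HANDSHAKE = b"\x05g\x00\x01\x00\x00\x00\x00"


def recv_exact(sock, n):
    """Read up to n bytes, stopping early only if the peer closes."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:  # peer closed the connection
            break
        buf += chunk
    return buf


def acceptor(server, echo, result):
    """Accept one connection and mimic the traced receiver."""
    conn, _ = server.accept()
    with conn:
        result["received"] = recv_exact(conn, len(HANDSHAKE))
        result["peer"] = conn.getpeername()
        if echo:
            # Assumed behaviour of the working scenario: resend the bytes.
            conn.sendall(result["received"])
    # Leaving the "with" block closes the socket, like close(8) in the trace.


def run_handshake(echo):
    """Return (bytes seen by the acceptor, bytes the connector got back)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        server.bind(("127.0.0.1", 0))  # loopback, ephemeral port
        server.listen(1)
        result = {}
        worker = threading.Thread(target=acceptor, args=(server, echo, result))
        worker.start()
        with socket.create_connection(server.getsockname(), timeout=5) as client:
            client.sendall(HANDSHAKE)
            reply = recv_exact(client, len(HANDSHAKE))
        worker.join()
    finally:
        server.close()
    return result["received"], reply


if __name__ == "__main__":
    print("failing pattern:", run_handshake(echo=False))
    print("working pattern:", run_handshake(echo=True))
```

In the failing pattern the connecting side gets an empty reply (end of stream) instead of its 8 bytes, which is a convenient thing to look for in the strace of the other process.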