No, unfortunately the specification of interfaces is a little more complicated... eth0/1/2 are not common to both machines.
I've tried to play with (oob/btl)_tcp_if_include, but honestly... I don't know exactly how. Anyway, do you have any ideas on how to further debug the communication problem?

Cheers,
Krzysztof

2010/11/11 Ralph Castain <r...@open-mpi.org>

> There are two connections to be specified:
>
>   -mca oob_tcp_if_include xxx
>   -mca btl_tcp_if_include xxx
>
>
> On Nov 11, 2010, at 12:04 PM, Krzysztof Zarzycki wrote:
>
> Hi,
> I'm working with Grzegorz on the problem mentioned.
> If I checked the firewall settings correctly, "iptables --list" shows an
> empty list of rules. The second host does not have iptables installed at
> all.
>
> So what could be the next cause of this problem?
>
> By the way, how can I force mpirun to use a specific Ethernet interface
> for connections if I have several available?
>
> Cheers,
> Krzysztof
>
> 2010/11/11 Jeff Squyres <jsquy...@cisco.com>
>
>> I'd check the firewall settings. The stack trace indicates that one
>> host is trying to connect to the other (Open MPI initiates non-blocking
>> TCP connections that can be polled later).
>>
>>
>> On Nov 10, 2010, at 12:46 PM, David Zhang wrote:
>>
>> > Have you double-checked that your firewall settings, TCP/IP settings,
>> > and SSH keys are all set up correctly for all machines, including the
>> > host?
>> >
>> > On Wed, Nov 10, 2010 at 2:57 AM, Grzegorz Maj <ma...@wp.pl> wrote:
>> > Hi all,
>> > I've got a problem with sending messages from one of my machines. It
>> > appears during MPI_Send/MPI_Recv and MPI_Bcast. The simplest case I've
>> > found is two processes: rank 0 sends a simple message and rank 1
>> > receives it. I execute these processes using mpirun with -np 2.
>> >
>> > - When both processes are executed on the host machine, it works fine.
>> > - When both processes are executed on client machines (both on the
>> >   same machine or on different ones), it works fine.
>> > - When the sender is executed on one of the client machines and the
>> >   receiver on the host machine, it works fine.
>> > - When the sender is executed on the host machine and the receiver on
>> >   a client machine, it blocks.
>> >
>> > This last case is my problem. When I add '--mca btl_base_verbose 30'
>> > to mpirun, I get:
>> >
>> > ----------
>> > [host:28186] mca: base: components_open: Looking for btl components
>> > [host:28186] mca: base: components_open: opening btl components
>> > [host:28186] mca: base: components_open: found loaded component self
>> > [host:28186] mca: base: components_open: component self has no register function
>> > [host:28186] mca: base: components_open: component self open function successful
>> > [host:28186] mca: base: components_open: found loaded component sm
>> > [host:28186] mca: base: components_open: component sm has no register function
>> > [host:28186] mca: base: components_open: component sm open function successful
>> > [host:28186] mca: base: components_open: found loaded component tcp
>> > [host:28186] mca: base: components_open: component tcp has no register function
>> > [host:28186] mca: base: components_open: component tcp open function successful
>> > [host:28186] select: initializing btl component self
>> > [host:28186] select: init of component self returned success
>> > [host:28186] select: initializing btl component sm
>> > [host:28186] select: init of component sm returned success
>> > [host:28186] select: initializing btl component tcp
>> > [host:28186] select: init of component tcp returned success
>> > [client01:19803] mca: base: components_open: Looking for btl components
>> > [client01:19803] mca: base: components_open: opening btl components
>> > [client01:19803] mca: base: components_open: found loaded component self
>> > [client01:19803] mca: base: components_open: component self has no register function
>> > [client01:19803] mca: base: components_open: component self open function successful
>> > [client01:19803] mca: base: components_open: found loaded component sm
>> > [client01:19803] mca: base: components_open: component sm has no register function
>> > [client01:19803] mca: base: components_open: component sm open function successful
>> > [client01:19803] mca: base: components_open: found loaded component tcp
>> > [client01:19803] mca: base: components_open: component tcp has no register function
>> > [client01:19803] mca: base: components_open: component tcp open function successful
>> > [client01:19803] select: initializing btl component self
>> > [client01:19803] select: init of component self returned success
>> > [client01:19803] select: initializing btl component sm
>> > [client01:19803] select: init of component sm returned success
>> > [client01:19803] select: initializing btl component tcp
>> > [client01:19803] select: init of component tcp returned success
>> > 00 of 2 host
>> > [host:28186] btl: tcp: attempting to connect() to address 10.0.7.97 on port 53255
>> > 01 of 2 client01
>> > ----------
>> >
>> > The lines "00 of 2 host" and "01 of 2 client01" are just my debug
>> > output, printing "mpirank of comm_size hostname". The second-to-last
>> > line appears during the call to Send:
>> >
>> >   MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);
>> >
>> > When I execute the sender on the host under strace, I get:
>> >
>> > ----------
>> > ...
>> > connect(10, {sa_family=AF_INET, sin_port=htons(1024), sin_addr=inet_addr("10.0.7.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 1 ([{fd=10, revents=POLLOUT}])
>> > getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
>> > send(10, "D\227\0\1\0\0\0\0", 8, 0) = 8
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 1 ([{fd=10, revents=POLLIN}])
>> > recv(10, "", 8, 0) = 0
>> > close(10) = 0
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
>> > ...
>> > (forever)
>> > ...
>> > ----------
>> >
>> > To me it looks like the connect() above is responsible for
>> > establishing the connection, but I'm afraid I don't understand what
>> > all those poll() calls are supposed to do.
>> >
>> > Attaching gdb to the sender gives me:
>> >
>> > ----------
>> > (gdb) bt
>> > #0  0xffffe410 in __kernel_vsyscall ()
>> > #1  0x0064993b in poll () from /lib/libc.so.6
>> > #2  0xf7df07b5 in poll_dispatch () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #3  0xf7def8c3 in opal_event_base_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #4  0xf7defbe7 in opal_event_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #5  0xf7de323b in opal_progress () from /home/gmaj/openmpi/lib/libopen-pal.so.0
>> > #6  0xf7c51455 in mca_pml_ob1_send () from /home/gmaj/openmpi/lib/openmpi/mca_pml_ob1.so
>> > #7  0xf7ed9c60 in PMPI_Send () from /home/gmaj/openmpi/lib/libmpi.so.0
>> > #8  0x0804e900 in main ()
>> > ----------
>> >
>> > If anybody knows what may be causing this problem, or what I could do
>> > to track down the cause, any help is appreciated.
>> >
>> > My Open MPI version is 1.4.1.
>> >
>> > Regards,
>> > Grzegorz Maj
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > --
>> > David Zhang
>> > University of California, San Diego
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
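On the interface-selection question that opens the thread: Ralph's two MCA parameters are the mechanism, and their `_exclude` counterparts can help when, as Krzysztof notes, the interface names are not the same on both machines. A sketch under those assumptions; the interface names (eth1, eth0) and the program name are illustrative, not taken from the thread:

```shell
# Restrict Open MPI's out-of-band and TCP BTL traffic to one interface.
# "eth1" is illustrative -- substitute whatever interface actually
# carries the 10.0.7.x network.
mpirun -np 2 \
    -mca oob_tcp_if_include eth1 \
    -mca btl_tcp_if_include eth1 \
    ./my_mpi_program

# If the usable interface is named differently on each host, excluding
# the known-bad interfaces may be easier than including the good one:
mpirun -np 2 \
    -mca oob_tcp_if_exclude lo,eth0 \
    -mca btl_tcp_if_exclude lo,eth0 \
    ./my_mpi_program
```

MCA parameters given on the mpirun command line apply to the whole job, so if no single name fits every host, the same parameters can instead be set in each machine's local MCA parameter file (e.g. `~/.openmpi/mca-params.conf`), letting every node name its own correct interface.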