Open MPI uses TCP for two separate things: the out-of-band (OOB) channel used by the runtime, and the TCP BTL that carries MPI traffic. So there are two interface selections to be specified:

-mca oob_tcp_if_include xxx
-mca btl_tcp_if_include xxx
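
For example, to pin both of them to a single interface (eth0 and ./your_app below are
just placeholders for your actual device name and executable):

  mpirun -mca oob_tcp_if_include eth0 -mca btl_tcp_if_include eth0 -np 2 ./your_app

There are also matching oob_tcp_if_exclude / btl_tcp_if_exclude parameters if it is
easier to name the interfaces you want avoided.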


On Nov 11, 2010, at 12:04 PM, Krzysztof Zarzycki wrote:

> Hi, 
> I'm working with Grzegorz on the problem he described. 
> If I'm checking the firewall settings correctly, "iptables --list" shows an 
> empty list of rules.
> The second host does not have iptables installed at all.
> 
> So what could be another reason for this problem?
> 
> By the way, how can I force mpirun to use a specific Ethernet interface for 
> its connections if I have several available? 
> 
> Cheers,
> Krzysztof 
> 
> 2010/11/11 Jeff Squyres <jsquy...@cisco.com>
> I'd check the firewall settings.  The stack trace indicates that one host 
> is trying to connect to the other (Open MPI initiates non-blocking TCP 
> connections and polls them for completion later).
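> 
> Roughly, what the TCP BTL does under the hood is the standard non-blocking 
> connect pattern.  This is not Open MPI's actual code, just a minimal sketch 
> of the pattern, but it is the same connect()/poll()/getsockopt() sequence 
> you can see in the strace output below:
> 
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
> #include <fcntl.h>
> #include <poll.h>
> #include <errno.h>
> #include <string.h>
> 
> /* Kick off a TCP connect without blocking the caller. */
> int start_connect(const char *ip, unsigned short port)
> {
>     int fd = socket(AF_INET, SOCK_STREAM, 0);
>     fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
> 
>     struct sockaddr_in addr;
>     memset(&addr, 0, sizeof(addr));
>     addr.sin_family = AF_INET;
>     addr.sin_port = htons(port);
>     inet_pton(AF_INET, ip, &addr.sin_addr);
> 
>     if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 &&
>         errno != EINPROGRESS) {
>         return -1;                /* immediate failure */
>     }
>     return fd;                    /* connect is in progress */
> }
> 
> /* Later, from the progress loop: poll for writability, then check
>    SO_ERROR to learn whether the connect actually succeeded. */
> int finish_connect(int fd)
> {
>     struct pollfd p = { fd, POLLOUT, 0 };
>     if (poll(&p, 1, 0) <= 0) return 0;          /* not ready yet */
> 
>     int err = 0;
>     socklen_t len = sizeof(err);
>     getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
>     return (err == 0) ? 1 : -1;                 /* 1 = connected */
> }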
> 
> 
> On Nov 10, 2010, at 12:46 PM, David Zhang wrote:
> 
> > Have you double-checked that your firewall settings, TCP/IP settings, and SSH 
> > keys are all set up correctly on all machines, including the host?
> >
> > On Wed, Nov 10, 2010 at 2:57 AM, Grzegorz Maj <ma...@wp.pl> wrote:
> > Hi all,
> > I've got a problem with sending messages from one of my machines. It
> > appears during MPI_Send/MPI_Recv and MPI_Bcast. The simplest case I've
> > found is two processes, rank 0 sending a simple message and rank 1
> > receiving this message. I execute these processes using mpirun with
> > -np 2.
> > - when both processes are executed on the host machine, it works fine;
> > - when both processes are executed on client machines (both on the
> > same or different machines), it works fine;
> > - when the sender is executed on one of the client machines and the receiver
> > on the host machine, it works fine;
> > - when the sender is executed on the host machine and the receiver on a
> > client machine, it blocks.
> >
> > This last case is my problem. When I add the option '--mca
> > btl_base_verbose 30' to mpirun, I get:
> >
> > ----------
> > [host:28186] mca: base: components_open: Looking for btl components
> > [host:28186] mca: base: components_open: opening btl components
> > [host:28186] mca: base: components_open: found loaded component self
> > [host:28186] mca: base: components_open: component self has no register 
> > function
> > [host:28186] mca: base: components_open: component self open function 
> > successful
> > [host:28186] mca: base: components_open: found loaded component sm
> > [host:28186] mca: base: components_open: component sm has no register 
> > function
> > [host:28186] mca: base: components_open: component sm open function 
> > successful
> > [host:28186] mca: base: components_open: found loaded component tcp
> > [host:28186] mca: base: components_open: component tcp has no register 
> > function
> > [host:28186] mca: base: components_open: component tcp open function 
> > successful
> > [host:28186] select: initializing btl component self
> > [host:28186] select: init of component self returned success
> > [host:28186] select: initializing btl component sm
> > [host:28186] select: init of component sm returned success
> > [host:28186] select: initializing btl component tcp
> > [host:28186] select: init of component tcp returned success
> > [client01:19803] mca: base: components_open: Looking for btl components
> > [client01:19803] mca: base: components_open: opening btl components
> > [client01:19803] mca: base: components_open: found loaded component self
> > [client01:19803] mca: base: components_open: component self has no
> > register function
> > [client01:19803] mca: base: components_open: component self open
> > function successful
> > [client01:19803] mca: base: components_open: found loaded component sm
> > [client01:19803] mca: base: components_open: component sm has no
> > register function
> > [client01:19803] mca: base: components_open: component sm open
> > function successful
> > [client01:19803] mca: base: components_open: found loaded component tcp
> > [client01:19803] mca: base: components_open: component tcp has no
> > register function
> > [client01:19803] mca: base: components_open: component tcp open
> > function successful
> > [client01:19803] select: initializing btl component self
> > [client01:19803] select: init of component self returned success
> > [client01:19803] select: initializing btl component sm
> > [client01:19803] select: init of component sm returned success
> > [client01:19803] select: initializing btl component tcp
> > [client01:19803] select: init of component tcp returned success
> > 00 of 2 host
> > [host:28186] btl: tcp: attempting to connect() to address 10.0.7.97 on
> > port 53255
> > 01 of 2 client01
> > ----------
> >
> > The lines "00 of 2 host" and "01 of 2 client01" are just my debug output,
> > printing "mpirank of comm_size hostname". The second-to-last line appears
> > during the call to Send:
> > MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);
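> >
> > In case it helps, a minimal self-contained version of my test looks roughly
> > like this (the hostname printing and the message contents are placeholders,
> > not my exact code):
> >
> > ----------
> > #include <mpi.h>
> > #include <unistd.h>
> > #include <cstdio>
> >
> > int main(int argc, char *argv[])
> > {
> >     MPI::Init(argc, argv);
> >     int rank = MPI::COMM_WORLD.Get_rank();
> >     int size = MPI::COMM_WORLD.Get_size();
> >
> >     char hostname[256];
> >     gethostname(hostname, sizeof(hostname));
> >     printf("%02d of %d %s\n", rank, size, hostname);  // the "00 of 2 host" lines
> >
> >     char message[5] = "test";                         // 4 chars + '\0'
> >     if (rank == 0) {
> >         MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);  // blocks here
> >     } else if (rank == 1) {
> >         MPI::COMM_WORLD.Recv(message, 5, MPI::CHAR, 0, 13);
> >     }
> >
> >     MPI::Finalize();
> >     return 0;
> > }
> > ----------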
> >
> > When executing the sender on host with strace, I get:
> >
> > ----------
> > ...
> > connect(10, {sa_family=AF_INET, sin_port=htons(1024),
> > sin_addr=inet_addr("10.0.7.97")}, 16) = -1 EINPROGRESS (Operation now
> > in progress)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 1 ([{fd=10,
> > revents=POLLOUT}])
> > getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
> > send(10, "D\227\0\1\0\0\0\0", 8, 0)     = 8
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 1 ([{fd=10,
> > revents=POLLIN}])
> > recv(10, "", 8, 0)                      = 0
> > close(10)                               = 0
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}], 6, 0) = 0 (Timeout)
> > poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> > events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9,
> > events=POLLIN}], 6, 0) = 0 (Timeout)
> > ...
> > (forever)
> > ...
> > ----------
> >
> > To me it looks like the connect() above is responsible for establishing the
> > connection, but I'm afraid I don't understand what those calls to
> > poll() are supposed to do.
> >
> > Attaching gdb to the sender gives me:
> >
> > ----------
> > (gdb) bt
> > #0  0xffffe410 in __kernel_vsyscall ()
> > #1  0x0064993b in poll () from /lib/libc.so.6
> > #2  0xf7df07b5 in poll_dispatch () from 
> > /home/gmaj/openmpi/lib/libopen-pal.so.0
> > #3  0xf7def8c3 in opal_event_base_loop () from
> > /home/gmaj/openmpi/lib/libopen-pal.so.0
> > #4  0xf7defbe7 in opal_event_loop () from
> > /home/gmaj/openmpi/lib/libopen-pal.so.0
> > #5  0xf7de323b in opal_progress () from 
> > /home/gmaj/openmpi/lib/libopen-pal.so.0
> > #6  0xf7c51455 in mca_pml_ob1_send () from
> > /home/gmaj/openmpi/lib/openmpi/mca_pml_ob1.so
> > #7  0xf7ed9c60 in PMPI_Send () from /home/gmaj/openmpi/lib/libmpi.so.0
> > #8  0x0804e900 in main ()
> > ----------
> >
> > If anybody knows what may be causing this problem, or what I can do to find
> > the reason, any help is appreciated.
> >
> > My Open MPI version is 1.4.1.
> >
> >
> > Regards,
> > Grzegorz Maj
> >
> >
> >
> > --
> > David Zhang
> > University of California, San Diego
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
