[OMPI users] Bug when mixing sent types in version 1.6
Hi everybody,

I am currently hitting a bug when launching a very simple MPI program with mpirun on connected nodes. It happens when I send an INT and then some CHAR strings from a master node to a worker node. Here is the minimal code to reproduce the bug:

# include <stdio.h>
# include <string.h>
# include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const char someString[] = "Can haz cheezburgerz?";

    MPI_Init(&argc, &argv);

    MPI_Comm_rank( MPI_COMM_WORLD, & rank );
    MPI_Comm_size( MPI_COMM_WORLD, & size );

    if ( rank == 0 )
    {
        int len = strlen( someString );
        int i;
        for( i = 1; i < size; ++i)
        {
            MPI_Send( &len, 1, MPI_INT, i, 0, MPI_COMM_WORLD );
            MPI_Send( &someString, len+1, MPI_CHAR, i, 0, MPI_COMM_WORLD );
        }
    } else {
        char buffer[ 128 ];
        int receivedLen;
        MPI_Status stat;
        MPI_Recv( &receivedLen, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat );
        printf( "[Worker] Length : %d\n", receivedLen );
        MPI_Recv( buffer, receivedLen+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
        printf( "[Worker] String : %s\n", buffer );
    }

    MPI_Finalize();
}

I know that there is a better way to send a string, by giving a maximum buffer size at the second MPI_Recv, but that is not the main topic here.

The launch works locally (i.e. when the 2 processes are launched on one machine), but doesn't work when the 2 processes are dispatched to 2 machines over the network (i.e. one per host). In this case, the worker correctly reads the INT, and then master and worker both block on the next call.

I have no issue when sending only char strings or only numbers. This only happens when sending char strings then numbers, or in the other order.

I'm using OpenMPI version 1.6, locally compiled.

$ uname -a
Linux trtp7097 2.6.32-220.13.1.el6.x86_64 #1 SMP Thu Mar 29 11:46:40 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.2 (Santiago)

Is it a bad use of the framework, or could it be a bug? Thank you in advance.

Benjamin
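P.S. For completeness, the "better way" mentioned above would look roughly like this on the worker side (an untested sketch): post the receive with the full buffer size, which is legal even when the incoming message is shorter, and then ask MPI_Get_count how many chars actually arrived, which makes the separate length message unnecessary.

    char buffer[ 128 ];
    int actualLen;
    MPI_Status stat;
    /* Receive at most sizeof(buffer) chars; a shorter message is fine. */
    MPI_Recv( buffer, sizeof(buffer), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat );
    /* Ask how many MPI_CHAR elements were actually delivered. */
    MPI_Get_count( &stat, MPI_CHAR, &actualLen );
    printf( "[Worker] String : %s (%d chars)\n", buffer, actualLen );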
[OMPI users] RE : Bug when mixing sent types in version 1.6
Hi Jeff,

Thanks for your answer. I have downloaded the NetPIPE benchmark suite, ran `make mpi`, and launched the resulting executable with mpirun. Here is an interesting fact: launching this executable on 2 nodes works; on 3 nodes, it blocks, I guess on connect. Each process runs on a core on each machine, using 100% of one CPU, but nothing else happens. I have to kill the program to quit. Setting the option -mca btl_base_verbose to 30 shows me that the last thing each node tries is to connect to the other nodes.

Could it be a network issue?

Thanks,
--
Benjamin Bouvier

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of Jeff Squyres [jsquy...@cisco.com]
Sent: Friday, June 8, 2012 16:30
To: Open MPI Users
Subject: Re: [OMPI users] Bug when mixing sent types in version 1.6

On Jun 8, 2012, at 6:43 AM, BOUVIER Benjamin wrote:

> # include <stdio.h>
> # include <string.h>
> # include <mpi.h>
>
> int main(int argc, char **argv)
> {
>    int rank, size;
>    const char someString[] = "Can haz cheezburgerz?";
>
>    MPI_Init(&argc, &argv);
>
>    MPI_Comm_rank( MPI_COMM_WORLD, & rank );
>    MPI_Comm_size( MPI_COMM_WORLD, & size );
>
>    if ( rank == 0 )
>    {
>        int len = strlen( someString );
>        int i;
>        for( i = 1; i < size; ++i)
>        {
>            MPI_Send( &len, 1, MPI_INT, i, 0, MPI_COMM_WORLD );
>            MPI_Send( &someString, len+1, MPI_CHAR, i, 0, MPI_COMM_WORLD );
>        }
>    } else {
>        char buffer[ 128 ];
>        int receivedLen;
>        MPI_Status stat;
>        MPI_Recv( &receivedLen, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &stat );
>        printf( "[Worker] Length : %d\n", receivedLen );
>        MPI_Recv( buffer, receivedLen+1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
>        printf( "[Worker] String : %s\n", buffer );
>    }
>
>    MPI_Finalize();
> }

I don't see anything obviously wrong with this code.

> I know that there is a better way to send a string, by giving a maximum buffer size at the second MPI_Recv, but that is not the main topic here.
> The launch works locally (i.e. when the 2 processes are launched on one machine), but doesn't work when the 2 processes are dispatched to 2 machines over the network (i.e. one per host). In this case, the worker correctly reads the INT, and then master and worker both block on the next call.

That's very odd.

> I have no issue when sending only char strings or only numbers. This only happens when sending char strings then numbers, or in the other order.

That's even more odd.

Can you run standard benchmarks like NetPIPE and/or the OSU benchmarks? (across multiple nodes, that is)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
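(For reference, the benchmark run described above was along these lines; NPmpi is the executable that NetPIPE's `make mpi` target produces, and the host names here are placeholders:

$ mpirun -np 3 --host hostA,hostB,hostC --mca btl_base_verbose 30 ./NPmpi
)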
[OMPI users] RE : RE : Bug when mixing sent types in version 1.6
Hi,

> I'd guess that running net pipe with 3 procs may be undefined.

It is indeed undefined. Running the NetPIPE program locally with 3 processes also blocks on my computer.

What makes this issue especially weird is that the example program runs fine over the network with the MPICH2 implementation, for 2 processes.

However, with MPICH2 it fails with 3 processes and also blocks on connect ("Connection refused"), which could indicate that it is actually a network issue affecting both MPICH2 and OMPI. I don't know how many connections OMPI uses to send the data in the example program, but under the assumption that it tries to open 2 connections (while for the same program MPICH2 only uses one, which is another hypothesis), the number of connections may be the right place to look. I'll ask the MPICH2 users on their mailing list to get their opinion about it.

Now that I know the program fails with both the OMPI and MPICH2 implementations, I guess the problem is not dependent on the MPI implementation.

If you have any ideas or comments, I would be pleased to hear them.

--
Benjamin Bouvier
[OMPI users] RE : RE : RE : Bug when mixing sent types in version 1.6
Hi,

Thanks for your hints, Jeff. I've just tried with all firewalls disabled on the machines involved, but the issue remains.

# /etc/init.d/ip6tables status
ip6tables: Firewall is not running.
# /etc/init.d/iptables status
iptables: Firewall is not running.

The machines have the host names "node1", "node2" and "node3". I launch the basic program on one machine, asking node1 and node2 to be the hosts. Typing `netstat -a | grep node1` from node2 shows me that node1 and node2 are connected over TCP, as the connection is marked ESTABLISHED. I see the same thing when I run `netstat -a | grep node2` from node1. However, the program still blocks.

What else could provoke this failure?

--
Benjamin BOUVIER

To start, I would ensure that all firewalling (e.g., iptables) is disabled on all machines involved.

On Jun 11, 2012, at 10:16 AM, BOUVIER Benjamin wrote:

> Hi,
>
>> I'd guess that running net pipe with 3 procs may be undefined.
>
> It is indeed undefined. Running the NetPIPE program locally with 3 processes also blocks on my computer.
>
> What makes this issue especially weird is that the example program runs fine over the network with the MPICH2 implementation, for 2 processes.
>
> However, with MPICH2 it fails with 3 processes and also blocks on connect ("Connection refused"), which could indicate that it is actually a network issue affecting both MPICH2 and OMPI. I don't know how many connections OMPI uses to send the data in the example program, but under the assumption that it tries to open 2 connections (while for the same program MPICH2 only uses one, which is another hypothesis), the number of connections may be the right place to look. I'll ask the MPICH2 users on their mailing list to get their opinion about it.
>
> Now that I know the program fails with both the OMPI and MPICH2 implementations, I guess the problem is not dependent on the MPI implementation.
>
> If you have any ideas or comments, I would be pleased to hear them.
>
> --
> Benjamin Bouvier

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
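(A follow-up detail worth collecting with the same tools -- a suggestion, assuming the net-tools netstat shipped with RHEL 6: the numeric view lists the local and remote IP addresses of each established connection, which can be compared against the ifconfig output on each node to see which interface the traffic actually goes through.

$ netstat -tn | grep ESTABLISHED
)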
[OMPI users] RE : RE : RE : RE : Bug when mixing sent types in version 1.6
Wow. I thought at first that all combinations would be equivalent, but in fact this is not the case... I've kept the firewalls down during all the tests.

> - on node1, "mpirun --host node1,node2 ring_c"

Works.

> - on node1, "mpirun --host node1,node3 ring_c"
> - on node1, "mpirun --host node2,node3 ring_c"

Blocks after "Process 0 sent to 1".

> - on node1, "mpirun --host node1,node2,node3 ring_c"

"Process 0 sending 10 to 1, tag 201 (3 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9"
then blocks.

> Repeat all 4 from node2.

On node2:
- "mpirun --host node2,node1 ring_c": OK
- "mpirun --host node2,node3 ring_c": blocks at the same point as above.
- "mpirun --host node1,node3 ring_c": blocks at the same point as above.
- "mpirun --host node1,node2,node3 ring_c": blocks at the same point as mentioned above in the 3-host case.

I recompiled this test program with MPICH2 and get exactly the same failures at the same points. There is really something wrong with that network...

--
Benjamin Bouvier
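(For readers following along: ring_c ships in the examples/ directory of the Open MPI source tree. A minimal sketch of its logic, reconstructed here from the output quoted above rather than copied from the actual source:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, message;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;           /* neighbour to send to */
    int prev = (rank + size - 1) % size;    /* neighbour to receive from */

    if (rank == 0) {
        message = 10;
        printf("Process 0 sending %d to %d, tag 201 (%d processes in ring)\n",
               message, next, size);
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
        printf("Process 0 sent to %d\n", next);
    }

    /* Pass the token around the ring; rank 0 decrements it once per lap,
     * and every process exits after forwarding a zero. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0) {
            --message;
            printf("Process 0 decremented value: %d\n", message);
        }
        MPI_Send(&message, 1, MPI_INT, next, 201, MPI_COMM_WORLD);
        if (message == 0)
            break;
    }

    /* Rank 0 still has one incoming token in flight after it breaks. */
    if (rank == 0)
        MPI_Recv(&message, 1, MPI_INT, prev, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}

This makes ring_c a convenient connectivity test: the token has to traverse every neighbour-to-neighbour link in the ring, so a hang points at the first link that cannot be established.)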
[OMPI users] RE : RE : RE : RE : RE : Bug when mixing sent types in version 1.6
Hi,

I've found with ifconfig that each node has 2 interfaces, eth0 and eth1. I've run mpiexec with the parameter --mca btl_tcp_if_include eth0 (or eth1) to see whether there were issues between nodes. Here are the results:

- node1,node2 works with eth1, not with eth0.
- node1,node3 works with eth1, not with eth0.
- node2,node3 does not work with eth1, but works with eth0.
- node1,node2,node3 works with eth1 (!), not with eth0.

These tests even work with the firewalls activated.

Actually, the order of the nodes matters, as `mpiexec --mca btl_tcp_if_include eth0 --host node1,node2 ./ring_c` does not work, but `mpiexec --mca btl_tcp_if_include eth0 --host node2,node1 ./ring_c` works. The same thing happens if I change the order when launching the 3 processes (putting node2 in the first position).

I find that a little disturbing, but I guess the network configuration is to blame.

Thanks a lot, Jeff Squyres; your hints helped me find the source of the problem. As so often happens, the problem didn't come from OpenMPI but from the network configuration. I'll ask my sysadmin to help me configure the interfaces so that it works without setting any MCA parameter (see the note after the quoted message below for a way to make the setting persistent in the meantime).

Thank you one more time.

--
Benjamin Bouvier

> What's the output from ifconfig on all nodes?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
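(A note on persistence, as referenced above -- a sketch, assuming the standard per-user MCA parameter file that Open MPI reads at startup: until the interfaces are reconfigured, the working setting can be kept out of the command line by putting it in $HOME/.openmpi/mca-params.conf on each node:

# $HOME/.openmpi/mca-params.conf -- read by mpirun/mpiexec at startup
btl_tcp_if_include = eth1

A value given on the command line with --mca still overrides this file, so it stays easy to experiment with eth0.)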