Yes and no. So first, here is a quick fix for you: if you start the server using

    mpirun -np 2 -mca coll ^inter ./server

your test code finishes (with one minor modification to your code, namely that the process being excluded on the client side also needs a condition to leave the while loop).
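As a rough illustration of that modification (your attached client.cc is not reproduced here, so the loop structure and variable names below are assumptions, not your actual code): after the Excl()/Create() step the excluded client ends up with server_comm == MPI::COMM_NULL, and its main loop needs a branch that lets it drop out instead of waiting on a communicator it no longer belongs to.

    // Hypothetical client main loop -- structure and names are assumed,
    // not taken from the attached client.cc.
    bool running = true;
    while (running) {
        if (server_comm == MPI::COMM_NULL) {
            // This process was removed via Excl(); it can no longer take
            // part in collectives on server_comm, so let it leave the loop.
            running = false;
            continue;
        }

        // ... regular client work ...
        server_comm.Barrier();
    }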
That being said, here is what the problem seems to be when using the inter communicator module. The inter-communicator barrier is initially handled by the basic module and is implemented by calling an allreduce operation. The inter-communicator allreduce by default uses the implementation in the inter module: a sequence of an intra-communicator reduce on the local communicator, a point-to-point exchange of the two local groups' results by the local root processes (rank zero in each local group of the intercomm), and a broadcast of the result within the local group. And it is this very last step where we are hanging.

So, bottom line: the intra-communicator broadcast for a communicator of size 1 is hanging, as far as I can see independently of whether we use tuned or basic. I do not recall what the agreement was on how to treat the size=1 scenarios in coll. Looking at the routine in tuned (e.g. ompi_coll_tuned_bcast_intra_generic), there is an assert(size > 1) which clearly indicates that it should not be used for a single process, but I do not recall in which module the size=1 case was supposed to be handled. I am also not sure why the bcast on 1 process works on the server side but does not on the client side.
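To make that sequence concrete, here is a rough sketch of the decomposition described above. It is illustrative only and not the actual Open MPI coll code; it assumes an int/MPI::SUM reduction and that the caller already has the intra-communicator of its local group at hand (inside Open MPI the coll framework has that available).

    #include <mpi.h>

    // Sketch of an inter-communicator allreduce as: local reduce ->
    // root-to-root exchange -> local broadcast.  NOT the Open MPI source.
    static void inter_allreduce_sketch(MPI::Intercomm &inter,
                                       MPI::Intracomm &local,
                                       int sendval, int &result)
    {
        const int local_root = 0;      // rank 0 of each local group
        int reduced = 0;

        // Step 1: intra-communicator reduce inside the local group.
        local.Reduce(&sendval, &reduced, 1, MPI::INT, MPI::SUM, local_root);

        if (local.Get_rank() == local_root) {
            // Step 2: the two local roots exchange their partial results
            // point-to-point across the inter-communicator (remote rank 0).
            int remote = 0;
            inter.Sendrecv(&reduced, 1, MPI::INT, 0, 0,
                           &remote,  1, MPI::INT, 0, 0);
            result = reduced + remote;
        }

        // Step 3: broadcast the combined result within the local group.
        local.Bcast(&result, 1, MPI::INT, local_root);
    }

With a local group of size 1, step 3 should effectively be a no-op; instead, that broadcast is where the client appears to block.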
That's where I stand right now in the analysis.

Thanks
Edgar

On 3/26/2012 8:39 AM, Rodrigo Oliveira wrote:
> Hi Edgar,
>
> Did you take a look at my code? Any idea about what is happening? I did
> a lot of tests and it does not work.
>
> Thanks
>
> On Tue, Mar 20, 2012 at 3:43 PM, Rodrigo Oliveira
> <rsilva.olive...@gmail.com> wrote:
>
> The command I use to compile and run is:
>
> mpic++ server.cc -o server && mpic++ client.cc -o client && mpirun -np 1 ./server
>
> Rodrigo
>
> On Tue, Mar 20, 2012 at 3:40 PM, Rodrigo Oliveira
> <rsilva.olive...@gmail.com> wrote:
>
> Hi Edgar.
>
> Thanks for the response. The simplified code is attached: server,
> client and a .h containing some constants. I put some "prints" to
> show the behavior.
>
> Regards
>
> Rodrigo
>
> On Tue, Mar 20, 2012 at 11:47 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
>
> do you have by any chance the actual or a small reproducer? It might be
> much easier to hunt the problem down...
>
> Thanks
> Edgar
>
> On 3/19/2012 8:12 PM, Rodrigo Oliveira wrote:
> > Hi there.
> >
> > I am facing a very strange problem when using MPI_Barrier over an
> > inter-communicator after some operations I describe below:
> >
> > 1) I start a server calling mpirun.
> > 2) The server spawns 2 copies of a client using MPI_Comm_spawn, creating
> > an inter-communicator between the two groups: the server group with 1
> > process (let us call it group A) and the client group with 2 processes (group B).
> > 3) After that, I need to detach one of the processes (rank 0) in group B
> > from the inter-communicator AB. To do that I do the following steps:
> >
> > Server side:
> > .....
> > tmp_inter_comm = client_comm.Create ( client_comm.Get_group ( ) );
> > client_comm.Free ( );
> > client_comm = tmp_inter_comm;
> > .....
> > client_comm.Barrier();
> > .....
> >
> > Client side:
> > ....
> > rank = 0;
> > tmp_inter_comm = server_comm.Create ( server_comm.Get_group ( ).Excl ( 1, &rank ) );
> > server_comm.Free ( );
> > server_comm = tmp_inter_comm;
> > .....
> > if (server_comm != MPI::COMM_NULL)
> >     server_comm.Barrier();
> >
> > The problem: everything works fine until the call to Barrier. At that
> > point, the server exits the barrier, but the client in group B does
> > not. Observe that we have only one process inside B, because I used Excl
> > to remove one process from the original group.
> >
> > p.s.: This occurs in version 1.5.4 with the C++ API.
> >
> > I am very concerned about this problem because this solution plays a
> > very important role in my master thesis.
> >
> > Is this an ompi problem or am I doing something wrong?
> >
> > Thanks in advance
> >
> > Rodrigo Oliveira

--
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335