Yes and no. First, here is a quick fix for you: if you start the
server using

mpirun -np 2 -mca coll ^inter ./server

your test code finishes, with one minor modification to your code:
the process that is being excluded on the client side also needs a
condition to leave the while loop.

That being said, here is what the problem seems to be when using the
inter communicator module. The inter-communicator barrier is handled
initially by the basic module, which implements it by calling an
allreduce operation. The inter-communicator allreduce by default uses
the implementation in the inter module, which is a sequence of three
steps: an intra-reduce on the local communicator, a point-to-point
exchange of the results of the two local groups by the local root
processes (rank zero in the local groups of the intercomm), and a
broadcast of the results within the local group. It is in this very
last step that we hang.
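
To make the three phases concrete, here is a minimal sketch in plain C++.
It is a toy model and not Open MPI code: the function names
(local_reduce_sum, local_bcast, inter_allreduce_sum) are made up for
illustration, each group is just a vector of per-rank integers, and the
reduction is assumed to be a sum. It only shows the data flow, in which
each group ends up holding the reduction of the other group's data.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Phase 1 (hypothetical helper): intra-group reduce to the local root (rank 0).
int local_reduce_sum(const std::vector<int>& contributions) {
    assert(!contributions.empty());  // a group has at least one process
    return std::accumulate(contributions.begin(), contributions.end(), 0);
}

// Phase 3 (hypothetical helper): intra-group broadcast from the local root.
// For group_size == 1 the result is simply the root's own value.
std::vector<int> local_bcast(int root_value, std::size_t group_size) {
    return std::vector<int>(group_size, root_value);
}

// Toy model of an inter-communicator allreduce (sum) over groups A and B.
// Returns the values held by every rank of A (first) and of B (second).
std::pair<std::vector<int>, std::vector<int>>
inter_allreduce_sum(const std::vector<int>& group_a,
                    const std::vector<int>& group_b) {
    // Phase 1: each group reduces its own data to its local root.
    int at_root_a = local_reduce_sum(group_a);
    int at_root_b = local_reduce_sum(group_b);

    // Phase 2: the two local roots exchange their partial results
    // (a point-to-point send/receive in a real implementation).
    std::swap(at_root_a, at_root_b);

    // Phase 3: each root broadcasts what it received within its own group.
    return {local_bcast(at_root_a, group_a.size()),
            local_bcast(at_root_b, group_b.size())};
}
```

For example, with group A = {1} and group B = {2, 3}, the single rank in A
ends up with 5 and both ranks in B end up with 1. The phase-3 call with a
group size of 1 corresponds to the broadcast that hangs in the real code.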

So, bottom line: the intra-communicator broadcast for a communicator
size of 1 is hanging, and as far as I can see this is independent of
whether we use tuned or basic.

I do not recall what the agreement was on how to treat the size=1
scenarios in coll. Looking at the routine in tuned (e.g.
ompi_coll_tuned_bcast_intra_generic), there is a statement which clearly
indicates that it should not be used for 1 process:

assert(size>1)

but I do not recall which module was responsible, or what the agreement
was on how that case was supposed to be handled correctly. I am also not
sure why the bcast on 1 proc works on the server side but not on the
client side. That's where I stand right now in the analysis.
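
One possible way to treat the size=1 case, purely as an illustration and
not a statement of what Open MPI does or what was agreed on: short-circuit
before dispatching to an implementation that asserts size>1. The sketch
below is plain C++ with made-up names (guarded_bcast, kSuccess as a
stand-in for MPI_SUCCESS) and a callback standing in for the multi-process
broadcast routine.

```cpp
#include <cassert>
#include <functional>

// Stand-in for MPI_SUCCESS; hypothetical, for illustration only.
constexpr int kSuccess = 0;

// Hypothetical wrapper: skip the multi-process broadcast when the
// communicator has a single process, since the root already holds the data.
int guarded_bcast(int comm_size, const std::function<int()>& multi_proc_bcast) {
    assert(comm_size >= 1);      // a communicator is never empty
    if (comm_size == 1) {
        return kSuccess;         // nothing to send, nothing to wait for
    }
    return multi_proc_bcast();   // safe: this path always has size > 1
}
```

With such a guard the callback is never invoked for a one-process group,
so an assert(size>1) inside it could not be reached in that case.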


Thanks
Edgar

On 3/26/2012 8:39 AM, Rodrigo Oliveira wrote:
> Hi Edgar, 
> 
> Did you take a look at my code? Any idea about what is happening? I did
> a lot of tests and it does not work.
> 
> Thanks
> 
> On Tue, Mar 20, 2012 at 3:43 PM, Rodrigo Oliveira
> <rsilva.olive...@gmail.com <mailto:rsilva.olive...@gmail.com>> wrote:
> 
>     The command I use to compile and run is:
> 
>     mpic++ server.cc -o server && mpic++ client.cc -o client && mpirun -np 1 ./server
> 
>     Rodrigo
> 
> 
>     On Tue, Mar 20, 2012 at 3:40 PM, Rodrigo Oliveira
>     <rsilva.olive...@gmail.com <mailto:rsilva.olive...@gmail.com>> wrote:
> 
>         Hi Edgar.
> 
>         Thanks for the response. The simplified code is attached:
>         server, client and a .h containing some constants. I put some
>         "prints" to show the behavior.
> 
>         Regards
> 
>         Rodrigo
> 
> 
>         On Tue, Mar 20, 2012 at 11:47 AM, Edgar Gabriel
>         <gabr...@cs.uh.edu <mailto:gabr...@cs.uh.edu>> wrote:
> 
>             Do you have by any chance the actual code or a small
>             reproducer? It might be much easier to hunt the problem down...
> 
>             Thanks
>             Edgar
> 
>             On 3/19/2012 8:12 PM, Rodrigo Oliveira wrote:
>             > Hi there.
>             >
>             > I am facing a very strange problem when using MPI_Barrier over an
>             > inter-communicator after some operations I describe below:
>             >
>             > 1) I start a server calling mpirun.
>             > 2) The server spawns 2 copies of a client using MPI_Comm_spawn,
>             > creating an inter-communicator between the two groups: the server
>             > group with 1 process (let's call it A) and the client group with 2
>             > processes (group B).
>             > 3) After that, I need to detach one of the processes (rank 0) in
>             > group B from the inter-communicator AB. To do that I do the
>             > following steps:
>             >
>             > Server side:
>             >         .....
>             >         tmp_inter_comm = client_comm.Create ( client_comm.Get_group ( ) );
>             >         client_comm.Free ( );
>             >         client_comm = tmp_inter_comm;
>             >         .....
>             >         client_comm.Barrier();
>             >         .....
>             >
>             > Client side:
>             >         ....
>             >         rank = 0;
>             >         tmp_inter_comm = server_comm.Create ( server_comm.Get_group ( ).Excl ( 1, &rank ) );
>             >         server_comm.Free ( );
>             >         server_comm = tmp_inter_comm;
>             >         .....
>             >         if (server_comm != MPI::COMM_NULL)
>             >             server_comm.Barrier();
>             >
>             >
>             > The problem: everything works fine until the call to Barrier. At
>             > that point, the server exits the barrier, but the client in group B
>             > does not. Observe that we have only one process inside B, because I
>             > used Excl to remove one process from the original group.
>             >
>             > P.S.: This occurs in version 1.5.4 with the C++ API.
>             >
>             > I am very concerned about this problem because this solution plays
>             > a very important role in my master's thesis.
>             >
>             > Is this an ompi problem or am I doing something wrong?
>             >
>             > Thanks in advance
>             >
>             > Rodrigo Oliveira
>             >
>             >
>             > _______________________________________________
>             > users mailing list
>             > us...@open-mpi.org <mailto:us...@open-mpi.org>
>             > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Edgar Gabriel
Associate Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
