George -- can you file a ticket about this?
On Jun 12, 2011, at 1:25 PM, George Bosilca wrote:

> Frédéric,
>
> Based on the current version of the MPI standard, the two groups involved in
> the intercomm_create have to be disjoint, which means the two leaders cannot
> be the same process.
>
> Regarding the issue in Open MPI, the problem is deep in our modex exchange
> (contact information). In the example I sent around a while back, the
> intercomm_create is working, but the resulting communicator contains
> processes without this modex information. This leads to an error on the next
> collective communication.
>
> george.
>
> On Jun 12, 2011, at 03:44 , Frédéric Feyel wrote:
>
>> Dear all, thank you very much for the time spent looking at my problem.
>>
>> After reading your contributions, it's not clear whether there is a bug in
>> Open MPI or not.
>>
>> So I created a small self-contained source code to analyse the behavior,
>> and the problem is still there.
>>
>> I was wondering if the local and remote leaders in the 2 groups could be
>> the same process. Unfortunately, I get
>> an error in both cases (local and remote leaders identical or not).
>>
>> What do you think about my small source code?
>>
>> Best regards,
>>
>> Frédéric.
>>
>>
>> On Tue, 07 Jun 2011 10:31:51 -0500, Edgar Gabriel <gabr...@cs.uh.edu>
>> wrote:
>>> On 6/7/2011 10:23 AM, George Bosilca wrote:
>>>>
>>>> On Jun 7, 2011, at 11:00 , Edgar Gabriel wrote:
>>>>
>>>>> George,
>>>>>
>>>>> I did not look over all the details of your test, but it looks to
>>>>> me like you are violating one of the requirements of
>>>>> intercomm_create, namely the requirement that the two groups have to be
>>>>> disjoint. In your case the parent process(es) are part of both
>>>>> local intra-communicators, aren't they?
>>>>
>>>> The two groups of the two local communicators are disjoint. One
>>>> contains A,B while the other only C. The bridge communicator contains
>>>> A,C.
>>>>
>>>> I'm confident my example is supposed to work.
>>>> At least for Open MPI
>>>> the error is under the hood, as the resulting inter-communicator is
>>>> valid but contains NULL endpoints for the remote process.
>>>
>>> I'll come back to that later, I am not yet convinced that your code is
>>> correct :-) Your local groups might be disjoint, but I am worried about
>>> the ranks of the remote leader in your example. They cannot be 0 from
>>> both groups' perspective.
>>>
>>>>
>>>> Regarding the fact that the two leaders should be separate processes,
>>>> you will not find any wording about this in the current version of
>>>> the standard. In the 1.1 there were two opposite sentences about this,
>>>> one stating that the two groups can be disjoint, while the other
>>>> claiming that the two leaders can be the same process. After
>>>> discussion, the agreement was that the two groups have to be
>>>> disjoint, and the standard has been amended to match the agreement.
>>>
>>>
>>> I realized that this is a non-issue. If the two local groups are
>>> disjoint, there is no way that the two local leaders are the same
>>> process.
>>>
>>> Thanks
>>> Edgar
>>>
>>>>
>>>> george.
>>>>
>>>>
>>>>>
>>>>> I just have MPI-1.1 at hand right now, but here is what it says:
>>>>> ----
>>>>>
>>>>> Overlap of local and remote groups that are bound into an
>>>>> inter-communicator is prohibited. If there is overlap, then the
>>>>> program is erroneous and is likely to deadlock.
>>>>>
>>>>> ---- so bottom line is that the two local intra-communicators that
>>>>> are being used have to be disjoint, and the bridgecomm needs to be
>>>>> a communicator where at least one process of each of the two
>>>>> disjoint groups needs to be able to talk to each other.
>>>>> Interestingly I did not find a sentence on whether it is allowed to be
>>>>> the same process, or whether the two local leaders need to be
>>>>> separate processes...
>>>>>
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>>
>>>>>
>>>>> On 6/7/2011 12:57 AM, George Bosilca wrote:
>>>>>> Frederic,
>>>>>>
>>>>>> Attached you will find an example that is supposed to work. The
>>>>>> main difference with your code is on T3, T4 where you have
>>>>>> inverted the local and remote comm. As depicted in the picture
>>>>>> attached below, during the 3rd step you will create the intercomm
>>>>>> between ab and c (no overlap) using ac as a bridge communicator
>>>>>> (here the two roots, a and c, can exchange messages).
>>>>>>
>>>>>> Based on the MPI 2.2 standard, especially on the paragraph in
>>>>>> PS:, the attached code should have been working. Unfortunately, I
>>>>>> couldn't run it successfully with either Open MPI trunk or
>>>>>> MPICH2 1.4rc1.
>>>>>>
>>>>>> george.
>>>>>>
>>>>>> PS: Here is what the MPI standard states about
>>>>>> MPI_Intercomm_create:
>>>>>>> The function MPI_INTERCOMM_CREATE can be used to create an
>>>>>>> inter-communicator from two existing intra-communicators, in
>>>>>>> the following situation: At least one selected member from each
>>>>>>> group (the “group leader”) has the ability to communicate with
>>>>>>> the selected member from the other group; that is, a “peer”
>>>>>>> communicator exists to which both leaders belong, and each
>>>>>>> leader knows the rank of the other leader in this peer
>>>>>>> communicator. Furthermore, members of each group know the rank
>>>>>>> of their leader.
>>>>>>
>>>>>>
>>>>>> On Jun 1, 2011, at 05:00 , Frédéric Feyel wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have a problem using MPI_Intercomm_create.
>>>>>>>
>>>>>>> I have 5 tasks, let's say T0, T1, T2, T3, T4, resulting from two
>>>>>>> spawn operations by T0.
>>>>>>>
>>>>>>> So I have two intra-communicators:
>>>>>>>
>>>>>>> intra0 contains: T0, T1, T2
>>>>>>> intra1 contains: T0, T3, T4
>>>>>>>
>>>>>>> My goal is to make a collective loop to build a single
>>>>>>> intra-communicator containing T0, T1, T2, T3, T4.
>>>>>>>
>>>>>>> I tried to do it using MPI_Intercomm_create and
>>>>>>> MPI_Intercomm_merge calls, but without success (I always get MPI
>>>>>>> internal errors).
>>>>>>>
>>>>>>> What I am doing:
>>>>>>>
>>>>>>> on T0:
>>>>>>> *******
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra0,0,intra1,0,1,&new_com)
>>>>>>>
>>>>>>> on T1 and T2:
>>>>>>> **************
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra0,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>>>
>>>>>>> on T3 and T4:
>>>>>>> **************
>>>>>>>
>>>>>>> MPI_Intercomm_create(intra1,0,MPI_COMM_WORLD,0,1,&new_com)
>>>>>>>
>>>>>>>
>>>>>>> I'm certainly missing something. Could anybody help me to solve
>>>>>>> this problem?
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Frédéric.
>>>>>>>
>>>>>>> PS: of course I did an extensive web search without finding
>>>>>>> anything useful on my problem.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>>>>> Department of Computer Science          University of Houston
>>>>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
>>
>> <spawn-example.c>

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
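
For readers following the thread, the disjointness requirement quoted from the standard can be illustrated without any MPI calls. What follows is only a plain-C model, not code from the thread: the helper name groups_overlap and the integer task IDs (0 for T0, 1 for T1, and so on, and A=0, B=1, C=2 for George's example) are assumptions made up for illustration and are not part of MPI or Open MPI. It shows that intra0 = {T0, T1, T2} and intra1 = {T0, T3, T4} share T0, whereas the groups {A, B} and {C} share nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an MPI call): returns true when the two
 * task-ID lists share at least one member. The standard text quoted in
 * the thread prohibits such overlap between the local and remote groups
 * of an inter-communicator. */
bool groups_overlap(const int *a, size_t na, const int *b, size_t nb)
{
    for (size_t i = 0; i < na; ++i) {
        for (size_t j = 0; j < nb; ++j) {
            if (a[i] == b[j]) {
                return true;
            }
        }
    }
    return false;
}

/* Task sets from the original question: T0 appears in both. */
static const int intra0[] = {0, 1, 2};   /* T0, T1, T2 */
static const int intra1[] = {0, 3, 4};   /* T0, T3, T4 */

/* Groups from George's example, encoded here as A=0, B=1, C=2. */
static const int group_ab[] = {0, 1};
static const int group_c[]  = {2};
```

With these definitions, groups_overlap(intra0, 3, intra1, 3) is true and groups_overlap(group_ab, 2, group_c, 1) is false. In an actual MPI program, one way to make the same check would be to obtain both groups with MPI_Comm_group, intersect them with MPI_Group_intersection, and test the result with MPI_Group_size.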