Strange -- this almost implies a race condition somewhere.

I don't see anything wrong with your application (other than it doesn't free 
the communicators, but that's not an error).

Edgar -- the intercomm code is yours.  Could you have a look?



On Jan 23, 2012, at 11:03 AM, jody wrote:

> Hi
> I've got a really strange problem:
> 
> I've got an application which creates intercommunicators between a
> master and some workers.
> 
> When I run it on our cluster with 11 processes it works;
> when I run it with 12 processes it hangs inside MPI_Intercomm_create().
> 
> This is the hostfile:
>  squid_0.uzh.ch  slots=3  max-slots=3
>  squid_1.uzh.ch  slots=2  max-slots=2
>  squid_2.uzh.ch  slots=1  max-slots=1
>  squid_3.uzh.ch  slots=1  max-slots=1
>  triops.uzh.ch   slots=8 max-slots=8
> 
> Actually all squid_X have 4 cores, but I managed to reduce the number of
> processes needed for failure by making the above settings.
> 
> So with all available squid cores and 3 triops cores it works,
> but with 4 triops cores it hangs.
> 
> On the other hand, if I use all 16 squid cores (but no triops cores)
> it works, too.
> 
> If I start the application not from triops, but from another workstation,
> I get a similar pattern of Intercomm_create failures.
> 
> Note that with the above hostfile a simple HelloMPI works also with 14
> or more processes.
> 
> The frustrating thing is that this exact same code has worked before!
> 
> Does anybody have an explanation?
> Thank You
> 
> I managed to simplify the application:
> 
> #include <stdio.h>
> #include "mpi.h"
> 
> int main(int iArgC, char *apArgV[]) {
>    int iResult = 0;
>    int iNumProcs = 0;
>    int iID = -1;
> 
>    MPI_Init(&iArgC, &apArgV);
> 
>    MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
>    MPI_Comm_rank(MPI_COMM_WORLD, &iID);
> 
>    int iKey;
>    if (iID == 0) {
>        iKey = 0;
> 
>    } else {
>        iKey = 1;
>    }
> 
>    MPI_Comm  commInter1;
>    MPI_Comm  commInter2;
>    MPI_Comm  commIntra;
> 
>    MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);
> 
>    int iRankM;
>    MPI_Comm_rank(commIntra, &iRankM);
>    printf("Local rank: %d\n", iRankM);
> 
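>    /* Both sides use local leader 0 of commIntra; the remote leader is
>     * given as a rank in the peer communicator MPI_COMM_WORLD. */
>    switch (iKey) {
>    case 0:
>        /* master group: remote leader is world rank 1 (first worker) */
>        printf("Creating intercomm 1 for Master (%d)\n", iID);
>        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 01, &commInter2);
>        break;
>    case 1:
>        /* worker group: remote leader is world rank 0 (the master) */
>        printf("Creating intercomm 1 for FH (%d)\n", iID);
>        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 01, &commInter1);
>        break;
>    }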
> 
>    printf("finalizing\n");
>    MPI_Finalize();
> 
>    printf("exiting with %d\n", iResult);
>    return iResult;
> }
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

