Hi,

I've got a really strange problem: I have an application that creates intercommunicators between a master and some workers.
When I run it on our cluster with 11 processes it works; when I run it with 12 processes it hangs inside MPI_Intercomm_create().

This is the hostfile:

    squid_0.uzh.ch slots=3 max-slots=3
    squid_1.uzh.ch slots=2 max-slots=2
    squid_2.uzh.ch slots=1 max-slots=1
    squid_3.uzh.ch slots=1 max-slots=1
    triops.uzh.ch slots=8 max-slots=8

All squid_X machines actually have 4 cores, but with the settings above I was able to reduce the number of processes needed to trigger the failure. So with all available squid cores and 3 triops cores it works, but with 4 triops cores it hangs. On the other hand, if I use all 16 squid cores (but no triops cores) it works, too.

If I start the application from another workstation instead of triops, I see a similar pattern of MPI_Intercomm_create failures.

Note that with the above hostfile a simple HelloMPI also works with 14 or more processes.

The frustrating thing is that this exact same code has worked before!

Does anybody have an explanation?

Thank you

I managed to simplify the application:

    #include <stdio.h>
    #include "mpi.h"

    int main(int iArgC, char *apArgV[]) {
        int iResult = 0;
        int iNumProcs = 0;
        int iID = -1;

        MPI_Init(&iArgC, &apArgV);
        MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
        MPI_Comm_rank(MPI_COMM_WORLD, &iID);

        /* Rank 0 is the master (group 0); all other ranks are workers (group 1). */
        int iKey;
        if (iID == 0) {
            iKey = 0;
        } else {
            iKey = 1;
        }

        MPI_Comm commInter1;
        MPI_Comm commInter2;
        MPI_Comm commIntra;

        /* iKey is used as the split color, the world rank as the ordering key. */
        MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);

        int iRankM;
        MPI_Comm_rank(commIntra, &iRankM);
        printf("Local rank: %d\n", iRankM);

        switch (iKey) {
        case 0:
            /* Master side: the remote leader is world rank 1. */
            printf("Creating intercomm 1 for Master (%d)\n", iID);
            MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 01, &commInter2);
            break;
        case 1:
            /* Worker side: the remote leader is world rank 0. */
            printf("Creating intercomm 1 for FH (%d)\n", iID);
            MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 01, &commInter1);
            break;
        }

        printf("finalizing\n");
        MPI_Finalize();

        printf("exiting with %d\n", iResult);
        return iResult;
    }
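
To double-check the leader arguments without needing MPI at all, the group layout that the reproducer sets up can be tabulated in plain C. This is only a sketch: split_color, local_rank, remote_leader and print_layout are hypothetical helper names of my own, not MPI functions, and they assume MPI_Comm_split orders each group by the world rank that is passed as the key.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helpers, not MPI calls: they only mirror the arguments
   the reproducer passes to MPI_Comm_split and MPI_Intercomm_create. */

/* Color given to MPI_Comm_split: world rank 0 is the master group (0),
   every other rank is in the worker group (1). */
static int split_color(int world_rank)
{
    assert(world_rank >= 0);
    return world_rank == 0 ? 0 : 1;
}

/* Rank inside commIntra. The split key is the world rank, so workers keep
   their relative order and world rank 1 becomes local rank 0. */
static int local_rank(int world_rank)
{
    assert(world_rank >= 0);
    return world_rank == 0 ? 0 : world_rank - 1;
}

/* World rank passed as the remote leader to MPI_Intercomm_create. */
static int remote_leader(int world_rank)
{
    return split_color(world_rank) == 0 ? 1 : 0;
}

/* Print one line per world rank for a job of the given size. */
static void print_layout(int world_size)
{
    assert(world_size >= 2); /* at least one master and one worker */
    for (int r = 0; r < world_size; ++r) {
        printf("world %2d: color %d, local rank %2d, remote leader %d\n",
               r, split_color(r), local_rank(r), remote_leader(r));
    }
}
```

Calling print_layout(12) from a main() lists the 12-process case. Under the assumption above, each side's remote leader is local rank 0 of the other group, so the arguments look consistent to me and do not by themselves explain the hang.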