Hi
I've got a really strange problem:

I've got an application which creates intercommunicators between a
master and some workers.

When I run it on our cluster with 11 processes it works;
when I run it with 12 processes it hangs inside MPI_Intercomm_create().

This is the hostfile:
  squid_0.uzh.ch  slots=3  max-slots=3
  squid_1.uzh.ch  slots=2  max-slots=2
  squid_2.uzh.ch  slots=1  max-slots=1
  squid_3.uzh.ch  slots=1  max-slots=1
  triops.uzh.ch   slots=8  max-slots=8

Actually all squid_X have 4 cores, but I managed to reduce the number of
processes needed to trigger the failure by using the above settings.

So with all available squid cores and 3 triops cores it works,
but with 4 triops cores it hangs.

On the other hand, if I use all 16 squid cores (but no triops cores)
it works, too.

If I start the application not from triops, but from another workstation,
I see a similar pattern of MPI_Intercomm_create() failures.

Note that with the above hostfile a simple HelloMPI also works with 14
or more processes.

The frustrating thing is that this exact same code has worked before!

Does anybody have an explanation?
Thank You

I managed to simplify the application:

#include <stdio.h>
#include "mpi.h"

int main(int iArgC, char *apArgV[]) {
    int iResult = 0;
    int iNumProcs = 0;
    int iID = -1;

    MPI_Init(&iArgC, &apArgV);

    MPI_Comm_size(MPI_COMM_WORLD, &iNumProcs);
    MPI_Comm_rank(MPI_COMM_WORLD, &iID);

    /* Used as the color in MPI_Comm_split: 0 = master (world rank 0),
       1 = all workers. */
    int iKey = (iID == 0) ? 0 : 1;

    MPI_Comm  commInter1;
    MPI_Comm  commInter2;
    MPI_Comm  commIntra;

    MPI_Comm_split(MPI_COMM_WORLD, iKey, iID, &commIntra);

    int iRankM;
    MPI_Comm_rank(commIntra, &iRankM);
    printf("Local rank: %d\n", iRankM);

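    switch (iKey) {
    case 0:
        /* Master side: local leader is local rank 0,
           remote leader is world rank 1; tag is 1. */
        printf("Creating intercomm 1 for Master (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 1, 1, &commInter2);
        break;
    case 1:
        /* Worker side: local leader is local rank 0,
           remote leader is world rank 0; tag is 1. */
        printf("Creating intercomm 1 for FH (%d)\n", iID);
        MPI_Intercomm_create(commIntra, 0, MPI_COMM_WORLD, 0, 1, &commInter1);
        break;
    }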

    printf("finalizing\n");
    MPI_Finalize();

    printf("exiting with %d\n", iResult);
    return iResult;
}
