On 23 May 2012 19:04, Jeff Squyres <jsquy...@cisco.com> wrote:
> Thanks for all the info!
>
> But still, can we get a copy of the test in C? That would make it
> significantly easier for us to tell if there is a problem with Open MPI --
> mainly because we don't know anything about the internals of mpi4py.
>
FYI, this test ran fine with previous (but recent, say 1.5.4) Open MPI
versions, but fails with 1.6. The test also runs fine with MPICH2.

Sorry for the delay, but writing the test in C takes some time compared to
Python. Also, it is a bit tiring for me to recode my tests in C every time a
new issue shows up in code I'm confident about, but I understand you really
need something reproducible, so here you go.

Find attached a C version of the test. See the output below: the test runs
fine and shows the expected output for np=2,3,4,6,7, but something funny
happens for np=5.

[dalcinl@trantor tmp]$ mpicc allgather.c
[dalcinl@trantor tmp]$ mpiexec -n 2 ./a.out
[0] - [0] a
[1] - [0] a
[dalcinl@trantor tmp]$ mpiexec -n 3 ./a.out
[0] - [0] ab
[1] - [0] a
[2] - [1] a
[dalcinl@trantor tmp]$ mpiexec -n 4 ./a.out
[3] - [1] ab
[0] - [0] ab
[1] - [1] ab
[2] - [0] ab
[dalcinl@trantor tmp]$ mpiexec -n 6 ./a.out
[4] - [1] abc
[5] - [2] abc
[0] - [0] abc
[1] - [1] abc
[2] - [2] abc
[3] - [0] abc
[dalcinl@trantor tmp]$ mpiexec -n 7 ./a.out
[5] - [2] abc
[6] - [3] abc
[0] - [0] abcd
[1] - [1] abcd
[2] - [2] abcd
[3] - [0] abc
[4] - [1] abc
[dalcinl@trantor tmp]$ mpiexec -n 5 ./a.out
[trantor:13791] *** An error occurred in MPI_Allgatherv
[trantor:13791] *** on communicator
[trantor:13791] *** MPI_ERR_COUNT: invalid count argument
[trantor:13791] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
mpiexec has exited due to process rank 2 with PID 13789 on
node trantor exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[trantor:13786] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[trantor:13786] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

--
Lisandro Dalcin
---------------
CIMEC (INTEC/CONICET-UNL)
Predio CONICET-Santa Fe
Colectora RN 168 Km 472, Paraje El Pozo
3000 Santa Fe, Argentina
Tel: +54-342-4511594 (ext 1011)
Tel/Fax: +54-342-4511169
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  int thlevel;

  void *sbuf = NULL;
  int scount = 0;
  MPI_Datatype stype = MPI_BYTE;

  void *rbuf = NULL;
  int *rcounts = NULL, sumrcounts;
  int *rdispls = NULL;
  MPI_Datatype rtype = MPI_BYTE;

  MPI_Comm comm;
  int worldrank, worldsize;
  int rank, size;
  int loop, i;

  //MPI_Init_thread(0,0,MPI_THREAD_MULTIPLE,&thlevel);
  MPI_Init(0,0);

  MPI_Comm_size(MPI_COMM_WORLD, &worldsize);
  MPI_Comm_rank(MPI_COMM_WORLD, &worldrank);

  {
    MPI_Comm intracomm;
    int color, local_leader, remote_leader;
    if (worldsize < 2) goto end;
    if (worldrank < worldsize/2) {
      color = 0;
      local_leader = 0;
      remote_leader = worldsize/2;
    } else {
      color = 1;
      local_leader = 0;
      remote_leader = 0;
    }
    MPI_Comm_split(MPI_COMM_WORLD, color, 0, &intracomm);
    MPI_Intercomm_create(intracomm, local_leader,
                         MPI_COMM_WORLD, remote_leader,
                         0, &comm);
    MPI_Comm_free(&intracomm);
  }

  MPI_Comm_rank(comm, &rank);
  MPI_Comm_remote_size(comm, &size);

  for (loop=0; loop<1; loop++) {

    scount = 1;
    sbuf = malloc(scount*sizeof(char));
    ((char*)sbuf)[0] = 'a'+rank;

    rcounts = malloc(size*sizeof(int));
    MPI_Allgather(&scount, 1, MPI_INT,
                  rcounts, 1, MPI_INT,
                  comm);

    rdispls = malloc(size*sizeof(int));
    sumrcounts = 0;
    for (i=0; i<size; i++) {
      rdispls[i] = sumrcounts;
      sumrcounts += rcounts[i];
    }

    rbuf = malloc(sumrcounts*sizeof(char));
    MPI_Allgatherv(sbuf, scount, stype,
                   rbuf, rcounts, rdispls, rtype,
                   comm);

    MPI_Barrier(MPI_COMM_WORLD);
    printf("[%d] - [%d] ", worldrank, rank);
    for (i=0; i<sumrcounts; i++) {
      printf("%c", ((char*)rbuf)[i]);
    }
    printf("\n"); fflush(stdout);

    free(sbuf);
    free(rbuf);
    free(rcounts);
    free(rdispls);
  }

 end:
  MPI_Finalize();
}
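For what it's worth, below is a small sequential sketch (my own addition, not
part of the attached allgather.c) that just replays the split arithmetic of
the test to show what each rank should print for a given np: world ranks
below np/2 form one group, the rest form the other, and every process should
receive one byte 'a'+rank from each rank of the remote group.

#include <stdio.h>
#include <stdlib.h>

/* Sequential sketch (no MPI): replays the split logic of allgather.c
 * and prints the line each world rank is expected to produce. */
int main(int argc, char *argv[])
{
  int np = (argc > 1) ? atoi(argv[1]) : 5;
  int worldrank, i;
  for (worldrank = 0; worldrank < np; worldrank++) {
    int in_first  = (worldrank < np/2);
    int group_lo  = in_first ? 0         : np/2;  /* first world rank of my group */
    int remote_sz = in_first ? np - np/2 : np/2;  /* size of the remote group */
    int rank      = worldrank - group_lo;         /* my rank in the inter-communicator */
    printf("[%d] - [%d] ", worldrank, rank);
    for (i = 0; i < remote_sz; i++)
      printf("%c", 'a' + i);                      /* remote rank i contributes 'a'+i */
    printf("\n");
  }
  return 0;
}

Run with argument 5, this prints "abc" for the two ranks in the first group
and "ab" for the three ranks in the second group, which matches the pattern
of the np=7 run above; so the counts themselves look perfectly valid for
np=5, yet Open MPI 1.6 reports MPI_ERR_COUNT.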