It is a good question; I asked it myself at first and convinced myself it should be correct, but anyway I want to confirm it. Here is the code snippet of the program:

    ...
    int ranks[size];
    for (i = 0; i < size; ++i) {
        ranks[i] = i;
    }
    ...
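That the numbering is preserved when the first ranks are included in a new group can also be checked directly with MPI_Group_translate_ranks. A minimal, self-contained sketch (a separate test program, not part of my code):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int size, my_rank, p, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        p = (size < 8) ? size : 8;   /* take the first p world ranks */
        int *ranks = malloc(p * sizeof(int));
        int *translated = malloc(p * sizeof(int));
        for (i = 0; i < p; ++i)
            ranks[i] = i;

        MPI_Group world_group, sub_group;
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        MPI_Group_incl(world_group, p, ranks, &sub_group);

        /* map each world rank to its rank in the subgroup */
        MPI_Group_translate_ranks(world_group, p, ranks, sub_group, translated);
        if (my_rank == 0)
            for (i = 0; i < p; ++i)
                if (translated[i] != ranks[i])
                    printf("world rank %d becomes %d in the subgroup\n",
                           ranks[i], translated[i]);

        MPI_Group_free(&sub_group);
        MPI_Group_free(&world_group);
        free(ranks);
        free(translated);
        MPI_Finalize();
        return 0;
    }

Because ranks[] holds 0..p-1 in order, the translation is the identity mapping and nothing is printed. Back to the program, here is the loop that creates the working communicator: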
    for (p = 8; p <= size; p += 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (!grid_init(p, 1))
            continue;
        if ((p >= m) || (p >= k) || (p >= n))
            break;
        MPI_Group_incl(world_group, p, ranks, &working_group);
        MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);
        if (working_comm != MPI_COMM_NULL) {
            ...
            variant_run(&variant5, C, m, k, n, my_rank, p, working_comm);
            ...
            MPI_Group_free(&working_group);
            MPI_Comm_free(&working_comm);
        }
    }

Inside variant_run, it calls this function, where the error occurs:

    void Compute_SUMMA1(Matrix *A, Matrix *B, Matrix *C,
                        size_t M, size_t K, size_t N,
                        size_t my_rank, size_t size, MPI_Comm comm)
    {
        C->block_matrix = gsl_matrix_calloc(A->block_matrix->size1,
                                            B->block_matrix->size2);
        C->distribution_type = TwoD_Block;

        MPI_Comm grid_comm;
        int dim[2], period[2], reorder = 0, ndims = 2;
        int coord[2], id;

        dim[0] = global.PR;
        dim[1] = global.PC;
        period[0] = 0;
        period[1] = 0;

        int ss, rr;
        MPI_Group comm_group;
        MPI_Comm_group(comm, &comm_group);
        MPI_Group_size(comm_group, &ss);
        MPI_Group_rank(comm_group, &rr);

        if (ss == 6) {
            //printf("M %d K %d N %d
            //printf("my_rank in comm %d my_rank in world_comm %d\n", rr, my_rank);
            //printf(" comm size %d my_rank in comm %d my_rank in world_comm %d\n", ss, rr, my_rank);
            //printf("SUMMA ... PR %d PC %d\n", global.PR, global.PC);
        }
        //MPI_Barrier(comm);
        //if (my_rank == 0)
        //    printf("my_rank %d ndims %d dim[0] %d dim[1] %d period[0] %d period[1] %d reorder %d\n",
        //           my_rank, ndims, dim[0], dim[1], period[0], period[1], reorder);
        //if (comm == MPI_COMM_NULL)
        //    printf("my_rank %d comm is empty\n", my_rank);

        MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);

        MPI_Comm Acomm, Bcomm;
        // create the column (Bcomm) and row (Acomm) subgrids
        int remain[2];
        remain[0] = 1;
        remain[1] = 0;
        MPI_Cart_sub(grid_comm, remain, &Bcomm);
        remain[0] = 0;
        remain[1] = 1;
        MPI_Cart_sub(grid_comm, remain, &Acomm);
        ...
    }

As you can see, all ranks call grid_init, a global function that computes the grid dimensions and stores them in a global structure as PR and PC; for 24 processes it produces 4x6, and for 96 it produces 8x12. Since it is executed by all processes, and I checked the result for rank 0 and some other processes and it was correct, I assume it is correct for all the other processes, so the grid_comm that is passed to MPI_Cart_sub should be correct. The ranks in working_comm and in MPI_COMM_WORLD should also be the same, which follows from how the rank array is filled at the beginning of this snippet.
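Since I only spot-checked a few ranks, one cheap way to confirm the dimensions on every rank would be an assertion executed just before MPI_Cart_create; an integer divide-by-zero inside mca_topo_base_cart_coords, as in the backtrace below, is what one would expect if some rank reached MPI_Cart_sub with dim[0] or dim[1] equal to 0. A minimal sketch (the helper is hypothetical, not in my program):

    #include <stdio.h>
    #include <mpi.h>

    /* Hypothetical check, called in Compute_SUMMA1 right before
       MPI_Cart_create as: check_grid_dims(comm, global.PR, global.PC);
       It aborts loudly if any rank sees a zero or mismatched grid. */
    static void check_grid_dims(MPI_Comm comm, int pr, int pc)
    {
        int csize, crank;
        MPI_Comm_size(comm, &csize);
        MPI_Comm_rank(comm, &crank);
        if (pr <= 0 || pc <= 0 || pr * pc != csize) {
            fprintf(stderr, "rank %d: grid %d x %d does not match comm size %d\n",
                    crank, pr, pc, csize);
            MPI_Abort(comm, 1);
        }
    }

If any rank triggers this, the assumption that grid_init produced the same PR and PC everywhere does not hold.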
On Tue, Jan 10, 2012 at 5:25 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> This may be a dumb question, but are you 100% sure that the input values
> are correct?
>
> On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:
>
> > Hi Ralph, I changed the intel icc module from 12.1.0 to 11.1.069, the
> > previous default one used at the Neolith cluster. I submitted the job
> > and I am still waiting for the result. Here is the message of the
> > segmentation fault:
> >
> > [n764:29867] *** Process received signal ***
> > [n764:29867] Signal: Floating point exception (8)
> > [n764:29867] Signal code: Integer divide-by-zero (1)
> > [n764:29867] Failing at address: 0x2ba640e74627
> > [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> > [n764:29867] [ 1] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43) [0x2ba640e74627]
> > [n764:29867] [ 2] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5) [0x2ba640e74acd]
> > [n764:29867] [ 3] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) [0x2ba640e472d9]
> > [n764:29867] [ 4] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) [0x4088da]
> > [n764:29867] [ 5] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) [0x409058]
> > [n764:29867] [ 6] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> > [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
> > [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o [0x403fd9]
> > [n764:29867] *** End of error message ***
> >
> > When I run my application, sometimes I get this error and sometimes it
> > gets stuck in the middle.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/