[OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Dear all,
   My application is killed by a signal (integer divide-by-zero) when it
calls the MPI_Cart_sub routine. The program works as follows: I have 128
ranks and create a new communicator from the first 96 ranks via
MPI_Comm_create. Then I create an 8x12 grid by calling MPI_Cart_create.
When I call MPI_Cart_sub after creating the grid, I get that error.

The error also occurs when I use a communicator of 24 ranks and create a
4x6 grid. Can you please help me solve this?

Regards,
Anas


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Thanks Paul,
yes, I use Intel 12.1.0. The error is intermittent: it does not always
occur, but it does most of the time.
My program is large and consists of many interdependent files, so I don't
think an isolated code snippet would help. The program runs parallel matrix
multiplication algorithms. I don't know whether my code is the cause, but
for small matrix sizes the program runs to completion without error, while
for large inputs it hangs or gives that SIGV.

Regards,
Anas


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
 Hi Ralph, I changed the Intel icc module from 12.1.0 to 11.1.069, the
previous default on the Neolith cluster. I submitted the job and am still
waiting for the result. Here is the error message from the crash:

[n764:29867] *** Process received signal ***
[n764:29867] Signal: Floating point exception (8)
[n764:29867] Signal code: Integer divide-by-zero (1)
[n764:29867] Failing at address: 0x2ba640e74627
[n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
[n764:29867] [ 1]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43)
[0x2ba640e74627]
[n764:29867] [ 2]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5)
[0x2ba640e74acd]
[n764:29867] [ 3]
/software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35)
[0x2ba640e472d9]
[n764:29867] [ 4]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226)
[0x4088da]
[n764:29867] [ 5]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2)
[0x409058]
[n764:29867] [ 6]
/home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
[n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
[n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o
[0x403fd9]
[n764:29867] *** End of error message ***

When I run my application, sometimes I get this error and sometimes it
hangs partway through.
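
The backtrace places the fault in mca_topo_base_cart_coords, which maps a
rank to Cartesian coordinates. As a rough illustration of how such a mapping
can raise an integer divide-by-zero, here is a minimal sketch of a row-major
rank-to-coordinates conversion. It is not Open MPI's source: the name
rank_to_coords and its return convention are hypothetical. The point is only
that every step divides by a dimension size, so a zero or inconsistent entry
in the dimension array would fault if it were not guarded.

```c
#include <stddef.h>

/* Hypothetical helper, not Open MPI code: convert a rank to row-major
 * Cartesian coordinates. Returns 0 on success, or -1 if the inputs are
 * out of range or would force a division by zero. */
int rank_to_coords(int rank, int nprocs, int ndims,
                   const int *dims, int *coords)
{
    int remaining = nprocs; /* ranks spanned by the dimensions not yet consumed */
    int i;

    if (dims == NULL || coords == NULL || rank < 0 || rank >= nprocs)
        return -1;

    for (i = 0; i < ndims; ++i) {
        if (dims[i] <= 0)
            return -1;      /* unguarded, the division below would raise SIGFPE */
        remaining /= dims[i];
        if (remaining == 0)
            return -1;      /* product of dims exceeds nprocs */
        coords[i] = rank / remaining;
        rank %= remaining;
    }
    return 0;
}
```

With dims = {8, 12} and 96 ranks, rank 13 maps to coordinates (1, 1). With
dims = {8, 0}, or with dims = {8, 12} on only 24 ranks, the guards return -1
where an unguarded version would divide by zero.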


Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
That is a good question. I asked myself the same thing at first and decided
the values should be correct, but I want to confirm that anyway.
Here is the code snippet of the program:
...
int ranks[size];
for(i=0; i < size; ++i)
{
  ranks[i] = i;
}
...

for(p=8; p <= (size); p+=4)
{
  MPI_Barrier(MPI_COMM_WORLD);
  if(!grid_init(p, 1)) continue;
  if( (p>=m) || (p>=k) || (p>=n) )
    break;

  MPI_Group_incl(world_group, p, ranks, &working_group);
  MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);

  if(working_comm != MPI_COMM_NULL)
  {
    ...
    variant_run(&variant5, C, m, k, n, my_rank, p, working_comm);
    ...
    MPI_Group_free(&working_group);
    MPI_Comm_free(&working_comm);
  }
}

variant_run calls the following function, which is where the error occurs:
void Compute_SUMMA1(Matrix* A, Matrix* B, Matrix *C, size_t M, size_t K,
                    size_t N, size_t my_rank, size_t size, MPI_Comm comm)
{
  C->block_matrix = gsl_matrix_calloc(A->block_matrix->size1,
                                      B->block_matrix->size2);
  C->distribution_type = TwoD_Block;

  MPI_Comm grid_comm;
  int dim[2], period[2], reorder = 0, ndims = 2;
  int coord[2], id;

  dim[0] = global.PR; dim[1] = global.PC;
  period[0] = 0; period[1] = 0;

  int ss, rr;
  MPI_Group comm_group;
  MPI_Comm_group(comm, &comm_group );
  MPI_Group_size( comm_group, &ss);
  MPI_Group_rank( comm_group, &rr);
  if(ss == 6)
  {
    //printf("M %d K %d N %d
    //printf("my_rank in comm %d   my_rank in world_comm %d\n", rr, my_rank);
    //printf(" comm size %d  my_rank in comm %d   my_rank in world_comm %d\n", ss, rr, my_rank);
    //printf("SUMMA ... PR %d  PC %d\n", global.PR, global.PC);
  }
  //MPI_Barrier(comm);
  // if(my_rank == 0)
  // printf("my_rank %d  ndims %d  dim[0] %d  dim[1] %d  period[0] %d  period[1] %d  reorder %d\n",
  //        my_rank, ndims, dim[0], dim[1], period[0], period[1], reorder);
  // if(comm == MPI_COMM_NULL)
  //   printf("my_rank %d  comm is empty\n", my_rank);
  //
  MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);

  MPI_Comm Acomm, Bcomm;

  // create column subgrids
  int remain[2]; //, mdims, dims[2], row_coords[2];
  remain[0] = 1;
  remain[1] = 0;
  MPI_Cart_sub(grid_comm, remain, &Bcomm);

  remain[0] = 0;
  remain[1] = 1;
  MPI_Cart_sub(grid_comm, remain, &Acomm);
  ...
}


As you can see, all ranks call grid_init, a global function that computes
the grid dimensions: for 24 ranks it produces 4x6 and for 96 ranks 8x12,
and it stores the result in a global structure as PR and PC. Since it is
executed by all processes, and I checked that the result is correct on
rank 0 and some other processes, I assume it is correct on all other
processes as well.

So grid_comm, which is the input to MPI_Cart_sub, is correct. The ranks in
working_comm and in MPI_COMM_WORLD should be the same, which follows from
how the ranks array is filled at the beginning of this code snippet.
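
One way to test that assumption on every rank, rather than on a sample, is
to validate the dimensions against the communicator size just before
MPI_Cart_create. A minimal sketch, assuming PR and PC are the plain integers
stored in the global structure; grid_dims_match is a hypothetical helper
name, not part of the program or of MPI.

```c
#include <stdio.h>

/* Hypothetical check: returns 1 when a pr x pc grid exactly covers
 * comm_size ranks, and 0 otherwise, reporting the mismatch on stderr
 * together with the caller's rank so the offending process is visible. */
int grid_dims_match(int pr, int pc, int comm_size, int world_rank)
{
    /* Multiply in long long so an absurd pr or pc cannot overflow int. */
    if (pr <= 0 || pc <= 0 || (long long)pr * pc != comm_size) {
        fprintf(stderr, "rank %d: grid %d x %d does not match comm size %d\n",
                world_rank, pr, pc, comm_size);
        return 0;
    }
    return 1;
}
```

In Compute_SUMMA1 this could be called with the value ss obtained from
MPI_Group_size, for example
if (!grid_dims_match(global.PR, global.PC, ss, (int)my_rank)) MPI_Abort(comm, 1);
The check is deliberately stricter than MPI requires: MPI_Cart_create accepts
a grid smaller than the communicator, but the leftover ranks then receive
MPI_COMM_NULL, and passing that to MPI_Cart_sub is erroneous.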



On Tue, Jan 10, 2012 at 5:25 PM, Jeff Squyres  wrote:

> This may be a dumb question, but are you 100% sure that the input values
> are correct?

Re: [OMPI users] SIGV at MPI_Cart_sub

2012-01-10 Thread Anas Al-Trad
Anyway, after compiling my code with icc/11.1.069, the job runs without
hanging and without the SIGV that occurred when using the icc/12.1.0
module.

I should also point out that with icc/12.1.0 I was getting strange output
or hangs, and I worked around them by copying parameters into differently
named local variables inside the function. For example, if I define a
function like this

time( ..., size_t *P, ...){}

and call it like this:
time(..,p,..);

then I have to copy *P into a local variable inside the time function as follows:
time( ..., size_t *P, ...)
{
int bestP = *P; // and maybe again as the later bug that I solved
int bP = bestP;
// then start using bP :)
...
}

Thanks for the help, everyone. It seems the problem goes away when compiling
with the older compiler.