Sounds like a typical deadlock situation. All processors are waiting for one 
another.

I'm not a specialist, but from what I know: if the messages are small enough they 
are buffered by the MPI library (or handed off to the kernel/hardware), so MPI_Send 
can return before the matching receive is posted and there is no deadlock. Above 
that threshold the send blocks until the receiver is ready, and both sides end up 
waiting on each other. That's why it might work for small messages and/or certain 
MPI implementations, and why the magic number differs between systems.
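
(I haven't seen the code, so the sketch below is only my guess at the pattern; the 
buffer names and the column length N are made up.) Both ranks call MPI_Send first 
and MPI_Recv second, which works while N is small enough to be buffered and hangs 
once the send has to wait for the receive:

/* Guessed deadlock pattern: every rank sends to its partner first,
 * then receives.  For large N both ranks block inside MPI_Send. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int N = 100000;   /* made-up column length; large enough to exceed the eager limit */
    int rank, size;
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = calloc(N, sizeof(double));
    recvbuf = calloc(N, sizeof(double));

    /* pair each even rank with the next odd rank */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (partner < size) {
        MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);  /* both ranks sit here */
        MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}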

Solutions:
- come up with a global communication schedule such that whenever one processor 
sends, its partner is posting the matching receive (e.g. even ranks send first 
while odd ranks receive, then swap).
- use MPI_Bsend (buffered send). Might be slower, and you have to attach a large 
enough buffer with MPI_Buffer_attach first.
- use MPI_Isend/MPI_Irecv and wait on the requests (but then you'll have to make 
sure the buffers stay valid for the duration of the communication) -- see the 
sketch after this list.
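
Here is a rough sketch of the third option with the same made-up names as above 
(not a drop-in for your code): post the receive and the send without blocking, then 
MPI_Waitall before touching either buffer again. Neither rank blocks in the send, 
so the exchange completes regardless of the vector length.

/* Sketch of the MPI_Isend/MPI_Irecv fix: start both transfers, then wait
 * for both; the buffers must not be reused until MPI_Waitall returns. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int N = 100000;   /* made-up column length */
    int rank, size;
    double *sendbuf, *recvbuf;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = calloc(N, sizeof(double));
    recvbuf = calloc(N, sizeof(double));

    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
    if (partner < size) {
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* buffers are safe to reuse only after this */
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}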

On Friday 13 June 2008 01:55, zach wrote:
> I have a weird problem that shows up when i use LAM or OpenMPI but not
> MPICH.
>
> I have a parallelized code working on a really large matrix. It
> partitions the matrix column-wise and ships them off to processors,
> so, any given processor is working on a matrix with the same number of
> rows as the original but reduced number of columns. Each processor
> needs to send a single column vector entry
> from its own matrix to the adjacent processor and vice versa as part
> of the algorithm.
>
> I have found that depending on the number of rows of the matrix (i.e.
> the size of the vector being sent with MPI_Send/MPI_Recv), the
> simulation will hang.
> Only when I reduce this dimension below a certain maximum will the sim
> run properly. I have also found that this magic number differs
> depending on the system I am using, eg my home quad-core box or remote
> cluster.
>
> As I mentioned, I have not had this issue with MPICH. I would like to
> understand why it is happening rather than just defect over to MPICH
> to get by.
>
> Any help would be appreciated!
> zach
