Sounds like a typical deadlock situation: all processors are waiting for one another.
Not a specialist, but from what I know, if the messages are small enough they'll be buffered by the kernel/hardware (the eager protocol) and there is no deadlock. That's why it might work for small messages and/or certain MPI implementations.

Solutions:
- Come up with a global communication schedule such that whenever one processor sends, its partner is already receiving.
- Use MPI_Bsend. Might be slower.
- Use MPI_Isend/MPI_Irecv (but then you'll have to make sure the buffers stay valid for the duration of the communication); see the sketch at the end of this message.

On Friday 13 June 2008 01:55, zach wrote:
> I have a weird problem that shows up when I use LAM or Open MPI but not
> MPICH.
>
> I have a parallelized code working on a really large matrix. It
> partitions the matrix column-wise and ships the pieces off to processors,
> so any given processor is working on a matrix with the same number of
> rows as the original but a reduced number of columns. Each processor
> needs to send a single column vector entry
> from its own matrix to the adjacent processor and vice versa as part
> of the algorithm.
>
> I have found that, depending on the number of rows of the matrix (that is,
> the size of the vector being sent using MPI_Send/MPI_Recv), the
> simulation will hang.
> Only when I reduce this dimension below a certain maximum will
> the sim run properly. I have also found that this magic number differs
> depending on the system I am using, e.g. my home quad-core box or a remote
> cluster.
>
> As I mentioned, I have not had this issue with MPICH. I would like to
> understand why it is happening rather than just defect over to MPICH
> to get by.
>
> Any help would be appreciated!
> zach
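
For the MPI_Isend/MPI_Irecv option, here is a minimal sketch of a deadlock-free neighbor exchange. The column length, neighbor ranks, datatype, and tag are illustrative assumptions, not taken from zach's code; the point is only that both requests are posted before either is waited on, so no rank sits inside a blocking MPI_Send while its partner is also blocked sending.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumed column length; large enough to exceed the eager threshold,
     * which is where the blocking MPI_Send/MPI_Recv version starts to hang. */
    const int n = 100000;
    double *send_right = malloc(n * sizeof(double));
    double *recv_left  = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) send_right[i] = rank;

    /* Assumed ring-style neighbors for illustration. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    const int tag = 0;

    MPI_Request reqs[2];
    /* Post the receive and the send, then wait on both together. */
    MPI_Irecv(recv_left,  n, MPI_DOUBLE, left,  tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_right, n, MPI_DOUBLE, right, tag, MPI_COMM_WORLD, &reqs[1]);

    /* The buffers must stay valid (and untouched) until the requests complete. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received a column from rank %d\n", rank, left);

    free(send_right);
    free(recv_left);
    MPI_Finalize();
    return 0;
}

MPI_Sendrecv would also work here and avoids managing requests by hand; either way the exchange is guaranteed to make progress regardless of message size or implementation.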