On Mar 10, 2006, at 6:01 AM, Cezary Sliwa wrote:
> http://www.open-mpi.org/community/lists/users/2006/02/0712.php
> I have a simple program in which the rank 0 task dispatches compute
> tasks to other processes. It works fine on one 4-way SMP machine, but
> when I try to run it on two nodes, the processes on the other machine
> seem to spin in a loop inside MPI_SEND (a message is not delivered).

You still haven't answered whether your application does any of the
things that I mentioned in my first post. :-) Have you examined the
code to ensure that your application does not rely on buffering?
Code that relies on buffering can easily block in some situations and
not in others (such as on-node vs. off-node communication).
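
To illustrate what "relying on buffering" means, here is a toy model
in plain Python threads. It is not MPI code and not how Open MPI is
implemented. It assumes a simplified transport in which a send returns
immediately when the payload fits under a hypothetical eager_limit and
otherwise blocks until the peer posts its receive. Link, exchange, the
byte sizes, and the timeout are all made-up names and values. Both
peers send first and receive second, which is the pattern to look for
in your application:

```python
import queue
import threading


class Link:
    """One-directional channel. eager_limit is a hypothetical buffering threshold."""

    def __init__(self, eager_limit):
        self.eager_limit = eager_limit
        self.box = queue.Queue()
        self.received = threading.Event()

    def send(self, payload, timeout):
        self.box.put(payload)
        if len(payload) <= self.eager_limit:
            # Small message: it is buffered, so the send completes at once.
            return True
        # Large message: the send completes only once the peer has received it.
        return self.received.wait(timeout)

    def recv(self, timeout):
        payload = self.box.get(timeout=timeout)
        self.received.set()
        return payload


def exchange(eager_limit, size, timeout=0.5):
    """Two peers each send first, then receive. Returns who finished."""
    a_to_b, b_to_a = Link(eager_limit), Link(eager_limit)
    done = [False, False]

    def peer(me, out_link, in_link):
        # The receive is reached only if the send completes.
        if out_link.send(b"x" * size, timeout):
            in_link.recv(timeout)
            done[me] = True

    threads = [
        threading.Thread(target=peer, args=(0, a_to_b, b_to_a)),
        threading.Thread(target=peer, args=(1, b_to_a, a_to_b)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    print(exchange(eager_limit=1024, size=16))    # message fits in the buffer
    print(exchange(eager_limit=1024, size=4096))  # message exceeds the buffer
```

In this model the same program finishes when the message is buffered
and stalls in the send when it is not. The timeout only exists so the
demonstration terminates; a real blocking send would simply hang.
Since the buffering threshold can differ between transports, a program
written this way may work on one machine and hang across two.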
If it does not, can you send the information requested by the
"Getting Help" section of the Open MPI web site? This will give us
more details that will hopefully enable us to resolve your problem:
http://www.open-mpi.org/community/help/
One additional question: are you using TCP as your communications
network, and if so, does either of the nodes that you are running on
have more than one TCP NIC? We recently fixed a bug for situations
where at least one node is on multiple TCP networks, not all of which
are shared by the nodes where the peer MPI processes are running.
If this situation describes your network setup (e.g., a cluster where
the head node has a public and a private network, and where the
cluster nodes only have a private network -- and your MPI processes
are running on both the head node and a compute node), can you try
upgrading to the latest 1.0.2 release candidate tarball:
http://www.open-mpi.org/software/ompi/v1.0/
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/