Got it.  Building a new Open MPI solved it.

I don't know if the standard Ubuntu install was the problem or if it just 
didn't like the slightly later kernel.
There seems to be reason to be suspicious of the Ubuntu 10.10 Open MPI builds if 
you have anything unusual in your system.
Thanks.
--- On Tue, 12/7/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Mpirun only works when n< 3
To: randolph_pul...@yahoo.com.au
Cc: "Open MPI Users" <us...@open-mpi.org>
Received: Tuesday, 12 July, 2011, 10:29 PM

On Jul 11, 2011, at 11:31 AM, Randolph Pullen wrote:

> There are no firewalls by default.  I can ssh between both nodes without a 
> password so I assumed that all is good with the comms.

FWIW, ssh'ing is different than "comms" (by which I assume you mean opening random 
TCP sockets between two servers).
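
If you want to rule out raw TCP connectivity (Open MPI's TCP transport uses random 
high ports by default, so working ssh alone doesn't prove much), a quick sanity 
check -- assuming netcat is installed on both machines; flags vary a bit between 
netcat flavors -- would be something like:

    A$ nc -l 5000        # some netcat variants need: nc -l -p 5000
    B$ nc A 5000         # type a line; it should appear on A's terminal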

> I can also get both nodes to participate in the ring program at the same time.
> It's just that I am limited to only 2 processes if they are split between the
> nodes, i.e.:
> mpirun -H A,B ring               (works)
> mpirun -H A,A,A,A,A,A,A ring     (works)
> mpirun -H B,B,B,B ring           (works)
> mpirun -H A,B,A ring             (hangs)

It is odd that A,B works and A,B,A does not.
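
If it helps narrow things down, one option for the hanging case -- assuming TCP is 
the transport actually in use between the nodes; the verbosity level here is just 
an example -- is to force the TCP BTL and turn up its debugging output to see 
where it stalls:

    mpirun --mca btl tcp,self --mca btl_base_verbose 30 -H A,B,A ring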

> I have discovered slightly more information:
> When I replace node 'B' from the new cluster with node 'C' from the old
> cluster, I get similar behavior but with an error message:
> mpirun -H A,A,A,A,A,A,A ring     (works from either node)
> mpirun -H C,C,C ring             (works from either node)
> mpirun -H A,C ring               (fails from either node:)
> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
> [C:23465] ***  An error occurred in MPI_Recv
> [C:23465] ***  on communicator MPI_COMM_WORLD
> [C:23465] ***  MPI_ERRORS_ARE_FATAL (your job will now abort)
> Process 0 sent to 1
> ----------------------------------
> Running this on either node A or C produces the same result.
> Node C runs Open MPI 1.4.1 and is an ordinary dual core on FC10, not an i5
> 2400 like the others.
> All the binaries are compiled on FC10 with gcc 4.3.2.


Are you sure that all the versions of Open MPI being used on all nodes are 
exactly the same?  I.e., are you finding/using Open MPI v1.4.1 on all nodes?
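
One quick way to check -- assuming ompi_info and mpirun sit on the default PATH 
that a non-interactive ssh gives you on each node -- is something like:

    ssh A 'which mpirun; ompi_info | grep "Open MPI:"'
    ssh B 'which mpirun; ompi_info | grep "Open MPI:"'
    ssh C 'which mpirun; ompi_info | grep "Open MPI:"'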

Are the nodes homogeneous in terms of software?  If they're heterogeneous in 
terms of hardware, you *might* need to have separate OMPI installations on each 
machine (vs., for example, a network-filesystem-based install shared to all 3) 
because the compiler's optimizer may produce code tailored for one of the 
machines, and it may therefore fail in unexpected ways on the other(s).  The 
same is true for your executable.
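
(For reference, a separate per-machine build from the source tarball is just the 
usual sequence; the prefix below is only an example -- pick whatever suits you, 
and make sure PATH and LD_LIBRARY_PATH on each node point at the install you 
intend to use:)

    ./configure --prefix=/opt/openmpi-1.4.1
    make all install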

See this FAQ entry about heterogeneous setups:

    http://www.open-mpi.org/faq/?category=building#where-to-install

...hmm.  I could have sworn we had more on the FAQ about heterogeneity, but 
perhaps not.  The old LAM/MPI FAQ on heterogeneity is somewhat outdated, but 
most of its concepts are directly relevant to Open MPI as well:

    http://www.lam-mpi.org/faq/category11.php3

I should probably copy most of that LAM/MPI heterogeneous FAQ to the Open MPI 
FAQ, but it'll be waaay down on my priority list.  :-(  If anyone could help 
out here, I'd be happy to point them in the right direction to convert the 
LAM/MPI FAQ PHP to Open MPI FAQ PHP...  

To be clear: the PHP conversion will be pretty trivial; I stole heavily from 
the LAM/MPI FAQ PHP to create the Open MPI FAQ PHP -- but there are points 
where the LAM/MPI heterogeneity text needs updating; revising all that content 
will take an hour or two.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
