On May 4, 2013, at 4:54 PM, Angel de Vicente <ang...@iac.es> wrote:

> Hi,
> 
> I have used OpenMPI before without any trouble, and have configured MPICH,
> MPICH2 and OpenMPI on many different machines, but we recently
> upgraded the OS to Fedora 17, and now I'm having trouble running an MPI
> code on two of our machines connected via a switch.
> 
> I thought perhaps the old installation was causing problems, so I
> reinstalled OpenMPI (1.6.4). I have no trouble running a parallel code
> on just one node, and I can ssh between these machines without a
> password, but when I try to run a parallel job spanning both machines,
> I get a hung mpiexec process on the submitting machine and an "orted"
> process on the other machine, and nothing progresses.
> 
> I guess it is an issue with libraries and/or different MPI versions (the
> machines have other site-wide MPI libraries installed), but I'm not sure
> how to debug it. I looked in the FAQ but didn't find anything relevant.
> The issue at
> http://www.open-mpi.org/faq/?category=running#intel-compilers-static is
> different, since I don't get any warnings or errors when running; all
> the processes are just stuck.
> 
> Is there any way to dump details of what OpenMPI is trying to do in each
> node, so I can see if it is looking for different libraries in each
> node, or something similar?

What I do is simply run "ssh <node> ompi_info -V" against each remote node and
compare the results; you should get the same answer everywhere.
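That check is easy to script. Below is a minimal POSIX sh sketch (not from the original thread) that runs the same "ompi_info -V" command on each node over ssh and flags any node whose output differs from the first one. The function name check_ompi_versions, the REMOTE_SHELL variable and the host names are hypothetical; it assumes passwordless ssh, as in the setup described above.

```shell
#!/bin/sh
# Sketch: run "ompi_info -V" on each node over ssh and flag any node whose
# answer differs from the first node's answer. Hosts come from the arguments.
# REMOTE_SHELL defaults to ssh; making it overridable is an assumption of this
# sketch so the comparison can be exercised without real nodes.
REMOTE_SHELL=${REMOTE_SHELL:-ssh}

check_ompi_versions() {
    first=1
    status=0
    for host in "$@"; do
        # Non-interactive "ssh host command"; stderr is captured as well, so a
        # "command not found" on one node also shows up as a mismatch.
        version=$($REMOTE_SHELL "$host" ompi_info -V </dev/null 2>&1)
        printf '%s: %s\n' "$host" "$version"
        if [ "$first" -eq 1 ]; then
            reference=$version
            first=0
        elif [ "$version" != "$reference" ]; then
            echo "MISMATCH on $host" >&2
            status=1
        fi
    done
    return "$status"
}

check_ompi_versions "$@"
```

For example, "sh check_ompi.sh node01 node02" (hypothetical script and host names) prints each node's answer and exits non-zero if they disagree. Using "ssh host command" rather than an interactive login matters here: a non-interactive shell may read different startup files, so it is closer to the environment the remote orted is started in.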

Another option in these situations is to configure with
--enable-orterun-prefix-by-default. If you install in the same location on every
node (e.g., on an NFS mount), this ensures that every node picks up that same library.
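A sketch of that build, assuming a hypothetical shared prefix /nfs/apps/openmpi-1.6.4 that is mounted at the same path on every node:

```shell
# Hypothetical shared install prefix, visible at the same path on all nodes.
PREFIX=/nfs/apps/openmpi-1.6.4

# --enable-orterun-prefix-by-default makes mpirun act as if "--prefix $PREFIX"
# had been given, so the remote orted gets $PREFIX/bin and $PREFIX/lib on its
# PATH and LD_LIBRARY_PATH instead of whichever MPI the remote shell finds.
./configure --prefix="$PREFIX" --enable-orterun-prefix-by-default
make all install
```

The same effect can be had per run with mpirun's --prefix option; the configure flag just makes it the default, so a site-wide MPI earlier in a node's PATH is less likely to be picked up by mistake.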


> 
> Thanks,
> -- 
> Ángel de Vicente
> http://angel-de-vicente.blogspot.com/
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

