On May 4, 2013, at 4:54 PM, Angel de Vicente <ang...@iac.es> wrote:

> Hi,
>
> I have used OpenMPI before without any troubles, and configured MPICH,
> MPICH2 and OpenMPI in many different machines before, but recently we
> upgraded the OS to Fedora 17, and now I'm having trouble running an MPI
> code in two of our machines connected via a switch.
>
> I thought perhaps the old installation was giving problems, so I
> reinstalled OpenMPI (1.6.4) and I have no trouble when running a
> parallel code in just one node. I also don't have any trouble ssh'ing
> (without need for password) between these machines, but when I try to
> run a parallel job spanning both machines, I get a hanged mpiexec
> process in the submitting machine, and an "orted" process in the other
> machine, but nothing moves.
>
> I guess it is an issue with libraries and/or different MPI versions (the
> machines have other site-wide MPI libraries installed), but I'm not sure
> how to debug the issue. I looked in the FAQ, but I didn't find anything
> relevant. Issue
> http://www.open-mpi.org/faq/?category=running#intel-compilers-static is
> different, since I don't get any warning or errors when running, just
> all processes stuck.
>
> Is there any way to dump details of what OpenMPI is trying to do in each
> node, so I can see if it is looking for different libraries in each
> node, or something similar?
What I do is simply run "ssh <node> ompi_info -V" against each remote node and compare the results; you should get the same answer everywhere. If the answers differ, the nodes are picking up different Open MPI installations.

Another option in these situations is to configure Open MPI with --enable-orterun-prefix-by-default. If you install in the same location on each node (e.g., on an NFS mount), this will ensure that every node gets that same library.

> Thanks,
> --
> Ángel de Vicente
> http://angel-de-vicente.blogspot.com/
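If there are more than a couple of nodes, the ompi_info comparison can be scripted. Below is a minimal sketch in Python, assuming passwordless ssh to every node and that ompi_info is found on the PATH that a non-interactive ssh command sees. The helper names (query_version, group_by_output, report) and the host names in the usage line are made up for illustration; they are not part of Open MPI.

```python
import subprocess
import sys
from collections import defaultdict


def query_version(host, timeout=15):
    """Run `ompi_info -V` on `host` via ssh and return its trimmed stdout."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    cmd = ["ssh", "-o", "BatchMode=yes", host, "ompi_info", "-V"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "ERROR: ssh timed out"
    if result.returncode != 0:
        return "ERROR: " + result.stderr.strip()
    return result.stdout.strip()


def group_by_output(outputs):
    """Map each distinct output string to the sorted list of hosts that gave it."""
    groups = defaultdict(list)
    for host, out in outputs.items():
        groups[out].append(host)
    return {out: sorted(hosts) for out, hosts in groups.items()}


def report(hosts, query=query_version):
    """Return (consistent, groups); consistent is True when all hosts agree."""
    groups = group_by_output({host: query(host) for host in hosts})
    return len(groups) == 1, groups


if __name__ == "__main__":
    hosts = sys.argv[1:]
    if not hosts:
        print("usage: check_ompi.py HOST [HOST ...]")
    else:
        consistent, groups = report(hosts)
        for output, members in groups.items():
            print(", ".join(members) + ":")
            print("    " + output.replace("\n", "\n    "))
        print("all nodes agree" if consistent else "MISMATCH between nodes")
```

Saved as check_ompi.py (a hypothetical file name), it would be run as "python check_ompi.py node01 node02". More than one group in the output means the nodes are not resolving the same Open MPI installation.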