Everything is working properly now. I needed to reinstall Linux on one of my nodes after a botched attempt at a network install: mpirun ... hostname worked, but my application hung and gave a connect() errno 110 (ETIMEDOUT, connection timed out).
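The gap between those two results can be reproduced with a minimal MPI program that forces real message traffic between the nodes; running a non-MPI command such as hostname under mpirun never opens the inter-rank TCP connections that a real application needs. A minimal sketch of such a sanity check (the file name is arbitrary; it assumes the same myhostfile mentioned later in this thread):

  /* cross_check.c -- minimal cross-node sanity check, not the application
   * from this thread. Each rank reports its hostname, then rank 0 receives
   * a token from every other rank, which exercises the TCP paths that a
   * plain "mpirun ... hostname" never touches.
   *
   * Build and run (paths/names are placeholders):
   *   mpicc cross_check.c -o cross_check
   *   mpirun --hostfile myhostfile -np 4 ./cross_check
   */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size, len, src, token;
      char name[MPI_MAX_PROCESSOR_NAME];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(name, &len);
      printf("rank %d of %d running on %s\n", rank, size, name);

      if (rank == 0) {
          /* Receiving from every other rank forces inter-node traffic. */
          for (src = 1; src < size; src++) {
              MPI_Recv(&token, 1, MPI_INT, src, 0, MPI_COMM_WORLD, &status);
              printf("rank 0 heard from rank %d\n", token);
          }
      } else {
          MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
      }

      MPI_Finalize();
      return 0;
  }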
At this point I decided to give up and try MPICH instead. During the MPICH sanity checking there was a more verbose error message regarding the failed node, so I reinstalled the OS, reconfigured my environment variables for Open MPI, and everything is now working.

Thanks for the help and support so far,

Mark Kosmowski

On 2/7/07, Mark Kosmowski <mark.kosmow...@gmail.com> wrote:
Dear Open-MPI list:

I'm trying to run two (soon to be three) dual Opteron machines as a cluster (a network of workstations; each has its own disk and OS). I can ssh between the machines without a password. My Open MPI code compiled fine and works great as an SMP program (using both processors on one machine). However, I am not able to run my Open MPI program in parallel across the two computers.

For SMP work I use:

  mpirun -np 2 myprogram inputfile >outputfile

For cluster work I have tried:

  mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile

which does not write to the output file. I have also tried:

  mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`

which just ran serially on the initial machine.

The Open MPI executable and libraries are on the head node, NFS-shared to the slave node. Both computers can run the Open MPI application as an SMP program with no problems. When I try to run the program on both computers, I work from a directory that is NFS-shared to the other computer.

I am running openSUSE 10.2 on both machines and compiled with gcc 4.1 / ifort 9.1. I am using a gigabit network. My hostfile specifies slots=2 max-slots=2 for each computer, and the computers are identified in the hostfile by their /etc/hosts aliases.

The only config.log I found was in the directory where I built Open MPI; since everything works as SMP, I am not including that file with this initial message.

What should I be trying to do next to remedy this issue? Any help would be appreciated.

Thanks,

Mark Kosmowski
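For reference, a hostfile of the kind described above, using the slots/max-slots keywords and /etc/hosts aliases, would look something like the sketch below; node01 and node02 are placeholder aliases, not the actual hostnames from this thread:

  node01 slots=2 max-slots=2
  node02 slots=2 max-slots=2

The first cluster invocation shown above is the correct form. The backquoted variant behaves differently because the shell performs command substitution: it executes myprogram inputfile >outputfile locally before mpirun ever starts, which is consistent with the report that it just ran serially on the initial machine.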