Everything is working properly now.  I needed to reinstall Linux on
one of my nodes after a botched attempt at a network install - mpirun
... hostname worked, but my application hung and failed with connect()
errno 110 (connection timed out).

At this point I decided to give up and try MPICH instead.  During the
MPICH sanity checks there was a more verbose error message about the
failing node, so I reinstalled the OS, reconfigured my environment
variables for Open MPI, and everything is now working.

Thanks for the help and support so far,

Mark Kosmowski

On 2/7/07, Mark Kosmowski <mark.kosmow...@gmail.com> wrote:
Dear Open-MPI list:

I'm trying to run two (soon to be three) dual Opteron machines as a
cluster (a network of workstations - each has its own disk and OS).  I
can ssh between the machines without a password.  My open-mpi code
compiled fine and works great as an SMP program (using both processors
on one machine).  However, I am not able to run my open-mpi program in
parallel across the two computers.

For SMP work I use:

mpirun -np 2 myprogram inputfile >outputfile

For cluster work I have tried:

mpirun --hostfile myhostfile -np 4 myprogram inputfile >outputfile

which does not write to the output file.

I have also tried:

mpirun --hostfile myhostfile -np 4 `myprogram inputfile >outputfile`

which just ran serially on the initial machine.
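
(I suspect the backticks are the problem with that second form: the
shell treats them as command substitution and runs
myprogram inputfile >outputfile locally before mpirun ever starts,
which would explain the serial run.  The redirection presumably
belongs on the mpirun line itself, as in the first form above.)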

The open-mpi executable and libraries live on the head node and are
NFS-shared to the slave node.  Both computers can run the open-mpi
application as an SMP program with no problems.  When I try to run it
across both computers, I work in a directory that is NFS-shared to the
other computer.

I am running OpenSUSE 10.2 on both machines.  I compiled with gcc 4.1 /
ifort 9.1.

I am using a gigabit network.

My hostfile specifies slots=2 max-slots=2 for each computer.  The
computers are identified in the hostfile by their /etc/hosts aliases;
an example is below.
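
The file looks something like this (node1 and node2 are placeholders
for the actual aliases, not the real names):

node1 slots=2 max-slots=2
node2 slots=2 max-slots=2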

The only config.log I found was in the directory I used to build
open-mpi; since everything works as SMP, I am not including that file
with this initial message.

What should I be trying to do next to remedy this issue?

Any help would be appreciated.

Thanks,

Mark Kosmowski
