Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread G.O.

On 7/17/07, Bill Johnstone wrote:

Hello all.

I could really use help trying to figure out why mpirun is hanging as
detailed in my previous message yesterday, 16 July.  Since there's been
no response, please allow me to give a short summary.

-Open MPI 1.2.3 on GNU/Linux, 2.6.21 kernel, gcc 4.1.2, bash 3.2.15 is
default shell
-Open MPI installed to /usr/local, which is in non-interactive session
path
-Systems are AMD64, using ethernet as interconnect, on private IP
network

mpirun hangs whenever I invoke any process running on a remote node.
It runs a job fine if I invoke it so that it only runs on the local
node.  Ctrl+C never successfully cancels an mpirun job -- I have to use
kill -9.

I'm asking for help trying to figure out what steps have been taken by
mpirun, and how I can find out where things are getting stuck or
crashing.  What could be happening on the remote nodes?  What debugging
steps can I take?

Without MPI running, the cluster is of no use, so I would really
appreciate some help here.



   1 - Check that no firewall is blocking traffic between the nodes.
   2 - Check that every node has Open MPI installed, and that the very
same executable you are trying to run exists at the same path on every
node with the correct permissions.
   3 - Check that all nodes use the same network interface, i.e. eth0.
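
  For what it's worth, the three checks above could be scripted roughly
like this. It is only a sketch: REMOTE_SH and check_node are names made
up for illustration, it assumes passwordless ssh to each node, and it
assumes iptables and ifconfig live in /sbin on your distribution.

```shell
#!/bin/sh
# Sketch only. REMOTE_SH is a made-up variable so ssh can be swapped for rsh.
REMOTE_SH="${REMOTE_SH:-ssh}"

# check_node NODE EXE: run one quick command per check on NODE.
check_node() {
    node="$1"; exe="$2"
    echo "== $node =="
    # 1: firewall rules (listing them usually needs root)
    $REMOTE_SH "$node" "/sbin/iptables -L -n 2>&1 | head -n 20"
    # 2: is mpirun on the non-interactive PATH, and is the executable there?
    $REMOTE_SH "$node" "command -v mpirun; ls -l '$exe'"
    # 3: which interfaces does the node have?
    $REMOTE_SH "$node" "/sbin/ifconfig -a"
}
```

  Call it once per node from the head node, e.g.
check_node node4 /path/to/your/program (both arguments are placeholders),
and compare the output across nodes by eye.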

  That's all I can think of for quick checks for now. Hope it's
one of these.

  Thanks,
 gurhan








Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread G.O.

On 7/17/07, Bill Johnstone wrote:

> 2 - Check to make sure that all nodes have the openmpi installed
> and have the very same executable you are trying to run on the same
> path, have all permissions correctly.

Yes, they are all installed to /usr/local , the permissions are the
same, and if I just invoke mpirun on an individual node by logging into
it, it works.  In fact, even commands like "ssh node4 mpirun" (just to
get the mpirun help banner) work.



   How about the executable you are trying to have mpirun execute?
Does it exist on all nodes, at the same path, with the correct
permissions?
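
   One rough way to answer that is to compare checksums of the
executable across the nodes. The sketch below assumes passwordless ssh
and md5sum on every node; REMOTE_SH, exe_sums and exe_consistent are
names invented here, not anything Open MPI provides.

```shell
#!/bin/sh
# Sketch only. REMOTE_SH is a made-up variable so ssh can be swapped for rsh.
REMOTE_SH="${REMOTE_SH:-ssh}"

# exe_sums EXE NODE...: print "node checksum" for EXE on each node,
# or "node MISSING" if the file could not be read there.
exe_sums() {
    exe="$1"; shift
    for node in "$@"; do
        sum=$($REMOTE_SH "$node" "md5sum '$exe' 2>/dev/null" | cut -d' ' -f1)
        echo "$node ${sum:-MISSING}"
    done
}

# exe_consistent EXE NODE...: print the table, then succeed only if no
# node is MISSING and every node reports the same checksum.
exe_consistent() {
    sums=$(exe_sums "$@")
    echo "$sums"
    case "$sums" in *MISSING*) return 1 ;; esac
    [ $(echo "$sums" | cut -d' ' -f2 | sort -u | wc -l) -eq 1 ]
}
```

   For example, exe_consistent /path/to/your/program node1 node4 (path
and node names are placeholders) returns non-zero if any node lacks the
file or has a different build of it.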

  Thanks,
  Gurhan