Hi Jody,
I can ssh without a password between all nodes (to and from).
Almost none of these mpirun commands work. The only working case is
if nodenameX is the node from which you are running the command. I
don't know if this gives you extra diagnostic information, but if I
explicitly set the wrong prefix (using --prefix), then I get errors
from all the nodes telling me the daemon would not start. I don't get
these errors normally. It seems to me that the communication is
working okay, at least in the outwards direction (and from all
nodes). Could this be a problem with forwarding of standard output?
If I were to try a simple hello world program, is this more likely to
work, or am I just adding another layer of complexity?
Cheers,
Hugh
On 28 Apr 2009, at 15:55, jody wrote:
Hi Hugh
You're right, there is no initialization command (like lamboot) you
have to call.
I don't really know why your setup doesn't work, so I'm making some
more "blind shots":
Can you do passwordless ssh between any two of your nodes?
Does
mpirun -np 1 --host nodenameX uptime
work for every X when called from any of your nodes?
Have you tried
mpirun -np 2 --host nodename1,nodename2 uptime
(i.e. not using the host file)
Jody
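The per-node check suggested above can be scripted rather than typed once per node. What follows is only a minimal sketch, assuming a hostfile in the "nodename slots=N" form used elsewhere in this thread. The function names hostfile_hosts and probe_hosts and the DRY_RUN variable are made up for illustration and are not part of Open MPI.

```shell
#!/bin/sh
# Sketch only: hostfile_hosts, probe_hosts and DRY_RUN are hypothetical names.

# Print the node name (first field) of every hostfile entry,
# skipping blank lines and lines that start with '#'.
hostfile_hosts() {
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$1"
}

# Launch a single "uptime" process on each node in turn and report the result.
# With DRY_RUN set, only print the mpirun commands that would be run.
probe_hosts() {
    for host in $(hostfile_hosts "$1"); do
        if [ -n "${DRY_RUN:-}" ]; then
            printf 'mpirun -np 1 --host %s uptime\n' "$host"
            continue
        fi
        # Name the node before launching, so a hang shows where it stalled.
        printf '%s: ' "$host"
        if mpirun -np 1 --host "$host" uptime </dev/null >/dev/null 2>&1; then
            echo OK
        else
            echo FAIL
        fi
    done
}
```

After sourcing the file, "probe_hosts hostfile" run from each node in turn covers the "every X, from any node" part of the question, and setting DRY_RUN=1 first just lists the commands. A launch that hangs will stall the loop at the node name it printed last.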
On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
<h.j.dickin...@durham.ac.uk> wrote:
Hi Jody,
The node names are exactly the same. I wanted to avoid updating the
version because I'm not the system administrator, and it could take some
time before it gets done. If it's likely to fix the problem though I'll
try it. I'm assuming that I don't have to do something analogous to the
old "lamboot" command to initialise Open MPI on all the nodes. I've seen
no documentation anywhere that says I should.
Cheers,
Hugh
On 28 Apr 2009, at 15:28, jody wrote:
Hi Hugh
Again, just to make sure, are the hostnames in your host file
well-known?
I.e. when you say you can do
ssh nodename uptime
do you use exactly the same nodename in your host file?
(I'm trying to eliminate all non-Open-MPI error sources,
because with your setup it should basically work.)
One more point to consider is to update to Open-MPI 1.3.
I don't think your Open-MPI version is the cause of your trouble,
but there have been quite a few changes since v1.2.5.
Jody
On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
<h.j.dickin...@durham.ac.uk> wrote:
Hi Jody,
Indeed, all the nodes are running the same version of Open MPI. Perhaps I
was incorrect to describe the cluster as heterogeneous. In fact, all the
nodes run the same operating system (Scientific Linux 5.2), it's only the
hardware that's different and even then they're all i386 or i686. I'm
also attaching the output of ompi_info --all as I've seen it's suggested
in the mailing list instructions.
Cheers,
Hugh
Hi Hugh
Just to make sure:
You have installed Open-MPI on all your nodes?
Same version everywhere?
Jody
On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
<h.j.dickinson_at_[hidden]> wrote:
Hi all,
First of all let me make it perfectly clear that I'm a complete beginner
as far as MPI is concerned, so this may well be a trivial problem!
I've tried to set up Open MPI to use SSH to communicate between nodes on
a heterogeneous cluster. I've set up passwordless SSH and it seems to be
working fine. For example by hand I can do:
ssh nodename uptime
and it returns the appropriate information for each node.
I then tried running a non-MPI program on all the nodes at the same
time:
mpirun -np 10 --hostfile hostfile uptime
Where hostfile is a list of the 10 cluster node names with slots=1 after
each one, i.e.
nodename1 slots=1
nodename2 slots=1
etc...
Nothing happens! The process just seems to hang. If I interrupt the
process with Ctrl-C I get:
"
mpirun: killing job...
[gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
"
If, instead of using the hostfile, I specify on the command line the host
from which I'm running mpirun, e.g.:
mpirun -np 1 --host nodename uptime
then it works (i.e. if it doesn't need to communicate with other nodes).
Do I need to tell Open MPI it should be using SSH to communicate? If so,
how do I do this? To be honest I think it's trying to do so, because
before I set up passwordless SSH it challenged me for lots of passwords.
I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
reiterate, it's very likely that I've done something stupid, so all
suggestions are welcome.
Cheers,
Hugh
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users