One thing that I note is that you are using a fairly ancient
development version -- the development snapshots tend to change
pretty quickly (usually nightly). The version you cited is
1.2a1r10111 (which I think is about 5-6 months ago), but the current
development head is r12737.
Indeed, we've had some fairly important run-time changes over the
past 2-3 weeks. Can you update to a more recent copy and try again?
On Dec 4, 2006, at 3:44 AM, Jens Klostermann wrote:
What the system is saying is that (a) you don't have transparent ssh
authority on one or more of your nodes, and/or (b) the system was unable
to locate the Open MPI code libraries on the remote node. For the first
problem, please see the FAQ at:
http://www.open-mpi.org/faq/?category=rsh#ssh-keys
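As a rough sketch of what the FAQ describes (the FAQ is authoritative; the key type and the hostname "node01" below are examples, not from this thread), passwordless ssh with a shared $HOME can be set up like this:

```shell
# Generate a key pair with an empty passphrase (example key type: rsa).
ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ""

# Append the public key to the authorized keys. With a shared $HOME,
# this single file is seen by every node in the cluster.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Verify: this should log in and run "hostname" without any password
# prompt. "node01" is an example node name.
ssh node01 hostname
```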
Once you have that fixed, then you should check the remote nodes to
ensure that the Open MPI code libraries are available - you may need to
provide a prefix directory to mpirun to tell us where they are. Please
see the FAQ at:
http://www.open-mpi.org/faq/?category=running
for some advice in that area.
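For illustration (the install path and application name here are made-up examples, not taken from the original mails), pointing mpirun at the Open MPI installation on the remote nodes looks like:

```shell
# --prefix tells the remote Open MPI daemons where the Open MPI
# libraries and binaries live; /home/jens/openmpi is an example path.
mpirun --prefix /home/jens/openmpi -np 4 ./my_mpi_app

# Equivalently, invoking mpirun by its absolute path makes Open MPI
# infer the same prefix automatically.
/home/jens/openmpi/bin/mpirun -np 4 ./my_mpi_app
```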
Hope that helps
Ralph
I think these suggestions, (a) nontransparent ssh authority and (b)
being unable to locate the Open MPI code libraries on the remote node,
are not the problems:
(a) Passwordless ssh is set up, and all nodes see the same home directory.
(b) The Open MPI code libraries are located in my home directory, which
is visible on every node.
mpirun sometimes works with all cpus/nodes of the cluster, but sometimes
it fails with the error described below.
On 12/1/06 8:17 AM, "Jens Klostermann"
<jens.klostermann_at_[hidden]> wrote:
I 've got the same problem as described in:
http://www.open-mpi.org/community/lists/users/2006/07/1537.php
From: Chengwen Chen (chenchengwen_at_[hidden])
Date: 2006-07-04 03:53:26
The problem seems to occur randomly! It occurs more often if I use a
larger number of cpus, but almost never if I use a small number of cpus.
So far my cure for the problem is to kill and restart my application
(mpirun ...) repeatedly until the error no longer occurs and mpirun
runs.
Was the problem ever resolved? Can anybody give me a hint?
I am using an amd64 Linux (SuSE 10) cluster with an InfiniBand
connection and openmpi-1.2a1r10111.
I attach the ompi_info --param all all output; I hope it helps.
Regards Jens
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems