Re: [OMPI users] deadlock on barrier

2007-03-24 Thread Jeff Squyres

On Mar 22, 2007, at 12:18 PM, tim gunter wrote:


On 3/22/07, Jeff Squyres  wrote:
Is this a TCP-based cluster?

yes

If so, do you have multiple IP addresses on your frontend machine?
Check out these two FAQ entries to see if they help:

http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection

ok, using the internal interfaces only fixed the problem.
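
(For reference, interface selection can also be forced explicitly with MCA parameters; a minimal sketch, assuming an internal interface named eth0 -- the interface name is just a placeholder, and the tcp-selection FAQ entry above lists the exact parameters for your version:

  shell$ mpirun --mca btl_tcp_if_include eth0 -np 4 ./a.out
)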


Per the tcp-routability FAQ entry, were our heuristics incorrect  
about deciding which IP address(es) to connect to?  E.g., did all  
your interfaces have public IP addresses, or something else that went  
against our assumptions?


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Failure to launch on a remote node. SSH problem?

2007-03-24 Thread Jeff Squyres

On Mar 23, 2007, at 1:58 PM, Walker, David T. wrote:


I am presently trying to get Open MPI up and running on a small cluster
of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using
the Intel Fortran Compiler (9.1) and gcc.  When I try to launch a job on
a remote node, orted starts on the remote node but then times out.  I am
guessing that the problem is SSH related.  Any thoughts?


When I hear scenarios like this, the first thought that comes into my  
head is: firewall issues.  Open MPI requires the ability to open  
random TCP ports from all hosts used in the MPI job.  Have you  
disabled the firewalls between all machines, or specifically allowed  
those machines to open random TCP sockets between each other?
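
A quick way to test this -- a rough sketch, keeping in mind that netcat option syntax varies between versions and that port 5555 is just an arbitrary example -- is to listen on a high port on one node and try to connect to it from another:

  node03$ nc -l 5555          # start a listener on an arbitrary unprivileged port
  node01$ nc node03 5555      # try to connect from the other node

If the connection is refused or hangs, something (most likely a firewall) is blocking arbitrary TCP connections between the nodes.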



node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send()
failed with errno=57


This error message supports the above theory (that there is a  
firewall / port blocking software package in the way) -- the remote  
daemon tried to open a socket back to mpirun and failed.



[node01.local:21427] ERROR: A daemon on node node03 failed to start as
expected.


This is [effectively] mpirun reporting that it timed out while  
waiting for the remote orted to call back and say "I'm here!".


Check your firewall / port blocking settings and see if disabling /  
selectively allowing trust between your machines solves the issue.
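
(On Mac OS X systems of that vintage the built-in firewall is ipfw, so -- assuming that is the firewall in play -- the active rules can be inspected on each node with:

  node01$ sudo ipfw list

An empty rule set, or just a default "allow ip from any to any" rule, means the firewall is not the culprit.)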


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] install error

2007-03-24 Thread Jeff Squyres

Dan --

Your attachments didn't seem to make it through.  Could you re-send?

Thanks
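
In the meantime, one hedged guess: undefined references to __munmap / __mmap when linking Open MPI statically have sometimes been worked around by disabling the ptmalloc2 memory manager at configure time, e.g. by adding --without-memory-manager to your existing configure line:

  ./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared \
    --enable-static --without-memory-manager --prefix=/model/OPENMP_I

Whether that actually applies here is impossible to say without the logs, though.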


On Mar 23, 2007, at 7:01 PM, Dan Dansereau wrote:


To ALL

I am getting the following error while attempting to install Open MPI on a Linux system, as follows:

Linux utahwtm.hydropoint.com 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux




with the latest Intel compilers (version 9.1).



This is the error:



libtool: link: icc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic -pthread ../../../opal/.libs/libopen-pal.a -lnsl -lutil

../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x1d): In function `munmap':
: undefined reference to `__munmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x52): In function `opal_mem_free_ptmalloc2_munmap':
: undefined reference to `__munmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x66): In function `mmap':
: undefined reference to `__mmap'
../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x8d): In function `opal_mem_free_ptmalloc2_mmap':
: undefined reference to `__mmap'

make[2]: *** [opal_wrapper] Error 1

make[2]: Leaving directory `/home/dad/model/openmpi-1.2/opal/tools/wrappers'


make[1]: *** [all-recursive] Error 1

make[1]: Leaving directory `/home/dad/model/openmpi-1.2/opal'

make: *** [all-recursive] Error 1



the config command was

./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared --enable-static --prefix=/model/OPENMP_I




and executed with no errors



I have attached both the config.log and the compile.log



Any help or direction would greatly be appreciated.




--
Jeff Squyres
Cisco Systems




Re: [OMPI users] Problems compiling openmpi 1.2 under AIX 5.2

2007-03-24 Thread Jeff Squyres

On Mar 23, 2007, at 10:53 AM, Ricardo Fonseca wrote:

I'm having problems compiling openmpi 1.2 under AIX 5.2. Here are  
the configure parameters:


./configure  --disable-shared --enable-static \
   CC=xlc CXX=xlc++ F77=xlf FC=xlf95

To get it to work I have to make two changes:

diff -r openmpi-1.2/ompi/mpi/cxx/mpicxx.cc openmpi-1.2-aix/ompi/mpi/cxx/mpicxx.cc

34a35,38
> #undef SEEK_SET
> #undef SEEK_CUR
> #undef SEEK_END
>


Hrm.  This one is problematic in the general case; there's very weird  
mojo involved with the MPI C++ bindings to get MPI::SEEK_SET and  
friends to work properly.


Our support for MPI::SEEK_* isn't 100% correct yet (you can't use  
them in switch/case statements -- it's a long, convoluted story), so  
I don't think that I want to apply this patch to the general code base.


diff -r openmpi-1.2/orte/mca/pls/poe/pls_poe_module.c openmpi-1.2-aix/orte/mca/pls/poe/pls_poe_module.c

636a637,641
> static int pls_poe_cancel_operation(void)
> {
> return ORTE_ERR_NOT_IMPLEMENTED;
> }

This last one means that when you run Open MPI jobs through POE you get a:


[r1blade003:381130] [0,0,0] ORTE_ERROR_LOG: Not implemented in file errmgr_hnp.c at line 90
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job. Returned value Not implemented instead of ORTE_SUCCESS.


at the job end.


Gotcha.

How about this compromise: since AIX is not an officially supported  
platform, I've added a category in the FAQ about AIX and put a link  
to this mail so that AIX users can manually apply these changes to  
get a working Open MPI.
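
(For anyone following along, a sketch of applying the changes above -- "aix-fixes.diff" is a hypothetical file holding the two diffs quoted in this thread:

  shell$ cd openmpi-1.2
  shell$ patch -p1 < ../aix-fixes.diff
)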


(Our web server seems to be having some problems right now; I'll get  
the category added to the live site shortly)


Thanks for the patches!

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Failure to launch on a remote node. SSH problem?

2007-03-24 Thread David Burns
Could also be a firewall problem. Make sure all nodes in the cluster accept TCP packets from all the others.


Dave

Walker, David T. wrote:

I am presently trying to get Open MPI up and running on a small cluster
of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using
the Intel Fortran Compiler (9.1) and gcc.  When I try to launch a job on
a remote node, orted starts on the remote node but then times out.  I am
guessing that the problem is SSH related.  Any thoughts?

Thanks,

Dave

Details:  


I am using SSH, set up as outlined in the FAQ, using ssh-agent to allow
passwordless logins.  The paths for all the libraries appear to be OK.  


A simple MPI code (Hello_World_Fortran) launched on node01 will run OK
for up to four processors (all on node01).  The output is shown here.

node01 1247% mpirun --debug-daemons -hostfile machinefile -np 4 Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
Fortran version of Hello World, rank2
Rank 0 is present in Fortran version of Hello World.
Fortran version of Hello World, rank3
Fortran version of Hello World, rank1

For five processors mpirun tries to start an additional process on
node03.  Everything launches the same on node01 (four instances of
Hello_World_Fortran are launched).  On node03, orted starts, but times
out after 10 seconds and the output below is generated.   


node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send()
failed with errno=57
[node01.local:21427] ERROR: A daemon on node node03 failed to start as
expected.
[node01.local:21427] ERROR: There may be more information available from
[node01.local:21427] ERROR: the remote shell (see above).
[node01.local:21427] ERROR: The daemon exited unexpectedly with status
255.
forrtl: error (78): process killed (SIGTERM)
forrtl: error (78): process killed (SIGTERM)

Here is the ompi info:


node01 1248% ompi_info --all
                Open MPI: 1.1.2
   Open MPI SVN revision: r12073
                Open RTE: 1.1.2
   Open RTE SVN revision: r12073
                    OPAL: 1.1.2
       OPAL SVN revision: r12073
              MCA memory: darwin (MCA v1.0, API v1.0, Component v1.1.2)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
               MCA timer: darwin (MCA v1.0, API v1.0, Component v1.1.2)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
              MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA ras: xgrid (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
               MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
                MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)

Re: [OMPI users] Failure to launch on a remote node. SSH problem?

2007-03-24 Thread Mike Houston
Also make sure that /tmp is user-writable.  By default, that is where
Open MPI likes to stick some files.
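
For example, a quick sanity check -- the expected mode is the usual sticky, world-writable /tmp:

  node01$ ls -ld /tmp
  drwxrwxrwt   12 root  wheel   408 Mar 24 10:00 /tmp

(The owner, date, size, and link count are just illustrative; what matters is the drwxrwxrwt permissions.)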


-Mike

David Burns wrote:
Could also be a firewall problem. Make sure all nodes in the cluster accept TCP packets from all the others.


Dave

Walker, David T. wrote:

[original message quoted in full in David Burns' reply above -- trimmed]