Re: [OMPI users] deadlock on barrier
On Mar 22, 2007, at 12:18 PM, tim gunter wrote:

> On 3/22/07, Jeff Squyres wrote:
>> Is this a TCP-based cluster?
>
> yes
>
>> If so, do you have multiple IP addresses on your frontend machine?
>> Check out these two FAQ entries to see if they help:
>> http://www.open-mpi.org/faq/?category=tcp#tcp-routability
>> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>
> ok, using the internal interfaces only fixed the problem.

Per the tcp-routability FAQ entry, were our heuristics incorrect about deciding which IP address(es) to connect to? E.g., did all your interfaces have public IP addresses, or was there something else that went against our assumptions?

--
Jeff Squyres
Cisco Systems
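For reference, the "internal interfaces only" fix discussed in the tcp-selection FAQ entry is usually expressed as an MCA parameter on the mpirun command line. A minimal sketch, assuming the cluster-internal interface is named eth0 and the application is ./a.out (both placeholders; substitute your own):

  # Restrict MPI point-to-point traffic to the internal interface
  # (eth0 is an assumed interface name):
  mpirun --mca btl_tcp_if_include eth0 -np 4 ./a.out

  # Or exclude the loopback and public interfaces instead (eth1 assumed public):
  mpirun --mca btl_tcp_if_exclude lo,eth1 -np 4 ./a.out

The out-of-band (oob) channel has an analogous TCP parameter in some releases; "ompi_info --param oob tcp" will list what your installed version actually supports.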
Re: [OMPI users] Failure to launch on a remote node. SSH problem?
On Mar 23, 2007, at 1:58 PM, Walker, David T. wrote:

> I am presently trying to get Open MPI up and running on a small cluster of
> MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using the
> Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on a remote
> node, orted starts on the remote node but then times out. I am guessing that
> the problem is SSH related. Any thoughts?

When I hear scenarios like this, the first thought that comes into my head is: firewall issues. Open MPI requires the ability to open random TCP ports from all hosts used in the MPI job. Have you disabled the firewalls between all machines, or specifically allowed those machines to open random TCP sockets between each other?

> node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> [node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send() failed with errno=57

This error message supports the above theory (that there is a firewall / port-blocking software package in the way) -- the remote daemon tried to open a socket back to mpirun and failed.

> [node01.local:21427] ERROR: A daemon on node node03 failed to start as expected.

This is [effectively] mpirun reporting that it timed out while waiting for the remote orted to call back and say "I'm here!".

Check your firewall / port-blocking settings and see if disabling them, or selectively allowing trust between your machines, solves the issue.

--
Jeff Squyres
Cisco Systems
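As a quick sanity check of the firewall theory (not part of the original exchange), one can verify that an arbitrary unprivileged TCP port is reachable between the nodes with netcat. The port number 5000 below is arbitrary, and the exact listen syntax varies between nc implementations (some require "nc -l -p 5000"):

  # On node03, listen on an arbitrary port:
  nc -l 5000

  # From node01, try to connect; if a firewall blocks random ports,
  # this will hang or fail instead of delivering "hello":
  echo hello | nc node03 5000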
Re: [OMPI users] install error
Dan --

Your attachments didn't seem to make it through. Could you re-send? Thanks.

On Mar 23, 2007, at 7:01 PM, Dan Dansereau wrote:

> To ALL
>
> I am getting the following error while attempting to install openmpi on a
> Linux system, as follows:
>
> Linux utahwtm.hydropoint.com 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux
>
> with the Intel compilers, the latest versions of 9.1. This is the error:
>
> libtool: link: icc -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -restrict -pthread -o opal_wrapper opal_wrapper.o -Wl,--export-dynamic -pthread ../../../opal/.libs/libopen-pal.a -lnsl -lutil
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x1d): In function `munmap':
> : undefined reference to `__munmap'
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x52): In function `opal_mem_free_ptmalloc2_munmap':
> : undefined reference to `__munmap'
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x66): In function `mmap':
> : undefined reference to `__mmap'
> ../../../opal/.libs/libopen-pal.a(opal_ptmalloc2_munmap.o)(.text+0x8d): In function `opal_mem_free_ptmalloc2_mmap':
> : undefined reference to `__mmap'
> make[2]: *** [opal_wrapper] Error 1
> make[2]: Leaving directory `/home/dad/model/openmpi-1.2/opal/tools/wrappers'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/home/dad/model/openmpi-1.2/opal'
> make: *** [all-recursive] Error 1
>
> The configure command was
>
> ./configure CC=icc CXX=icpc F77=ifort FC=ifort --disable-shared --enable-static --prefix=/model/OPENMP_I
>
> and it executed with no errors. I have attached both the config.log and the
> compile.log. Any help or direction would be greatly appreciated.

--
Jeff Squyres
Cisco Systems
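For what it's worth, the undefined `__munmap'/`__mmap' symbols come from Open MPI's ptmalloc2 memory manager, and one commonly suggested workaround for this class of link failure with icc and static builds (not confirmed in this thread) is to configure without the memory manager. A sketch, reusing Dan's configure line and assuming the --without-memory-manager option is available in this 1.2 configure script:

  # Same configure invocation, but skip the ptmalloc2 memory manager
  # (--without-memory-manager is an assumed/optional flag; check ./configure --help):
  ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
      --disable-shared --enable-static \
      --without-memory-manager \
      --prefix=/model/OPENMP_I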
Re: [OMPI users] Problems compiling openmpi 1.2 under AIX 5.2
On Mar 23, 2007, at 10:53 AM, Ricardo Fonseca wrote:

> I'm having problems compiling openmpi 1.2 under AIX 5.2. Here are the
> configure parameters:
>
> ./configure --disable-shared --enable-static \
>   CC=xlc CXX=xlc++ F77=xlf FC=xlf95
>
> To get it to work I have to do 2 changes:
>
> diff -r openmpi-1.2/ompi/mpi/cxx/mpicxx.cc openmpi-1.2-aix/ompi/mpi/cxx/mpicxx.cc
> 34a35,38
> > #undef SEEK_SET
> > #undef SEEK_CUR
> > #undef SEEK_END
> >

Hrm. This one is problematic in the general case; there's very weird mojo involved with the MPI C++ bindings to get MPI::SEEK_SET and friends to work properly. Our support for MPI::SEEK_* isn't 100% correct yet (you can't use them in switch/case statements -- it's a long, convoluted story), so I don't think that I want to apply this patch to the general code base.

> diff -r openmpi-1.2/orte/mca/pls/poe/pls_poe_module.c openmpi-1.2-aix/orte/mca/pls/poe/pls_poe_module.c
> 636a637,641
> > static int pls_poe_cancel_operation(void)
> > {
> >     return ORTE_ERR_NOT_IMPLEMENTED;
> > }
>
> This last one means that when you run Open MPI jobs through POE you get a:
>
> [r1blade003:381130] [0,0,0] ORTE_ERROR_LOG: Not implemented in file errmgr_hnp.c at line 90
> -- mpirun was unable to cleanly terminate the daemons for this job. Returned value Not implemented instead of ORTE_SUCCESS.
>
> at the job end.

Gotcha. How about this compromise: since AIX is not an officially supported platform, I've added a category in the FAQ about AIX and put a link to this mail so that AIX users can manually apply these changes to get a working Open MPI. (Our web server seems to be having some problems right now; I'll get the category added to the live site shortly.)

Thanks for the patches!

--
Jeff Squyres
Cisco Systems
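For AIX users following the FAQ pointer above, the two changes can be applied as an ordinary patch rather than by hand. A minimal sketch, assuming the "diff -r ..." output is saved verbatim (header lines included) to a file named aix-1.2.diff next to the source tree (the file name is only an example):

  # Apply the two AIX changes to a pristine openmpi-1.2 tree:
  cd openmpi-1.2
  patch -p1 < ../aix-1.2.diff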
Re: [OMPI users] Failure to launch on a remote node. SSH problem?
Could also be a firewall problem. Make sure all nodes in the cluster accept TCP packets from all others.

Dave

Walker, David T. wrote:

> I am presently trying to get Open MPI up and running on a small cluster of
> MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using the
> Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on a remote
> node, orted starts on the remote node but then times out. I am guessing that
> the problem is SSH related. Any thoughts?
>
> Thanks, Dave
>
> Details: I am using SSH, set up as outlined in the FAQ, using ssh-agent to
> allow passwordless logins. The paths for all the libraries appear to be OK.
> A simple MPI code (Hello_World_Fortran) launched on node01 will run OK for
> up to four processors (all on node01). The output is shown here.
>
> node01 1247% mpirun --debug-daemons -hostfile machinefile -np 4 Hello_World_Fortran
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> Fortran version of Hello World, rank2
> Rank 0 is present in Fortran version of Hello World.
> Fortran version of Hello World, rank3
> Fortran version of Hello World, rank1
>
> For five processors, mpirun tries to start an additional process on node03.
> Everything launches the same on node01 (four instances of
> Hello_World_Fortran are launched). On node03, orted starts, but times out
> after 10 seconds and the output below is generated.
>
> node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> Calling MPI_INIT
> [node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send() failed with errno=57
> [node01.local:21427] ERROR: A daemon on node node03 failed to start as expected.
> [node01.local:21427] ERROR: There may be more information available from
> [node01.local:21427] ERROR: the remote shell (see above).
> [node01.local:21427] ERROR: The daemon exited unexpectedly with status 255.
> forrtl: error (78): process killed (SIGTERM)
> forrtl: error (78): process killed (SIGTERM)
>
> Here is the ompi info:
>
> node01 1248% ompi_info --all
> Open MPI: 1.1.2
> Open MPI SVN revision: r12073
> Open RTE: 1.1.2
> Open RTE SVN revision: r12073
> OPAL: 1.1.2
> OPAL SVN revision: r12073
> MCA memory: darwin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
> MCA timer: darwin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
> MCA ras: xgrid (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1.2)
> MCA rml: oob (MCA v1.0, API v1.0, Component v1.1.2)
Re: [OMPI users] Failure to launch on a remote node. SSH problem?
Also make sure that /tmp is user writable. By default, that is where Open MPI likes to stick some files.

-Mike

David Burns wrote:

> Could also be a firewall problem. Make sure all nodes in the cluster accept
> TCP packets from all others.
>
> Dave
>
> Walker, David T. wrote:
>
>> I am presently trying to get Open MPI up and running on a small cluster of
>> MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using the
>> Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on a remote
>> node, orted starts on the remote node but then times out. I am guessing that
>> the problem is SSH related. Any thoughts?
>>
>> Thanks, Dave
>>
>> Details: I am using SSH, set up as outlined in the FAQ, using ssh-agent to
>> allow passwordless logins. The paths for all the libraries appear to be OK.
>> A simple MPI code (Hello_World_Fortran) launched on node01 will run OK for
>> up to four processors (all on node01). The output is shown here.
>>
>> node01 1247% mpirun --debug-daemons -hostfile machinefile -np 4 Hello_World_Fortran
>> Calling MPI_INIT
>> Calling MPI_INIT
>> Calling MPI_INIT
>> Calling MPI_INIT
>> Fortran version of Hello World, rank2
>> Rank 0 is present in Fortran version of Hello World.
>> Fortran version of Hello World, rank3
>> Fortran version of Hello World, rank1
>>
>> For five processors, mpirun tries to start an additional process on node03.
>> Everything launches the same on node01 (four instances of
>> Hello_World_Fortran are launched). On node03, orted starts, but times out
>> after 10 seconds and the output below is generated.
>>
>> node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
>> Calling MPI_INIT
>> Calling MPI_INIT
>> Calling MPI_INIT
>> Calling MPI_INIT
>> [node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send() failed with errno=57
>> [node01.local:21427] ERROR: A daemon on node node03 failed to start as expected.
>> [node01.local:21427] ERROR: There may be more information available from
>> [node01.local:21427] ERROR: the remote shell (see above).
>> [node01.local:21427] ERROR: The daemon exited unexpectedly with status 255.
>> forrtl: error (78): process killed (SIGTERM)
>> forrtl: error (78): process killed (SIGTERM)
>>
>> Here is the ompi info:
>>
>> node01 1248% ompi_info --all
>> Open MPI: 1.1.2
>> Open MPI SVN revision: r12073
>> Open RTE: 1.1.2
>> Open RTE SVN revision: r12073
>> OPAL: 1.1.2
>> OPAL SVN revision: r12073
>> MCA memory: darwin (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA timer: darwin (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>> MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
>> MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: self (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA io: romio (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
>> MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ns: replica (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA ras: xgrid (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1.2)
>> MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1.2)
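Regarding Mike's note about /tmp above: a quick, non-authoritative way to verify that /tmp is writable by your user on each node (node03 is taken from the thread; the test file name is just a placeholder):

  # Should print "node03 OK" if /tmp on node03 is writable by this user:
  ssh node03 'touch /tmp/ompi_tmp_test && rm /tmp/ompi_tmp_test && echo node03 OK'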