I'm afraid I don't understand your comment about "another mpi process". Looking at your output, it would appear that there is something going on with host nexus17. In both cases, mpirun is launching a single daemon onto only one other node; the only difference is which node is being used. The "no_tree_spawn" flag did nothing, as that only applies when multiple nodes are being used.
I would check to see if there is a firewall between nexus10 and nexus17. You can also add -mca oob_base_verbose 10 to your cmd line to see whether the daemon on nexus17 is able to connect back to mpirun, and add --debug-daemons to see any error messages that daemon may be trying to report.
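For example, combining those flags with the nexus17 command you already ran would look something like this (just a sketch of your same invocation with the extra debug options added):

/opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -mca plm_base_verbose 10 -mca oob_base_verbose 10 --debug-daemons -host nexus17 ompi_info

Keep in mind that the remote daemon connects back to mpirun over TCP, typically on a dynamically chosen port, so a firewall that only permits ssh between the nodes can produce exactly this kind of silent hang.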
On Jul 15, 2014, at 3:08 AM, Ricardo Fernández-Perea <rfernandezpe...@gmail.com> wrote:

> I have tried it: if another MPI process is already running on the node, the run works:
>
> $ricardo$ /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -mca plm_base_verbose 10 -host nexus16 ompi_info
> [nexus10.nlroc:27397] mca: base: components_register: registering plm components
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component isolated
> [nexus10.nlroc:27397] mca: base: components_register: component isolated has no register or open function
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component rsh
> [nexus10.nlroc:27397] mca: base: components_register: component rsh register function successful
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component slurm
> [nexus10.nlroc:27397] mca: base: components_register: component slurm register function successful
> [nexus10.nlroc:27397] mca: base: components_open: opening plm components
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component isolated
> [nexus10.nlroc:27397] mca: base: components_open: component isolated open function successful
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component rsh
> [nexus10.nlroc:27397] mca: base: components_open: component rsh open function successful
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component slurm
> [nexus10.nlroc:27397] mca: base: components_open: component slurm open function successful
> [nexus10.nlroc:27397] mca:base:select: Auto-selecting plm components
> [nexus10.nlroc:27397] mca:base:select:( plm) Querying component [isolated]
> [nexus10.nlroc:27397] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [nexus10.nlroc:27397] mca:base:select:( plm) Querying component [rsh]
> [nexus10.nlroc:27397] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [nexus10.nlroc:27397] mca:base:select:( plm) Querying component [slurm]
> [nexus10.nlroc:27397] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [nexus10.nlroc:27397] mca:base:select:( plm) Selected component [rsh]
> [nexus10.nlroc:27397] mca: base: close: component isolated closed
> [nexus10.nlroc:27397] mca: base: close: unloading component isolated
> [nexus10.nlroc:27397] mca: base: close: component slurm closed
> [nexus10.nlroc:27397] mca: base: close: unloading component slurm
> [nexus10.nlroc:27397] [[52326,0],0] plm:base:receive update proc state command from [[52326,0],1]
> [nexus10.nlroc:27397] [[52326,0],0] plm:base:receive got update_proc_state for job [52326,1]
> [nexus16.nlroc:59687] mca: base: components_register: registering plm components
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component isolated
> [nexus16.nlroc:59687] mca: base: components_register: component isolated has no register or open function
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component rsh
> [nexus16.nlroc:59687] mca: base: components_register: component rsh register function successful
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component slurm
> [nexus16.nlroc:59687] mca: base: components_register: component slurm register function successful
> Package: Open MPI XXXX@nexus10.nlroc Distribution
> Open MPI: 1.8.1
> Open MPI repo revision: r31483
> Open MPI release date: Apr 22, 2014
> Open RTE: 1.8.1
> …
>
> but if the compute node does not already have an MPI process running on it, it hangs:
>
> /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -mca plm_base_verbose 10 -host nexus17 ompi_info
> [nexus10.nlroc:27438] mca: base: components_register: registering plm components
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component isolated
> [nexus10.nlroc:27438] mca: base: components_register: component isolated has no register or open function
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component rsh
> [nexus10.nlroc:27438] mca: base: components_register: component rsh register function successful
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component slurm
> [nexus10.nlroc:27438] mca: base: components_register: component slurm register function successful
> [nexus10.nlroc:27438] mca: base: components_open: opening plm components
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component isolated
> [nexus10.nlroc:27438] mca: base: components_open: component isolated open function successful
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component rsh
> [nexus10.nlroc:27438] mca: base: components_open: component rsh open function successful
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component slurm
> [nexus10.nlroc:27438] mca: base: components_open: component slurm open function successful
> [nexus10.nlroc:27438] mca:base:select: Auto-selecting plm components
> [nexus10.nlroc:27438] mca:base:select:( plm) Querying component [isolated]
> [nexus10.nlroc:27438] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [nexus10.nlroc:27438] mca:base:select:( plm) Querying component [rsh]
> [nexus10.nlroc:27438] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [nexus10.nlroc:27438] mca:base:select:( plm) Querying component [slurm]
> [nexus10.nlroc:27438] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> [nexus10.nlroc:27438] mca:base:select:( plm) Selected component [rsh]
> [nexus10.nlroc:27438] mca: base: close: component isolated closed
> [nexus10.nlroc:27438] mca: base: close: unloading component isolated
> [nexus10.nlroc:27438] mca: base: close: component slurm closed
> [nexus10.nlroc:27438] mca: base: close: unloading component slurm
>
> and it stops there.
>
> On Mon, Jul 14, 2014 at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Hmmm...no, it worked just fine for me. It sounds like something else is going on.
>
> Try configuring OMPI with --enable-debug, and then add -mca plm_base_verbose 10 to get a better sense of what is going on.
>
> On Jul 14, 2014, at 10:27 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I confess I haven't tested no_tree_spawn in ages, so it is quite possible it has suffered bit rot. I can try to take a look at it in a bit.
>>
>> On Jul 14, 2014, at 10:13 AM, Ricardo Fernández-Perea <rfernandezpe...@gmail.com> wrote:
>>
>>> Thank you for the fast answer.
>>>
>>> While that resolves my problem with cross-ssh authentication, a command such as
>>>
>>> /opt/openmpi/bin/mpirun --mca mtl mx --mca pml cm --mca plm_rsh_no_tree_spawn 1 -hostfile hostfile ompi_info
>>>
>>> just hangs with no output, and although there is an ssh connection, no orte program is started on the destination nodes.
>>>
>>> And while
>>>
>>> /opt/openmpi/bin/mpirun -host host18 ompi_info
>>>
>>> works,
>>>
>>> /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -host host18 ompi_info
>>>
>>> hangs. Is there some condition on the use of this parameter?
>>>
>>> Yours truly
>>>
>>> Ricardo
>>>
>>> On Mon, Jul 14, 2014 at 6:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> During the 1.7 series and for all follow-on series, OMPI changed to a mode where it launches a daemon on all allocated nodes at the startup of mpirun. This allows us to determine the hardware topology of the nodes and take that into account when mapping. You can override that behavior by either adding --novm to your cmd line (which will impact your mapping/binding options), or by specifying the hosts to use by editing your hostfile, or adding --host host1,host2 to your cmd line.
>>>
>>> The rsh launcher defaults to a tree-based pattern, thus requiring that we be able to ssh from one compute node to another. You can change that to a less scalable direct mode by adding
>>>
>>> --mca plm_rsh_no_tree_spawn 1
>>>
>>> to the cmd line.
>>>
>>> On Jul 14, 2014, at 9:21 AM, Ricardo Fernández-Perea <rfernandezpe...@gmail.com> wrote:
>>>
>>> > I'm trying to update to Open MPI 1.8.1 over ssh and Myrinet,
>>> >
>>> > running a command such as
>>> >
>>> > /opt/openmpi/bin/mpirun --verbose --mca mtl mx --mca pml cm -hostfile hostfile -np 16
>>> >
>>> > When the hostfile contains only two nodes, as in
>>> >
>>> > host1 slots=8 max-slots=8
>>> > host2 slots=8 max-slots=8
>>> >
>>> > it runs perfectly, but when the hostfile has a third node, as in
>>> >
>>> > host1 slots=8 max-slots=8
>>> > host2 slots=8 max-slots=8
>>> > host3 slots=8 max-slots=8
>>> >
>>> > it tries to establish an ssh connection between the running host1 and host3 (which should not run any process); that connection fails, hanging the job without any signal.
>>> >
>>> > my ompi_info is as follows:
>>> >
>>> > Package: Open MPI XXX Distribution
>>> > Open MPI: 1.8.1
>>> > Open MPI repo revision: r31483
>>> > Open MPI release date: Apr 22, 2014
>>> > Open RTE: 1.8.1
>>> > Open RTE repo revision: r31483
>>> > Open RTE release date: Apr 22, 2014
>>> > OPAL: 1.8.1
>>> > OPAL repo revision: r31483
>>> > OPAL release date: Apr 22, 2014
>>> > MPI API: 3.0
>>> > Ident string: 1.8.1
>>> > Prefix: /opt/openmpi
>>> > Configured architecture: x86_64-apple-darwin9.8.0
>>> > Configure host: XXXX
>>> > Configured by: XXXX
>>> > Configured on: Thu Jun 12 10:37:33 CEST 2014
>>> > Configure host: XXXX
>>> > Built by: XXXX
>>> > Built on: Thu Jun 12 11:13:16 CEST 2014
>>> > Built host: XXXX
>>> > C bindings: yes
>>> > C++ bindings: yes
>>> > Fort mpif.h: yes (single underscore)
>>> > Fort use mpi: yes (full: ignore TKR)
>>> > Fort use mpi size: deprecated-ompi-info-value
>>> > Fort use mpi_f08: yes
>>> > Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the ifort compiler, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
>>> > Fort mpi_f08 subarrays: no
>>> > Java bindings: no
>>> > Wrapper compiler rpath: unnecessary
>>> > C compiler: icc
>>> > C compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icc
>>> > C compiler family name: INTEL
>>> > C compiler version: 1110.20091130
>>> > C++ compiler: icpc
>>> > C++ compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icpc
>>> > Fort compiler: ifort
>>> > Fort compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
>>> > Fort ignore TKR: yes (!DEC$ ATTRIBUTES NO_ARG_CHECK ::)
>>> > Fort 08 assumed shape: no
>>> > Fort optional args: yes
>>> > Fort BIND(C) (all): yes
>>> > Fort ISO_C_BINDING: yes
>>> > Fort SUBROUTINE BIND(C): yes
>>> > Fort TYPE,BIND(C): yes
>>> > Fort T,BIND(C,name="a"): yes
>>> > Fort PRIVATE: yes
>>> > Fort PROTECTED: yes
>>> > Fort ABSTRACT: yes
>>> > Fort ASYNCHRONOUS: yes
>>> > Fort PROCEDURE: yes
>>> > Fort f08 using wrappers: yes
>>> > C profiling: yes
>>> > C++ profiling: yes
>>> > Fort mpif.h profiling: yes
>>> > Fort use mpi profiling: yes
>>> > Fort use mpi_f08 prof: yes
>>> > C++ exceptions: no
>>> > Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
>>> > Sparse Groups: no
>>> > Internal debug support: no
>>> > MPI interface warnings: yes
>>> > MPI parameter check: runtime
>>> > Memory profiling support: no
>>> > Memory debugging support: no
>>> > libltdl support: yes
>>> > Heterogeneous support: no
>>> > mpirun default --prefix: no
>>> > MPI I/O support: yes
>>> > MPI_WTIME support: gettimeofday
>>> > Symbol vis. support: yes
>>> > Host topology support: yes
>>> > MPI extensions:
>>> > FT Checkpoint support: no (checkpoint thread: no)
>>> > C/R Enabled Debugging: no
>>> > VampirTrace support: yes
>>> > MPI_MAX_PROCESSOR_NAME: 256
>>> > MPI_MAX_ERROR_STRING: 256
>>> > MPI_MAX_OBJECT_NAME: 64
>>> > MPI_MAX_INFO_KEY: 36
>>> > MPI_MAX_INFO_VAL: 256
>>> > MPI_MAX_PORT_NAME: 1024
>>> > MPI_MAX_DATAREP_STRING: 128