I'm afraid I don't understand your comment about "another mpi process". Looking 
at your output, it would appear that there is something going on with host 
nexus17. In both cases, mpirun is launching a single daemon onto only one other 
node; the only difference is which node is being used. The "no_tree_spawn" flag 
had no effect, as it only applies when there are multiple nodes being used.

I would check to see if there is a firewall between nexus10 and nexus17. You 
can also add -mca oob_base_verbose 10 to your cmd line to see whether the daemon 
on nexus17 is able to connect back to mpirun, and add --debug-daemons to see any 
error messages that daemon may be trying to report.
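
Something like this should show both at once (just a sketch, reusing the install 
path and host from your output; ompi_info stands in for your application):

  /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 \
      -mca plm_base_verbose 10 -mca oob_base_verbose 10 \
      --debug-daemons -host nexus17 ompi_info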


On Jul 15, 2014, at 3:08 AM, Ricardo Fernández-Perea 
<rfernandezpe...@gmail.com> wrote:

> I have tried this: if another MPI process is already running on the node, the 
> process runs.
> 
> $ricardo$ /opt/openmpi/bin/mpirun  --mca plm_rsh_no_tree_spawn 1 -mca 
> plm_base_verbose 10 -host nexus16 ompi_info
> [nexus10.nlroc:27397] mca: base: components_register: registering plm 
> components
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component 
> isolated
> [nexus10.nlroc:27397] mca: base: components_register: component isolated has 
> no register or open function
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component 
> rsh
> [nexus10.nlroc:27397] mca: base: components_register: component rsh register 
> function successful
> [nexus10.nlroc:27397] mca: base: components_register: found loaded component 
> slurm
> [nexus10.nlroc:27397] mca: base: components_register: component slurm 
> register function successful
> [nexus10.nlroc:27397] mca: base: components_open: opening plm components
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component 
> isolated
> [nexus10.nlroc:27397] mca: base: components_open: component isolated open 
> function successful
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component rsh
> [nexus10.nlroc:27397] mca: base: components_open: component rsh open function 
> successful
> [nexus10.nlroc:27397] mca: base: components_open: found loaded component slurm
> [nexus10.nlroc:27397] mca: base: components_open: component slurm open 
> function successful
> [nexus10.nlroc:27397] mca:base:select: Auto-selecting plm components
> [nexus10.nlroc:27397] mca:base:select:(  plm) Querying component [isolated]
> [nexus10.nlroc:27397] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [nexus10.nlroc:27397] mca:base:select:(  plm) Querying component [rsh]
> [nexus10.nlroc:27397] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [nexus10.nlroc:27397] mca:base:select:(  plm) Querying component [slurm]
> [nexus10.nlroc:27397] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [nexus10.nlroc:27397] mca:base:select:(  plm) Selected component [rsh]
> [nexus10.nlroc:27397] mca: base: close: component isolated closed
> [nexus10.nlroc:27397] mca: base: close: unloading component isolated
> [nexus10.nlroc:27397] mca: base: close: component slurm closed
> [nexus10.nlroc:27397] mca: base: close: unloading component slurm
> [nexus10.nlroc:27397] [[52326,0],0] plm:base:receive update proc state 
> command from [[52326,0],1]
> [nexus10.nlroc:27397] [[52326,0],0] plm:base:receive got update_proc_state 
> for job [52326,1]
> [nexus16.nlroc:59687] mca: base: components_register: registering plm 
> components
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component 
> isolated
> [nexus16.nlroc:59687] mca: base: components_register: component isolated has 
> no register or open function
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component 
> rsh
> [nexus16.nlroc:59687] mca: base: components_register: component rsh register 
> function successful
> [nexus16.nlroc:59687] mca: base: components_register: found loaded component 
> slurm
> [nexus16.nlroc:59687] mca: base: components_register: component slurm 
> register function successful
>                  Package: Open MPI XXXX@nexus10.nlroc Distribution
>                 Open MPI: 1.8.1
>   Open MPI repo revision: r31483
>    Open MPI release date: Apr 22, 2014
>                 Open RTE: 1.8.1
> …
> 
> but if the compute node does not already have an MPI process running on it, it 
> hangs, as in:
> 
> /opt/openmpi/bin/mpirun  --mca plm_rsh_no_tree_spawn 1 -mca plm_base_verbose 
> 10 -host nexus17 ompi_info
> [nexus10.nlroc:27438] mca: base: components_register: registering plm 
> components
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component 
> isolated
> [nexus10.nlroc:27438] mca: base: components_register: component isolated has 
> no register or open function
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component 
> rsh
> [nexus10.nlroc:27438] mca: base: components_register: component rsh register 
> function successful
> [nexus10.nlroc:27438] mca: base: components_register: found loaded component 
> slurm
> [nexus10.nlroc:27438] mca: base: components_register: component slurm 
> register function successful
> [nexus10.nlroc:27438] mca: base: components_open: opening plm components
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component 
> isolated
> [nexus10.nlroc:27438] mca: base: components_open: component isolated open 
> function successful
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component rsh
> [nexus10.nlroc:27438] mca: base: components_open: component rsh open function 
> successful
> [nexus10.nlroc:27438] mca: base: components_open: found loaded component slurm
> [nexus10.nlroc:27438] mca: base: components_open: component slurm open 
> function successful
> [nexus10.nlroc:27438] mca:base:select: Auto-selecting plm components
> [nexus10.nlroc:27438] mca:base:select:(  plm) Querying component [isolated]
> [nexus10.nlroc:27438] mca:base:select:(  plm) Query of component [isolated] 
> set priority to 0
> [nexus10.nlroc:27438] mca:base:select:(  plm) Querying component [rsh]
> [nexus10.nlroc:27438] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [nexus10.nlroc:27438] mca:base:select:(  plm) Querying component [slurm]
> [nexus10.nlroc:27438] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [nexus10.nlroc:27438] mca:base:select:(  plm) Selected component [rsh]
> [nexus10.nlroc:27438] mca: base: close: component isolated closed
> [nexus10.nlroc:27438] mca: base: close: unloading component isolated
> [nexus10.nlroc:27438] mca: base: close: component slurm closed
> [nexus10.nlroc:27438] mca: base: close: unloading component slurm
> 
> and it stops there.
> 
> 
> 
> 
> On Mon, Jul 14, 2014 at 8:56 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Hmmm...no, it worked just fine for me. It sounds like something else is going 
> on.
> 
> Try configuring OMPI with --enable-debug, and then add -mca plm_base_verbose 
> 10 to get a better sense of what is going on.
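> For example (a sketch; --prefix is taken from your ompi_info output, and any 
> other configure options are whatever you normally use):
> 
>   ./configure --prefix=/opt/openmpi --enable-debug
>   make install
>   /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -mca plm_base_verbose 10 -hostfile hostfile ompi_info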
> 
> 
> On Jul 14, 2014, at 10:27 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> I confess I haven't tested no_tree_spawn in ages, so it is quite possible it 
>> has suffered bit rot. I can try to take a look at it in a bit.
>> 
>> 
>> On Jul 14, 2014, at 10:13 AM, Ricardo Fernández-Perea 
>> <rfernandezpe...@gmail.com> wrote:
>> 
>>> Thank you for the fast answer 
>>> 
>>> While that resolves my problem with cross-node ssh authentication, a command such as
>>> 
>>> /opt/openmpi/bin/mpirun  --mca mtl mx --mca pml cm --mca 
>>> plm_rsh_no_tree_spawn 1 -hostfile hostfile ompi_info
>>> 
>>> just hangs with no output, and although there is an ssh connection, no ORTE 
>>> program is started on the destination nodes.
>>> 
>>> and while 
>>> 
>>> /opt/openmpi/bin/mpirun  -host host18 ompi_info
>>> 
>>> works
>>> 
>>> /opt/openmpi/bin/mpirun  --mca plm_rsh_no_tree_spawn 1 -host host18 
>>> ompi_info
>>> 
>>> hangs. Is there some condition on the use of this parameter?
>>> 
>>> Yours truly
>>> 
>>> Ricardo 
>>> 
>>> 
>>> 
>>> On Mon, Jul 14, 2014 at 6:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> During the 1.7 series and for all follow-on series, OMPI changed to a mode 
>>> where it launches a daemon on all allocated nodes at the startup of mpirun. 
>>> This allows us to determine the hardware topology of the nodes and take 
>>> that into account when mapping. You can override that behavior by either 
>>> adding --novm to your cmd line (which will impact your mapping/binding 
>>> options), or by specifying the hosts to use by editing your hostfile, or 
>>> adding --host host1,host2 to your cmd line
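>>> For example (just a sketch using the names from this thread), either of:
>>> 
>>>   /opt/openmpi/bin/mpirun --novm -hostfile hostfile -np 16 ompi_info
>>>   /opt/openmpi/bin/mpirun --host host1,host2 -np 16 ompi_info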
>>> 
>>> The rsh launcher defaults to a tree-based pattern, thus requiring that we 
>>> be able to ssh from one compute node to another. You can change that to a 
>>> less scalable direct mode by adding
>>> 
>>> --mca plm_rsh_no_tree_spawn 1
>>> 
>>> to the cmd line
>>> 
>>> 
>>> On Jul 14, 2014, at 9:21 AM, Ricardo Fernández-Perea 
>>> <rfernandezpe...@gmail.com> wrote:
>>> 
>>> > I'm trying to update to Open MPI 1.8.1, using ssh and Myrinet.
>>> >
>>> > running a command as
>>> >
>>> > /opt/openmpi/bin/mpirun --verbose --mca mtl mx --mca pml cm  -hostfile 
>>> > hostfile -np 16
>>> >
>>> > when the hostfile contains only two nodes, such as
>>> >
>>> > host1 slots=8 max-slots=8
>>> > host2 slots=8 max-slots=8
>>> >
>>> > it runs perfectly, but when the hostfile has a third node, such as
>>> >
>>> >
>>> > host1 slots=8 max-slots=8
>>> > host2 slots=8 max-slots=8
>>> > host3 slots=8 max-slots=8
>>> >
>>> > it tries to establish an ssh connection between the running host1 and 
>>> > host3, which should not run any process; that connection fails, hanging 
>>> > the run without reporting any error.
>>> >
>>> >
>>> > my ompi_info output is as follows:
>>> >
>>> >                 Package: Open MPI XXX Distribution
>>> >                 Open MPI: 1.8.1
>>> >   Open MPI repo revision: r31483
>>> >    Open MPI release date: Apr 22, 2014
>>> >                 Open RTE: 1.8.1
>>> >   Open RTE repo revision: r31483
>>> >    Open RTE release date: Apr 22, 2014
>>> >                     OPAL: 1.8.1
>>> >       OPAL repo revision: r31483
>>> >        OPAL release date: Apr 22, 2014
>>> >                  MPI API: 3.0
>>> >             Ident string: 1.8.1
>>> >                   Prefix: /opt/openmpi
>>> >  Configured architecture: x86_64-apple-darwin9.8.0
>>> >           Configure host: XXXX
>>> >            Configured by: XXXX
>>> >            Configured on: Thu Jun 12 10:37:33 CEST 2014
>>> >           Configure host: XXXX
>>> >                 Built by: XXXX
>>> >                 Built on: Thu Jun 12 11:13:16 CEST 2014
>>> >               Built host: XXXX
>>> >               C bindings: yes
>>> >             C++ bindings: yes
>>> >              Fort mpif.h: yes (single underscore)
>>> >             Fort use mpi: yes (full: ignore TKR)
>>> >        Fort use mpi size: deprecated-ompi-info-value
>>> >         Fort use mpi_f08: yes
>>> >  Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
>>> >                           limitations in the ifort compiler, does not 
>>> > support
>>> >                           the following: array subsections, direct 
>>> > passthru
>>> >                           (where possible) to underlying Open MPI's C
>>> >                           functionality
>>> >   Fort mpi_f08 subarrays: no
>>> >            Java bindings: no
>>> >   Wrapper compiler rpath: unnecessary
>>> >               C compiler: icc
>>> >      C compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icc
>>> >   C compiler family name: INTEL
>>> >       C compiler version: 1110.20091130
>>> >             C++ compiler: icpc
>>> >    C++ compiler absolute: /opt/intel/Compiler/11.1/080/bin/intel64/icpc
>>> >            Fort compiler: ifort
>>> >        Fort compiler abs: /opt/intel/Compiler/11.1/080/bin/intel64/ifort
>>> >          Fort ignore TKR: yes (!DEC$ ATTRIBUTES NO_ARG_CHECK ::)
>>> >    Fort 08 assumed shape: no
>>> >       Fort optional args: yes
>>> >       Fort BIND(C) (all): yes
>>> >       Fort ISO_C_BINDING: yes
>>> >  Fort SUBROUTINE BIND(C): yes
>>> >        Fort TYPE,BIND(C): yes
>>> >  Fort T,BIND(C,name="a"): yes
>>> >             Fort PRIVATE: yes
>>> >           Fort PROTECTED: yes
>>> >            Fort ABSTRACT: yes
>>> >        Fort ASYNCHRONOUS: yes
>>> >           Fort PROCEDURE: yes
>>> >  Fort f08 using wrappers: yes
>>> >              C profiling: yes
>>> >            C++ profiling: yes
>>> >    Fort mpif.h profiling: yes
>>> >   Fort use mpi profiling: yes
>>> >    Fort use mpi_f08 prof: yes
>>> >           C++ exceptions: no
>>> >           Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: 
>>> > yes,
>>> >                           OMPI progress: no, ORTE progress: yes, Event 
>>> > lib:
>>> >                           yes)
>>> >            Sparse Groups: no
>>> >   Internal debug support: no
>>> >   MPI interface warnings: yes
>>> >      MPI parameter check: runtime
>>> > Memory profiling support: no
>>> > Memory debugging support: no
>>> >          libltdl support: yes
>>> >    Heterogeneous support: no
>>> >  mpirun default --prefix: no
>>> >          MPI I/O support: yes
>>> >        MPI_WTIME support: gettimeofday
>>> >      Symbol vis. support: yes
>>> >    Host topology support: yes
>>> >           MPI extensions:
>>> >    FT Checkpoint support: no (checkpoint thread: no)
>>> >    C/R Enabled Debugging: no
>>> >      VampirTrace support: yes
>>> >   MPI_MAX_PROCESSOR_NAME: 256
>>> >     MPI_MAX_ERROR_STRING: 256
>>> >      MPI_MAX_OBJECT_NAME: 64
>>> >         MPI_MAX_INFO_KEY: 36
>>> >         MPI_MAX_INFO_VAL: 256
>>> >        MPI_MAX_PORT_NAME: 1024
>>> >   MPI_MAX_DATAREP_STRING: 128
>>> >
>>> >
>> 
> 
> 
