I'm also puzzled by your timing statement - I can't replicate it:

07:41:43 $ time mpirun -n 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer copy, 125)
real    0m0.547s
user    0m0.043s
sys     0m0.046s

The entire thing ran in 0.5 seconds.

On Aug 22, 2014, at 6:33 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > Hi, > The default delimiter is ";". You can change the delimiter with > mca_base_env_list_delimiter. > > > > On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov <tismagi...@mail.ru> wrote: > Hello! > If I use the latest nightly snapshot: > $ ompi_info -V > Open MPI v1.9a1r32570 > > In the program hello_c, initialization takes ~1 min > In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less) > If I use > $mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' > --map-by slot:pe=8 -np 1 ./hello_c > I get the error > config_parser.c:657 MXM ERROR Invalid value for SHM_KCOPY_MODE: > 'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect] > but with -x everything works fine (though with a warning) > $mpirun -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c > WARNING: The mechanism by which environment variables are explicitly > .............. > .............. > .............. > Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI > semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug > 21, 2014 (nightly snapshot tarball), 146) > > > Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain <r...@open-mpi.org>: > Not sure I understand. The problem has been fixed in both the trunk and the > 1.8 branch now, so you should be able to work with either of those nightly > builds. > > On Aug 21, 2014, at 12:02 AM, Timur Ismagilov <tismagi...@mail.ru> wrote: > >> Do I have any way to run MPI jobs? >> >> >> Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain <r...@open-mpi.org>: >> Yes, I know - it is CMR'd >> >> On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: >> >>> BTW, we get the same error in the v1.8 branch as well. >>> >>> >>> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain <r...@open-mpi.org> wrote: >>> It was not yet fixed - but should be now. >>> >>> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov <tismagi...@mail.ru> wrote: >>> >>>> Hello! >>>> >>>> As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have >>>> the problem. >>>> >>>> a) >>>> $ mpirun -np 1 ./hello_c >>>> >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). >>>> -------------------------------------------------------------------------- >>>> >>>> b) >>>> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). 
>>>> -------------------------------------------------------------------------- >>>> >>>> c) >>>> >>>> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca >>>> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 >>>> ./hello_c >>>> >>>> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated] >>>> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] >>>> set priority to 0 >>>> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh] >>>> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set >>>> priority to 10 >>>> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm] >>>> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set >>>> priority to 75 >>>> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm] >>>> [compiler-2:14673] mca: base: components_register: registering oob >>>> components >>>> [compiler-2:14673] mca: base: components_register: found loaded component >>>> tcp >>>> [compiler-2:14673] mca: base: components_register: component tcp register >>>> function successful >>>> [compiler-2:14673] mca: base: components_open: opening oob components >>>> [compiler-2:14673] mca: base: components_open: found loaded component tcp >>>> [compiler-2:14673] mca: base: components_open: component tcp open function >>>> successful >>>> [compiler-2:14673] mca:oob:select: checking available component tcp >>>> [compiler-2:14673] mca:oob:select: Querying component [tcp] >>>> [compiler-2:14673] oob:tcp: component_available called >>>> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our >>>> list of V4 connections >>>> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>> [compiler-2:14673] [[49095,0],0] TCP STARTUP >>>> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 >>>> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 >>>> [compiler-2:14673] mca:oob:select: Adding component to end >>>> [compiler-2:14673] mca:oob:select: Found 1 active transports >>>> [compiler-2:14673] mca: base: components_register: registering rml >>>> components >>>> [compiler-2:14673] mca: base: components_register: found loaded component >>>> oob >>>> [compiler-2:14673] mca: base: components_register: component oob has no >>>> register or open function >>>> [compiler-2:14673] mca: base: components_open: opening rml components >>>> [compiler-2:14673] mca: base: components_open: found loaded component oob >>>> [compiler-2:14673] mca: base: components_open: component oob open function >>>> successful >>>> [compiler-2:14673] orte_rml_base_select: initializing rml component oob >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting 
recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:14673] [[49095,0],0] posting recv >>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for >>>> peer [[WILDCARD],WILDCARD] >>>> Daemon was launched on node1-128-01 - beginning to initialize >>>> -------------------------------------------------------------------------- >>>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>> value will be ignored. >>>> >>>> Local host: node1-128-01 >>>> Value: "ib0" >>>> Message: Invalid specification (missing "/") >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> None of the TCP networks specified to be included for out-of-band >>>> communications >>>> could be found: >>>> >>>> Value given: >>>> >>>> Please revise the specification and try again. >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> No network interfaces were found for out-of-band communications. We require >>>> at least one available network for out-of-band messaging. >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> It looks like orte_init failed for some reason; your parallel process is >>>> likely to abort. There are many reasons that a parallel process can >>>> fail during orte_init; some of which are due to configuration or >>>> environment problems. 
This failure appears to be an internal failure; >>>> here's some additional information (which may only be relevant to an >>>> Open MPI developer): >>>> >>>> orte_oob_base_select failed >>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>> -------------------------------------------------------------------------- >>>> srun: error: node1-128-01: task 0: Exited with exit code 213 >>>> srun: Terminating job step 661215.0 >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). >>>> -------------------------------------------------------------------------- >>>> [compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd >>>> [compiler-2:14673] mca: base: close: component oob closed >>>> [compiler-2:14673] mca: base: close: unloading component oob >>>> [compiler-2:14673] [[49095,0],0] TCP SHUTDOWN >>>> [compiler-2:14673] mca: base: close: component tcp closed >>>> [compiler-2:14673] mca: base: close: unloading component tcp >>>> >>>> >>>> >>>> >>>> Tue, 12 Aug 2014 18:33:24 +0000 от "Jeff Squyres (jsquyres)" >>>> <jsquy...@cisco.com>: >>>> I filed the following ticket: >>>> >>>> https://svn.open-mpi.org/trac/ompi/ticket/4857 >>>> >>>> >>>> On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> >>>> wrote: >>>> >>>> > (please keep the users list CC'ed) >>>> > >>>> > We talked about this on the weekly engineering call today. Ralph has an >>>> > idea what is happening -- I need to do a little investigation today and >>>> > file a bug. I'll make sure you're CC'ed on the bug ticket. >>>> > >>>> > >>>> > >>>> > On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote: >>>> > >>>> >> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca >>>> >> oob_tcp_if_include ib0), but in all latest night snapshots i got this >>>> >> error. >>>> >> >>>> >> >>>> >> Tue, 12 Aug 2014 13:08:12 +0000 от "Jeff Squyres (jsquyres)" >>>> >> <jsquy...@cisco.com>: >>>> >> Are you running any kind of firewall on the node where mpirun is >>>> >> invoked? Open MPI needs to be able to use arbitrary TCP ports between >>>> >> the servers on which it runs. >>>> >> >>>> >> This second mail seems to imply a bug in OMPI's oob_tcp_if_include >>>> >> param handling, however -- it's supposed to be able to handle an >>>> >> interface name (not just a network specification). >>>> >> >>>> >> Ralph -- can you have a look? 
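A note on the oob_tcp_if_include warning quoted above and below: these nightlies are rejecting a plain interface name and complaining about a missing "/", i.e. the parser currently only accepts a CIDR-style subnet. Until interface-name handling is restored, a possible workaround (untested here; the subnet below is only a guess based on the 10.128.0.4 address that oob:tcp reports on this cluster - substitute the actual ib0 subnet) would be:

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c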
>>>> >> >>>> >> >>>> >> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote: >>>> >> >>>> >>> When i add --mca oob_tcp_if_include ib0 (infiniband interface) to >>>> >>> mpirun (as it was here: >>>> >>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) i >>>> >>> got this output: >>>> >>> >>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated] >>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component >>>> >>> [isolated] set priority to 0 >>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh] >>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set >>>> >>> priority to 10 >>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm] >>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] >>>> >>> set priority to 75 >>>> >>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm] >>>> >>> [compiler-2:08792] mca: base: components_register: registering oob >>>> >>> components >>>> >>> [compiler-2:08792] mca: base: components_register: found loaded >>>> >>> component tcp >>>> >>> [compiler-2:08792] mca: base: components_register: component tcp >>>> >>> register function successful >>>> >>> [compiler-2:08792] mca: base: components_open: opening oob components >>>> >>> [compiler-2:08792] mca: base: components_open: found loaded component >>>> >>> tcp >>>> >>> [compiler-2:08792] mca: base: components_open: component tcp open >>>> >>> function successful >>>> >>> [compiler-2:08792] mca:oob:select: checking available component tcp >>>> >>> [compiler-2:08792] mca:oob:select: Querying component [tcp] >>>> >>> [compiler-2:08792] oob:tcp: component_available called >>>> >>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>> >>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>> >>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>> >>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>> >>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>> >>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our >>>> >>> list of V4 connections >>>> >>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>> >>> [compiler-2:08792] [[42190,0],0] TCP STARTUP >>>> >>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0 >>>> >>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 >>>> >>> [compiler-2:08792] mca:oob:select: Adding component to end >>>> >>> [compiler-2:08792] mca:oob:select: Found 1 active transports >>>> >>> [compiler-2:08792] mca: base: components_register: registering rml >>>> >>> components >>>> >>> [compiler-2:08792] mca: base: components_register: found loaded >>>> >>> component oob >>>> >>> [compiler-2:08792] mca: base: components_register: component oob has >>>> >>> no register or open function >>>> >>> [compiler-2:08792] mca: base: components_open: opening rml components >>>> >>> [compiler-2:08792] mca: base: components_open: found loaded component >>>> >>> oob >>>> >>> [compiler-2:08792] mca: base: components_open: component oob open >>>> >>> function successful >>>> >>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] 
[[42190,0],0] posting persistent recv on tag 15 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08792] [[42190,0],0] posting recv >>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> Daemon was launched on node1-128-01 - beginning to initialize >>>> >>> Daemon was launched on node1-128-02 - beginning to initialize >>>> >>> -------------------------------------------------------------------------- >>>> >>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>> >>> value will be ignored. >>>> >>> >>>> >>> Local host: node1-128-01 >>>> >>> Value: "ib0" >>>> >>> Message: Invalid specification (missing "/") >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>> >>> value will be ignored. 
>>>> >>> >>>> >>> Local host: node1-128-02 >>>> >>> Value: "ib0" >>>> >>> Message: Invalid specification (missing "/") >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> None of the TCP networks specified to be included for out-of-band >>>> >>> communications >>>> >>> could be found: >>>> >>> >>>> >>> Value given: >>>> >>> >>>> >>> Please revise the specification and try again. >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> None of the TCP networks specified to be included for out-of-band >>>> >>> communications >>>> >>> could be found: >>>> >>> >>>> >>> Value given: >>>> >>> >>>> >>> Please revise the specification and try again. >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> No network interfaces were found for out-of-band communications. We >>>> >>> require >>>> >>> at least one available network for out-of-band messaging. >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> No network interfaces were found for out-of-band communications. We >>>> >>> require >>>> >>> at least one available network for out-of-band messaging. >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> It looks like orte_init failed for some reason; your parallel process >>>> >>> is >>>> >>> likely to abort. There are many reasons that a parallel process can >>>> >>> fail during orte_init; some of which are due to configuration or >>>> >>> environment problems. This failure appears to be an internal failure; >>>> >>> here's some additional information (which may only be relevant to an >>>> >>> Open MPI developer): >>>> >>> >>>> >>> orte_oob_base_select failed >>>> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>> >>> -------------------------------------------------------------------------- >>>> >>> -------------------------------------------------------------------------- >>>> >>> It looks like orte_init failed for some reason; your parallel process >>>> >>> is >>>> >>> likely to abort. There are many reasons that a parallel process can >>>> >>> fail during orte_init; some of which are due to configuration or >>>> >>> environment problems. This failure appears to be an internal failure; >>>> >>> here's some additional information (which may only be relevant to an >>>> >>> Open MPI developer): >>>> >>> >>>> >>> orte_oob_base_select failed >>>> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>> >>> -------------------------------------------------------------------------- >>>> >>> srun: error: node1-128-02: task 1: Exited with exit code 213 >>>> >>> srun: Terminating job step 657300.0 >>>> >>> srun: error: node1-128-01: task 0: Exited with exit code 213 >>>> >>> -------------------------------------------------------------------------- >>>> >>> An ORTE daemon has unexpectedly failed after launch and before >>>> >>> communicating back to mpirun. 
This could be caused by a number >>>> >>> of factors, including an inability to create a connection back >>>> >>> to mpirun due to a lack of common network interfaces and/or no >>>> >>> route found between them. Please check network connectivity >>>> >>> (including firewalls and network routing requirements). >>>> >>> -------------------------------------------------------------------------- >>>> >>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd >>>> >>> [compiler-2:08792] mca: base: close: component oob closed >>>> >>> [compiler-2:08792] mca: base: close: unloading component oob >>>> >>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN >>>> >>> [compiler-2:08792] mca: base: close: component tcp closed >>>> >>> [compiler-2:08792] mca: base: close: unloading component tcp >>>> >>> >>>> >>> >>>> >>> >>>> >>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov >>>> >>> <tismagi...@mail.ru>: >>>> >>> Hello! >>>> >>> >>>> >>> I have Open MPI v1.8.2rc4r32485 >>>> >>> >>>> >>> When i run hello_c, I got this error message >>>> >>> $mpirun -np 2 hello_c >>>> >>> >>>> >>> An ORTE daemon has unexpectedly failed after launch and before >>>> >>> >>>> >>> communicating back to mpirun. This could be caused by a number >>>> >>> of factors, including an inability to create a connection back >>>> >>> to mpirun due to a lack of common network interfaces and/or no >>>> >>> route found between them. Please check network connectivity >>>> >>> (including firewalls and network routing requirements). >>>> >>> >>>> >>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca >>>> >>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this output: >>>> >>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose >>>> >>> 10 -mca rml_base_verbose 10 -np 2 hello_c >>>> >>> >>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated] >>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component >>>> >>> [isolated] set priority to 0 >>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh] >>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set >>>> >>> priority to 10 >>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm] >>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] >>>> >>> set priority to 75 >>>> >>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm] >>>> >>> [compiler-2:08780] mca: base: components_register: registering oob >>>> >>> components >>>> >>> [compiler-2:08780] mca: base: components_register: found loaded >>>> >>> component tcp >>>> >>> [compiler-2:08780] mca: base: components_register: component tcp >>>> >>> register function successful >>>> >>> [compiler-2:08780] mca: base: components_open: opening oob components >>>> >>> [compiler-2:08780] mca: base: components_open: found loaded component >>>> >>> tcp >>>> >>> [compiler-2:08780] mca: base: components_open: component tcp open >>>> >>> function successful >>>> >>> [compiler-2:08780] mca:oob:select: checking available component tcp >>>> >>> [compiler-2:08780] mca:oob:select: Querying component [tcp] >>>> >>> [compiler-2:08780] oob:tcp: component_available called >>>> >>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>> >>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to >>>> >>> our list of V4 connections >>>> >>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 
FAMILY: V4 >>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our >>>> >>> list of V4 connections >>>> >>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to >>>> >>> our list of V4 connections >>>> >>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our >>>> >>> list of V4 connections >>>> >>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to >>>> >>> our list of V4 connections >>>> >>> [compiler-2:08780] [[42202,0],0] TCP STARTUP >>>> >>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0 >>>> >>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 >>>> >>> [compiler-2:08780] mca:oob:select: Adding component to end >>>> >>> [compiler-2:08780] mca:oob:select: Found 1 active transports >>>> >>> [compiler-2:08780] mca: base: components_register: registering rml >>>> >>> components >>>> >>> [compiler-2:08780] mca: base: components_register: found loaded >>>> >>> component oob >>>> >>> [compiler-2:08780] mca: base: components_register: component oob has >>>> >>> no register or open function >>>> >>> [compiler-2:08780] mca: base: components_open: opening rml components >>>> >>> [compiler-2:08780] mca: base: components_open: found loaded component >>>> >>> oob >>>> >>> [compiler-2:08780] mca: base: components_open: component oob open >>>> >>> function successful >>>> >>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for >>>> >>> peer 
[[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> [compiler-2:08780] [[42202,0],0] posting recv >>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for >>>> >>> peer [[WILDCARD],WILDCARD] >>>> >>> Daemon was launched on node1-130-08 - beginning to initialize >>>> >>> Daemon was launched on node1-130-03 - beginning to initialize >>>> >>> Daemon was launched on node1-130-05 - beginning to initialize >>>> >>> Daemon was launched on node1-130-02 - beginning to initialize >>>> >>> Daemon was launched on node1-130-01 - beginning to initialize >>>> >>> Daemon was launched on node1-130-04 - beginning to initialize >>>> >>> Daemon was launched on node1-130-07 - beginning to initialize >>>> >>> Daemon was launched on node1-130-06 - beginning to initialize >>>> >>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03 >>>> >>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02 >>>> >>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01 >>>> >>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05 >>>> >>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08 >>>> >>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07 >>>> >>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04 >>>> >>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for >>>> >>> commands! >>>> >>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06 >>>> >>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for >>>> >>> commands! >>>> >>> srun: error: node1-130-03: task 2: Exited with exit code 1 >>>> >>> srun: Terminating job step 657040.1 >>>> >>> srun: error: node1-130-02: task 1: Exited with exit code 1 >>>> >>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>> >>> WITH SIGNAL 9 *** >>>> >>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>> >>> WITH SIGNAL 9 *** >>>> >>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>> >>> WITH SIGNAL 9 *** >>>> >>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
>>>> >>> srun: error: node1-130-01: task 0: Exited with exit code 1 >>>> >>> srun: error: node1-130-05: task 4: Exited with exit code 1 >>>> >>> srun: error: node1-130-08: task 7: Exited with exit code 1 >>>> >>> srun: error: node1-130-07: task 6: Exited with exit code 1 >>>> >>> srun: error: node1-130-04: task 3: Killed >>>> >>> srun: error: node1-130-06: task 5: Killed >>>> >>> -------------------------------------------------------------------------- >>>> >>> An ORTE daemon has unexpectedly failed after launch and before >>>> >>> communicating back to mpirun. This could be caused by a number >>>> >>> of factors, including an inability to create a connection back >>>> >>> to mpirun due to a lack of common network interfaces and/or no >>>> >>> route found between them. Please check network connectivity >>>> >>> (including firewalls and network routing requirements). >>>> >>> -------------------------------------------------------------------------- >>>> >>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd >>>> >>> [compiler-2:08780] mca: base: close: component oob closed >>>> >>> [compiler-2:08780] mca: base: close: unloading component oob >>>> >>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN >>>> >>> [compiler-2:08780] mca: base: close: component tcp closed >>>> >>> [compiler-2:08780] mca: base: close: unloading component tcp
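Coming back to the mca_base_env_list error at the top of the thread: with ";" as the default delimiter, the whole string 'off,OMP_NUM_THREADS=8' is handed to MXM as the value of SHM_KCOPY_MODE, which is exactly what the MXM error shows. Two untested sketches of how the command could be written instead - either use the default ";" delimiter, or keep the comma and change the delimiter as Mike suggests (assuming mca_base_env_list_delimiter can be set like any other MCA parameter):

$ mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off;OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 ./hello_c

$ mpirun --mca mca_base_env_list_delimiter ',' --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 ./hello_c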