It was not yet fixed, but it should be now.

On Aug 20, 2014, at 6:39 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
> Hello!
>
> As I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have
> the problem.
>
> a)
> $ mpirun -np 1 ./hello_c
>
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> c)
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob components
> [compiler-2:14673] mca: base: components_register: found loaded component tcp
> [compiler-2:14673] mca: base: components_register: component tcp register function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: component_available called
> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] TCP STARTUP
> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
> [compiler-2:14673] mca:oob:select: Adding component to end
> [compiler-2:14673] mca:oob:select: Found 1 active transports
> [compiler-2:14673] mca: base: components_register: registering rml components
> [compiler-2:14673] mca: base: components_register: found loaded component oob
> [compiler-2:14673] mca: base: components_register: component oob has no register or open function
> [compiler-2:14673] mca: base: components_open: opening rml components
> [compiler-2:14673] mca: base: components_open: found loaded component oob
> [compiler-2:14673] mca: base: components_open: component oob open function successful
> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> Daemon was launched on node1-128-01 - beginning to initialize
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for oob_tcp_if_include. This
> value will be ignored.
>
> Local host: node1-128-01
> Value: "ib0"
> Message: Invalid specification (missing "/")
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> None of the TCP networks specified to be included for out-of-band communications
> could be found:
>
> Value given:
>
> Please revise the specification and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> No network interfaces were found for out-of-band communications. We require
> at least one available network for out-of-band messaging.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> orte_oob_base_select failed
> --> Returned value (null) (-43) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> srun: error: node1-128-01: task 0: Exited with exit code 213
> srun: Terminating job step 661215.0
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
> [compiler-2:14673] mca: base: close: component oob closed
> [compiler-2:14673] mca: base: close: unloading component oob
> [compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
> [compiler-2:14673] mca: base: close: component tcp closed
> [compiler-2:14673] mca: base: close: unloading component tcp
>
>
> Tue, 12 Aug 2014 18:33:24 +0000, from "Jeff Squyres (jsquyres)"
> <jsquy...@cisco.com>:
> I filed the following ticket:
>
> https://svn.open-mpi.org/trac/ompi/ticket/4857
>
>
> On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
>
> > (please keep the users list CC'ed)
> >
> > We talked about this on the weekly engineering call today. Ralph has an
> > idea what is happening -- I need to do a little investigation today and
> > file a bug. I'll make sure you're CC'ed on the bug ticket.
> >
> >
> > On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
> >
> >> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca
> >> oob_tcp_if_include ib0), but in all of the latest nightly snapshots I got
> >> this error.
> >>
> >>
> >> Tue, 12 Aug 2014 13:08:12 +0000, from "Jeff Squyres (jsquyres)"
> >> <jsquy...@cisco.com>:
> >> Are you running any kind of firewall on the node where mpirun is invoked?
> >> Open MPI needs to be able to use arbitrary TCP ports between the servers
> >> on which it runs.
> >>
> >> This second mail seems to imply a bug in OMPI's oob_tcp_if_include param
> >> handling, however -- it's supposed to be able to handle an interface name
> >> (not just a network specification).
> >>
> >> Ralph -- can you have a look?
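
Until you pick up a build with the interface-name handling fixed, one possible workaround is to pass the network in the CIDR form that the daemon-side check is currently insisting on (the warning complains about a missing "/"), instead of the interface name. Something like the following should get past that check -- assuming ib0 is the interface carrying the 10.128.0.4 address shown in your verbose output and that it sits on a /16 network; substitute your actual subnet and prefix length:

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c

This is only a sketch of a workaround, not the fix itself.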
> >> > >> > >> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote: > >> > >>> When i add --mca oob_tcp_if_include ib0 (infiniband interface) to mpirun > >>> (as it was here: > >>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) i got > >>> this output: > >>> > >>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated] > >>> [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] > >>> set priority to 0 > >>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh] > >>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set > >>> priority to 10 > >>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm] > >>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set > >>> priority to 75 > >>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm] > >>> [compiler-2:08792] mca: base: components_register: registering oob > >>> components > >>> [compiler-2:08792] mca: base: components_register: found loaded component > >>> tcp > >>> [compiler-2:08792] mca: base: components_register: component tcp register > >>> function successful > >>> [compiler-2:08792] mca: base: components_open: opening oob components > >>> [compiler-2:08792] mca: base: components_open: found loaded component tcp > >>> [compiler-2:08792] mca: base: components_open: component tcp open > >>> function successful > >>> [compiler-2:08792] mca:oob:select: checking available component tcp > >>> [compiler-2:08792] mca:oob:select: Querying component [tcp] > >>> [compiler-2:08792] oob:tcp: component_available called > >>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 > >>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 > >>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 > >>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 > >>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our > >>> list of V4 connections > >>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 > >>> [compiler-2:08792] [[42190,0],0] TCP STARTUP > >>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0 > >>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 > >>> [compiler-2:08792] mca:oob:select: Adding component to end > >>> [compiler-2:08792] mca:oob:select: Found 1 active transports > >>> [compiler-2:08792] mca: base: components_register: registering rml > >>> components > >>> [compiler-2:08792] mca: base: components_register: found loaded component > >>> oob > >>> [compiler-2:08792] mca: base: components_register: component oob has no > >>> register or open function > >>> [compiler-2:08792] mca: base: components_open: opening rml components > >>> [compiler-2:08792] mca: base: components_open: found loaded component oob > >>> [compiler-2:08792] mca: base: components_open: component oob open > >>> function successful > >>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv 
on tag 32 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08792] [[42190,0],0] posting recv > >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for > >>> peer [[WILDCARD],WILDCARD] > >>> Daemon was launched on node1-128-01 - beginning to initialize > >>> Daemon was launched on node1-128-02 - beginning to initialize > >>> -------------------------------------------------------------------------- > >>> WARNING: An invalid value was given for oob_tcp_if_include. This > >>> value will be ignored. > >>> > >>> Local host: node1-128-01 > >>> Value: "ib0" > >>> Message: Invalid specification (missing "/") > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> WARNING: An invalid value was given for oob_tcp_if_include. This > >>> value will be ignored. > >>> > >>> Local host: node1-128-02 > >>> Value: "ib0" > >>> Message: Invalid specification (missing "/") > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> None of the TCP networks specified to be included for out-of-band > >>> communications > >>> could be found: > >>> > >>> Value given: > >>> > >>> Please revise the specification and try again. 
> >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> None of the TCP networks specified to be included for out-of-band > >>> communications > >>> could be found: > >>> > >>> Value given: > >>> > >>> Please revise the specification and try again. > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> No network interfaces were found for out-of-band communications. We > >>> require > >>> at least one available network for out-of-band messaging. > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> No network interfaces were found for out-of-band communications. We > >>> require > >>> at least one available network for out-of-band messaging. > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> It looks like orte_init failed for some reason; your parallel process is > >>> likely to abort. There are many reasons that a parallel process can > >>> fail during orte_init; some of which are due to configuration or > >>> environment problems. This failure appears to be an internal failure; > >>> here's some additional information (which may only be relevant to an > >>> Open MPI developer): > >>> > >>> orte_oob_base_select failed > >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS > >>> -------------------------------------------------------------------------- > >>> -------------------------------------------------------------------------- > >>> It looks like orte_init failed for some reason; your parallel process is > >>> likely to abort. There are many reasons that a parallel process can > >>> fail during orte_init; some of which are due to configuration or > >>> environment problems. This failure appears to be an internal failure; > >>> here's some additional information (which may only be relevant to an > >>> Open MPI developer): > >>> > >>> orte_oob_base_select failed > >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS > >>> -------------------------------------------------------------------------- > >>> srun: error: node1-128-02: task 1: Exited with exit code 213 > >>> srun: Terminating job step 657300.0 > >>> srun: error: node1-128-01: task 0: Exited with exit code 213 > >>> -------------------------------------------------------------------------- > >>> An ORTE daemon has unexpectedly failed after launch and before > >>> communicating back to mpirun. This could be caused by a number > >>> of factors, including an inability to create a connection back > >>> to mpirun due to a lack of common network interfaces and/or no > >>> route found between them. Please check network connectivity > >>> (including firewalls and network routing requirements). 
> >>> -------------------------------------------------------------------------- > >>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd > >>> [compiler-2:08792] mca: base: close: component oob closed > >>> [compiler-2:08792] mca: base: close: unloading component oob > >>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN > >>> [compiler-2:08792] mca: base: close: component tcp closed > >>> [compiler-2:08792] mca: base: close: unloading component tcp > >>> > >>> > >>> > >>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov <tismagi...@mail.ru>: > >>> Hello! > >>> > >>> I have Open MPI v1.8.2rc4r32485 > >>> > >>> When i run hello_c, I got this error message > >>> $mpirun -np 2 hello_c > >>> > >>> An ORTE daemon has unexpectedly failed after launch and before > >>> > >>> communicating back to mpirun. This could be caused by a number > >>> of factors, including an inability to create a connection back > >>> to mpirun due to a lack of common network interfaces and/or no > >>> route found between them. Please check network connectivity > >>> (including firewalls and network routing requirements). > >>> > >>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca > >>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this output: > >>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 > >>> -mca rml_base_verbose 10 -np 2 hello_c > >>> > >>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated] > >>> [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] > >>> set priority to 0 > >>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh] > >>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set > >>> priority to 10 > >>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm] > >>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set > >>> priority to 75 > >>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm] > >>> [compiler-2:08780] mca: base: components_register: registering oob > >>> components > >>> [compiler-2:08780] mca: base: components_register: found loaded component > >>> tcp > >>> [compiler-2:08780] mca: base: components_register: component tcp register > >>> function successful > >>> [compiler-2:08780] mca: base: components_open: opening oob components > >>> [compiler-2:08780] mca: base: components_open: found loaded component tcp > >>> [compiler-2:08780] mca: base: components_open: component tcp open > >>> function successful > >>> [compiler-2:08780] mca:oob:select: checking available component tcp > >>> [compiler-2:08780] mca:oob:select: Querying component [tcp] > >>> [compiler-2:08780] oob:tcp: component_available called > >>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 > >>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 > >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our > >>> list of V4 connections > >>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 > >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list > >>> of V4 connections > >>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 > >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our > >>> list of V4 connections > >>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 > >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our > >>> list of V4 connections > >>> [compiler-2:08780] WORKING 
INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 > >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our > >>> list of V4 connections > >>> [compiler-2:08780] [[42202,0],0] TCP STARTUP > >>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0 > >>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 > >>> [compiler-2:08780] mca:oob:select: Adding component to end > >>> [compiler-2:08780] mca:oob:select: Found 1 active transports > >>> [compiler-2:08780] mca: base: components_register: registering rml > >>> components > >>> [compiler-2:08780] mca: base: components_register: found loaded component > >>> oob > >>> [compiler-2:08780] mca: base: components_register: component oob has no > >>> register or open function > >>> [compiler-2:08780] mca: base: components_open: opening rml components > >>> [compiler-2:08780] mca: base: components_open: found loaded component oob > >>> [compiler-2:08780] mca: base: components_open: component oob open > >>> function successful > >>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for > >>> peer [[WILDCARD],WILDCARD] > >>> [compiler-2:08780] [[42202,0],0] 
posting recv > >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for > >>> peer [[WILDCARD],WILDCARD] > >>> Daemon was launched on node1-130-08 - beginning to initialize > >>> Daemon was launched on node1-130-03 - beginning to initialize > >>> Daemon was launched on node1-130-05 - beginning to initialize > >>> Daemon was launched on node1-130-02 - beginning to initialize > >>> Daemon was launched on node1-130-01 - beginning to initialize > >>> Daemon was launched on node1-130-04 - beginning to initialize > >>> Daemon was launched on node1-130-07 - beginning to initialize > >>> Daemon was launched on node1-130-06 - beginning to initialize > >>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03 > >>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02 > >>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01 > >>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05 > >>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08 > >>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07 > >>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04 > >>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for > >>> commands! > >>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06 > >>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for > >>> commands! > >>> srun: error: node1-130-03: task 2: Exited with exit code 1 > >>> srun: Terminating job step 657040.1 > >>> srun: error: node1-130-02: task 1: Exited with exit code 1 > >>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 > >>> WITH SIGNAL 9 *** > >>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 > >>> WITH SIGNAL 9 *** > >>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 > >>> WITH SIGNAL 9 *** > >>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish. > >>> srun: error: node1-130-01: task 0: Exited with exit code 1 > >>> srun: error: node1-130-05: task 4: Exited with exit code 1 > >>> srun: error: node1-130-08: task 7: Exited with exit code 1 > >>> srun: error: node1-130-07: task 6: Exited with exit code 1 > >>> srun: error: node1-130-04: task 3: Killed > >>> srun: error: node1-130-06: task 5: Killed > >>> -------------------------------------------------------------------------- > >>> An ORTE daemon has unexpectedly failed after launch and before > >>> communicating back to mpirun. This could be caused by a number > >>> of factors, including an inability to create a connection back > >>> to mpirun due to a lack of common network interfaces and/or no > >>> route found between them. Please check network connectivity > >>> (including firewalls and network routing requirements). 
> >>> --------------------------------------------------------------------------
> >>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
> >>> [compiler-2:08780] mca: base: close: component oob closed
> >>> [compiler-2:08780] mca: base: close: unloading component oob
> >>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
> >>> [compiler-2:08780] mca: base: close: component tcp closed
> >>> [compiler-2:08780] mca: base: close: unloading component tcp
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25086.php