Hello! As far as I can see, the bug is reported as fixed, but in Open MPI v1.9a1r32516 I still have the problem; cases (a)-(c) below show what I get.
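(Side note: the warning in cases (b) and (c) below, "Invalid specification (missing "/")", suggests that the oob_tcp_if_include parser in this snapshot only accepts CIDR-style subnet specifications and no longer handles plain interface names. An untested workaround might be to give it the IPoIB subnet instead of "ib0", for example:

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c

Here 10.128.0.0/16 is only an assumption for the ib0 subnet, taken from the 10.128.0.4 address that appears in the logs below; the real prefix length should be checked with "ip addr show ib0" on the nodes.)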
a) $ mpirun -np 1 ./hello_c
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

b) $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------

c) $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0
[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10
[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75
[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
[compiler-2:14673] mca: base: components_register: registering oob components
[compiler-2:14673] mca: base: components_register: found loaded component tcp
[compiler-2:14673] mca: base: components_register: component tcp register function successful
[compiler-2:14673] mca: base: components_open: opening oob components
[compiler-2:14673] mca: base: components_open: found loaded component tcp
[compiler-2:14673] mca: base: components_open: component tcp open function successful
[compiler-2:14673] mca:oob:select: checking available component tcp
[compiler-2:14673] mca:oob:select: Querying component [tcp]
[compiler-2:14673] oob:tcp: component_available called
[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[compiler-2:14673] [[49095,0],0] TCP STARTUP
[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
[compiler-2:14673] mca:oob:select: Adding component to end
[compiler-2:14673] mca:oob:select: Found 1 active transports
[compiler-2:14673] mca: base: components_register: registering rml components
[compiler-2:14673] mca: base: components_register: found loaded component oob
[compiler-2:14673] mca: base: components_register: component oob has no register or open function
[compiler-2:14673] mca: base: components_open: opening rml components
[compiler-2:14673] mca: base: components_open: found loaded component oob
[compiler-2:14673] mca: base: components_open: component oob open function successful
[compiler-2:14673] orte_rml_base_select: initializing rml component oob
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
Daemon was launched on node1-128-01 - beginning to initialize
--------------------------------------------------------------------------
WARNING: An invalid value was given for oob_tcp_if_include.  This
value will be ignored.

  Local host: node1-128-01
  Value:      "ib0"
  Message:    Invalid specification (missing "/")
--------------------------------------------------------------------------
--------------------------------------------------------------------------
None of the TCP networks specified to be included for out-of-band communications
could be found:

  Value given:

Please revise the specification and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_oob_base_select failed
  --> Returned value (null) (-43) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
srun: error: node1-128-01: task 0: Exited with exit code 213
srun: Terminating job step 661215.0
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
[compiler-2:14673] mca: base: close: component oob closed
[compiler-2:14673] mca: base: close: unloading component oob
[compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
[compiler-2:14673] mca: base: close: component tcp closed
[compiler-2:14673] mca: base: close: unloading component tcp


Tue, 12 Aug 2014 18:33:24 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
> I filed the following ticket:
>
>     https://svn.open-mpi.org/trac/ompi/ticket/4857
>
>
> On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
>> (please keep the users list CC'ed)
>>
>> We talked about this on the weekly engineering call today. Ralph has an
>> idea what is happening -- I need to do a little investigation today and file
>> a bug. I'll make sure you're CC'ed on the bug ticket.
>>
>>
>> On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>
>>> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca
>>> oob_tcp_if_include ib0), but in all of the latest nightly snapshots I got this error.
>>>
>>>
>>> Tue, 12 Aug 2014 13:08:12 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
>>> Are you running any kind of firewall on the node where mpirun is invoked?
>>> Open MPI needs to be able to use arbitrary TCP ports between the servers on
>>> which it runs.
>>>
>>> This second mail seems to imply a bug in OMPI's oob_tcp_if_include param
>>> handling, however -- it's supposed to be able to handle an interface name
>>> (not just a network specification).
>>>
>>> Ralph -- can you have a look?
>>> >>> >>> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov < tismagi...@mail.ru > wrote: >>> >>>> When i add --mca oob_tcp_if_include ib0 (infiniband interface) to mpirun >>>> (as it was here: >>>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) i got >>>> this output: >>>> >>>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated] >>>> [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] >>>> set priority to 0 >>>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh] >>>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set >>>> priority to 10 >>>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm] >>>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set >>>> priority to 75 >>>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm] >>>> [compiler-2:08792] mca: base: components_register: registering oob >>>> components >>>> [compiler-2:08792] mca: base: components_register: found loaded component >>>> tcp >>>> [compiler-2:08792] mca: base: components_register: component tcp register >>>> function successful >>>> [compiler-2:08792] mca: base: components_open: opening oob components >>>> [compiler-2:08792] mca: base: components_open: found loaded component tcp >>>> [compiler-2:08792] mca: base: components_open: component tcp open function >>>> successful >>>> [compiler-2:08792] mca:oob:select: checking available component tcp >>>> [compiler-2:08792] mca:oob:select: Querying component [tcp] >>>> [compiler-2:08792] oob:tcp: component_available called >>>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our >>>> list of V4 connections >>>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>> [compiler-2:08792] [[42190,0],0] TCP STARTUP >>>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0 >>>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 >>>> [compiler-2:08792] mca:oob:select: Adding component to end >>>> [compiler-2:08792] mca:oob:select: Found 1 active transports >>>> [compiler-2:08792] mca: base: components_register: registering rml >>>> components >>>> [compiler-2:08792] mca: base: components_register: found loaded component >>>> oob >>>> [compiler-2:08792] mca: base: components_register: component oob has no >>>> register or open function >>>> [compiler-2:08792] mca: base: components_open: opening rml components >>>> [compiler-2:08792] mca: base: components_open: found loaded component oob >>>> [compiler-2:08792] mca: base: components_open: component oob open function >>>> successful >>>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for >>>> peer [[WILDCARD],WILDCARD] >>>> 
[compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08792] [[42190,0],0] posting recv >>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for >>>> peer [[WILDCARD],WILDCARD] >>>> Daemon was launched on node1-128-01 - beginning to initialize >>>> Daemon was launched on node1-128-02 - beginning to initialize >>>> -------------------------------------------------------------------------- >>>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>> value will be ignored. >>>> >>>> Local host: node1-128-01 >>>> Value: "ib0" >>>> Message: Invalid specification (missing "/") >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>> value will be ignored. >>>> >>>> Local host: node1-128-02 >>>> Value: "ib0" >>>> Message: Invalid specification (missing "/") >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> None of the TCP networks specified to be included for out-of-band >>>> communications >>>> could be found: >>>> >>>> Value given: >>>> >>>> Please revise the specification and try again. >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> None of the TCP networks specified to be included for out-of-band >>>> communications >>>> could be found: >>>> >>>> Value given: >>>> >>>> Please revise the specification and try again. 
>>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> No network interfaces were found for out-of-band communications. We require >>>> at least one available network for out-of-band messaging. >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> No network interfaces were found for out-of-band communications. We require >>>> at least one available network for out-of-band messaging. >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> It looks like orte_init failed for some reason; your parallel process is >>>> likely to abort. There are many reasons that a parallel process can >>>> fail during orte_init; some of which are due to configuration or >>>> environment problems. This failure appears to be an internal failure; >>>> here's some additional information (which may only be relevant to an >>>> Open MPI developer): >>>> >>>> orte_oob_base_select failed >>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>> -------------------------------------------------------------------------- >>>> -------------------------------------------------------------------------- >>>> It looks like orte_init failed for some reason; your parallel process is >>>> likely to abort. There are many reasons that a parallel process can >>>> fail during orte_init; some of which are due to configuration or >>>> environment problems. This failure appears to be an internal failure; >>>> here's some additional information (which may only be relevant to an >>>> Open MPI developer): >>>> >>>> orte_oob_base_select failed >>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>> -------------------------------------------------------------------------- >>>> srun: error: node1-128-02: task 1: Exited with exit code 213 >>>> srun: Terminating job step 657300.0 >>>> srun: error: node1-128-01: task 0: Exited with exit code 213 >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). >>>> -------------------------------------------------------------------------- >>>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd >>>> [compiler-2:08792] mca: base: close: component oob closed >>>> [compiler-2:08792] mca: base: close: unloading component oob >>>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN >>>> [compiler-2:08792] mca: base: close: component tcp closed >>>> [compiler-2:08792] mca: base: close: unloading component tcp >>>> >>>> >>>> >>>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov < tismagi...@mail.ru >: >>>> Hello! >>>> >>>> I have Open MPI v1.8.2rc4r32485 >>>> >>>> When i run hello_c, I got this error message >>>> $mpirun -np 2 hello_c >>>> >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> >>>> communicating back to mpirun. 
This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). >>>> >>>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca >>>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this output: >>>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 >>>> -mca rml_base_verbose 10 -np 2 hello_c >>>> >>>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated] >>>> [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] >>>> set priority to 0 >>>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh] >>>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set >>>> priority to 10 >>>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm] >>>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set >>>> priority to 75 >>>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm] >>>> [compiler-2:08780] mca: base: components_register: registering oob >>>> components >>>> [compiler-2:08780] mca: base: components_register: found loaded component >>>> tcp >>>> [compiler-2:08780] mca: base: components_register: component tcp register >>>> function successful >>>> [compiler-2:08780] mca: base: components_open: opening oob components >>>> [compiler-2:08780] mca: base: components_open: found loaded component tcp >>>> [compiler-2:08780] mca: base: components_open: component tcp open function >>>> successful >>>> [compiler-2:08780] mca:oob:select: checking available component tcp >>>> [compiler-2:08780] mca:oob:select: Querying component [tcp] >>>> [compiler-2:08780] oob:tcp: component_available called >>>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our >>>> list of V4 connections >>>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list >>>> of V4 connections >>>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our >>>> list of V4 connections >>>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our >>>> list of V4 connections >>>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our >>>> list of V4 connections >>>> [compiler-2:08780] [[42202,0],0] TCP STARTUP >>>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0 >>>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 >>>> [compiler-2:08780] mca:oob:select: Adding component to end >>>> [compiler-2:08780] mca:oob:select: Found 1 active transports >>>> [compiler-2:08780] mca: base: components_register: registering rml >>>> components >>>> [compiler-2:08780] mca: base: components_register: found loaded component >>>> oob >>>> [compiler-2:08780] mca: base: components_register: component oob has no >>>> register or open function >>>> [compiler-2:08780] mca: base: components_open: opening rml components >>>> [compiler-2:08780] mca: base: components_open: 
found loaded component oob >>>> [compiler-2:08780] mca: base: components_open: component oob open function >>>> successful >>>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for >>>> peer [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for peer >>>> [[WILDCARD],WILDCARD] >>>> [compiler-2:08780] [[42202,0],0] posting recv >>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for >>>> peer [[WILDCARD],WILDCARD] >>>> Daemon was launched on node1-130-08 - beginning to initialize >>>> Daemon was launched on node1-130-03 - beginning to initialize >>>> Daemon was launched on node1-130-05 - beginning to initialize >>>> Daemon was launched on node1-130-02 - beginning to initialize >>>> Daemon was launched on node1-130-01 - beginning to initialize >>>> Daemon was launched on node1-130-04 - beginning to initialize >>>> Daemon was launched on node1-130-07 - beginning to initialize >>>> Daemon was launched on node1-130-06 - beginning to initialize >>>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03 >>>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for >>>> commands! 
>>>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02 >>>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01 >>>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05 >>>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08 >>>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07 >>>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04 >>>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for >>>> commands! >>>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06 >>>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for >>>> commands! >>>> srun: error: node1-130-03: task 2: Exited with exit code 1 >>>> srun: Terminating job step 657040.1 >>>> srun: error: node1-130-02: task 1: Exited with exit code 1 >>>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH >>>> SIGNAL 9 *** >>>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH >>>> SIGNAL 9 *** >>>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH >>>> SIGNAL 9 *** >>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish. >>>> srun: error: node1-130-01: task 0: Exited with exit code 1 >>>> srun: error: node1-130-05: task 4: Exited with exit code 1 >>>> srun: error: node1-130-08: task 7: Exited with exit code 1 >>>> srun: error: node1-130-07: task 6: Exited with exit code 1 >>>> srun: error: node1-130-04: task 3: Killed >>>> srun: error: node1-130-06: task 5: Killed >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). 
>>>> -------------------------------------------------------------------------- >>>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd >>>> [compiler-2:08780] mca: base: close: component oob closed >>>> [compiler-2:08780] mca: base: close: unloading component oob >>>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN >>>> [compiler-2:08780] mca: base: close: component tcp closed >>>> [compiler-2:08780] mca: base: close: unloading component tcp >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2014/08/24987.php >>>> >>>> >>>> >>>> _______________________________________________ >>>> users mailing list >>>> us...@open-mpi.org >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>> Link to this post: >>>> http://www.open-mpi.org/community/lists/users/2014/08/24988.php >>> >>> >>> -- >>> Jeff Squyres >>> jsquy...@cisco.com >>> For corporate legal information go to: >>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>> >>> >>> >>> >>> >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> _______________________________________________ >> users mailing list >> us...@open-mpi.org >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >> Link to this post: >> http://www.open-mpi.org/community/lists/users/2014/08/25001.php > > >-- >Jeff Squyres >jsquy...@cisco.com >For corporate legal information go to: >http://www.cisco.com/web/about/doing_business/legal/cri/ > >
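P.S. In case it is useful, the hello_c used in all of the runs above is just the standard hello-world example that ships with Open MPI; a minimal sketch of it follows (assuming the usual examples/hello_c.c, which may print a little more, such as the MPI version):

/* hello_c.c - minimal MPI "hello world" sketch; build with: mpicc hello_c.c -o hello_c */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();                          /* shut down the MPI runtime */
    return 0;
}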