Hello! Do I have any way to run MPI jobs in the meantime?
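
One thing I plan to try (this is only my guess based on the 'Invalid specification (missing "/")' warning quoted below; it is not confirmed anywhere in this thread) is to give oob_tcp_if_include the ib0 network in CIDR notation instead of the interface name. In the runs below where ib0 is requested, only 10.128.0.4 gets added, so that appears to be the ib0 address:

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c

(The /16 prefix length is an assumption on my part; it needs to match the real netmask of ib0 on our nodes.)
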
Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain <r...@open-mpi.org>:
>yes, i know - it is cmr'd
>
>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>btw, we get same error in v1.8 branch as well.
>>
>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain < r...@open-mpi.org > wrote:
>>>It was not yet fixed - but should be now.
>>>
>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>
>>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have the problem
>>>>
>>>>a)
>>>>$ mpirun -np 1 ./hello_c
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>
>>>>b)
>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>
>>>>c)
>>>>
>>>>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>>[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>>>[compiler-2:14673] mca: base: components_register: registering oob components
>>>>[compiler-2:14673] mca: base: components_register: found loaded component tcp
>>>>[compiler-2:14673] mca: base: components_register: component tcp register function successful
>>>>[compiler-2:14673] mca: base: components_open: opening oob components
>>>>[compiler-2:14673] mca: base: components_open: found loaded component tcp
>>>>[compiler-2:14673] mca: base: components_open: component tcp open function successful
>>>>[compiler-2:14673] mca:oob:select: checking available component tcp
>>>>[compiler-2:14673] mca:oob:select: Querying component [tcp]
>>>>[compiler-2:14673] oob:tcp: component_available called
>>>>[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>>>>[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] TCP STARTUP
>>>>[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>>>>[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>>>>[compiler-2:14673] mca:oob:select: Adding component to end
>>>>[compiler-2:14673] mca:oob:select: Found 1 active transports
>>>>[compiler-2:14673] mca: base: components_register: registering rml components
>>>>[compiler-2:14673] mca: base: components_register: found loaded component oob
>>>>[compiler-2:14673] mca: base: components_register: component oob has no register or open function
>>>>[compiler-2:14673] mca: base: components_open: opening rml components
>>>>[compiler-2:14673] mca: base: components_open: found loaded component oob
>>>>[compiler-2:14673] mca: base: components_open: component oob open function successful
>>>>[compiler-2:14673] orte_rml_base_select: initializing rml component oob
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>>>Daemon was launched on node1-128-01 - beginning to initialize
>>>>--------------------------------------------------------------------------
>>>>WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>value will be ignored.
>>>>
>>>>Local host: node1-128-01
>>>>Value: "ib0"
>>>>Message: Invalid specification (missing "/")
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>None of the TCP networks specified to be included for out-of-band communications
>>>>could be found:
>>>>
>>>>Value given:
>>>>
>>>>Please revise the specification and try again.
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>No network interfaces were found for out-of-band communications. We require
>>>>at least one available network for out-of-band messaging.
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>It looks like orte_init failed for some reason; your parallel process is
>>>>likely to abort. There are many reasons that a parallel process can
>>>>fail during orte_init; some of which are due to configuration or
>>>>environment problems. This failure appears to be an internal failure;
>>>>here's some additional information (which may only be relevant to an
>>>>Open MPI developer):
>>>>
>>>>orte_oob_base_select failed
>>>>--> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>--------------------------------------------------------------------------
>>>>srun: error: node1-128-01: task 0: Exited with exit code 213
>>>>srun: Terminating job step 661215.0
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>[compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
>>>>[compiler-2:14673] mca: base: close: component oob closed
>>>>[compiler-2:14673] mca: base: close: unloading component oob
>>>>[compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
>>>>[compiler-2:14673] mca: base: close: component tcp closed
>>>>[compiler-2:14673] mca: base: close: unloading component tcp
>>>>
>>>>
>>>>Tue, 12 Aug 2014 18:33:24 +0000 from "Jeff Squyres (jsquyres)" < jsquy...@cisco.com >:
>>>>>I filed the following ticket:
>>>>>
>>>>> https://svn.open-mpi.org/trac/ompi/ticket/4857
>>>>>
>>>>>
>>>>>On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) < jsquy...@cisco.com > wrote:
>>>>>
>>>>>> (please keep the users list CC'ed)
>>>>>>
>>>>>> We talked about this on the weekly engineering call today. Ralph has an
>>>>>> idea what is happening -- I need to do a little investigation today and
>>>>>> file a bug. I'll make sure you're CC'ed on the bug ticket.
>>>>>> >>>>>> >>>>>> >>>>>> On Aug 12, 2014, at 12:27 PM, Timur Ismagilov < tismagi...@mail.ru > >>>>>> wrote: >>>>>> >>>>>>> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca >>>>>>> oob_tcp_if_include ib0), but in all latest night snapshots i got this >>>>>>> error. >>>>>>> >>>>>>> >>>>>>> Tue, 12 Aug 2014 13:08:12 +0000 от "Jeff Squyres (jsquyres)" < >>>>>>> jsquy...@cisco.com >: >>>>>>> Are you running any kind of firewall on the node where mpirun is >>>>>>> invoked? Open MPI needs to be able to use arbitrary TCP ports between >>>>>>> the servers on which it runs. >>>>>>> >>>>>>> This second mail seems to imply a bug in OMPI's oob_tcp_if_include >>>>>>> param handling, however -- it's supposed to be able to handle an >>>>>>> interface name (not just a network specification). >>>>>>> >>>>>>> Ralph -- can you have a look? >>>>>>> >>>>>>> >>>>>>> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov < tismagi...@mail.ru > >>>>>>> wrote: >>>>>>> >>>>>>>> When i add --mca oob_tcp_if_include ib0 (infiniband interface) to >>>>>>>> mpirun (as it was here: >>>>>>>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) i >>>>>>>> got this output: >>>>>>>> >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated] >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component >>>>>>>> [isolated] set priority to 0 >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh] >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set >>>>>>>> priority to 10 >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm] >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] >>>>>>>> set priority to 75 >>>>>>>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm] >>>>>>>> [compiler-2:08792] mca: base: components_register: registering oob >>>>>>>> components >>>>>>>> [compiler-2:08792] mca: base: components_register: found loaded >>>>>>>> component tcp >>>>>>>> [compiler-2:08792] mca: base: components_register: component tcp >>>>>>>> register function successful >>>>>>>> [compiler-2:08792] mca: base: components_open: opening oob components >>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded component >>>>>>>> tcp >>>>>>>> [compiler-2:08792] mca: base: components_open: component tcp open >>>>>>>> function successful >>>>>>>> [compiler-2:08792] mca:oob:select: checking available component tcp >>>>>>>> [compiler-2:08792] mca:oob:select: Querying component [tcp] >>>>>>>> [compiler-2:08792] oob:tcp: component_available called >>>>>>>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>>>>>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>>>>>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>>>>>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>>>>>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>>>>>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our >>>>>>>> list of V4 connections >>>>>>>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>>>>>> [compiler-2:08792] [[42190,0],0] TCP STARTUP >>>>>>>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0 >>>>>>>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 >>>>>>>> [compiler-2:08792] mca:oob:select: Adding component to end >>>>>>>> [compiler-2:08792] mca:oob:select: Found 1 active transports >>>>>>>> [compiler-2:08792] mca: base: 
components_register: registering rml >>>>>>>> components >>>>>>>> [compiler-2:08792] mca: base: components_register: found loaded >>>>>>>> component oob >>>>>>>> [compiler-2:08792] mca: base: components_register: component oob has >>>>>>>> no register or open function >>>>>>>> [compiler-2:08792] mca: base: components_open: opening rml components >>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded component >>>>>>>> oob >>>>>>>> [compiler-2:08792] mca: base: components_open: component oob open >>>>>>>> function successful >>>>>>>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> Daemon was launched on node1-128-01 - beginning to initialize >>>>>>>> Daemon was launched on node1-128-02 - beginning to initialize >>>>>>>> 
-------------------------------------------------------------------------- >>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>>>>>> value will be ignored. >>>>>>>> >>>>>>>> Local host: node1-128-01 >>>>>>>> Value: "ib0" >>>>>>>> Message: Invalid specification (missing "/") >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. This >>>>>>>> value will be ignored. >>>>>>>> >>>>>>>> Local host: node1-128-02 >>>>>>>> Value: "ib0" >>>>>>>> Message: Invalid specification (missing "/") >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> None of the TCP networks specified to be included for out-of-band >>>>>>>> communications >>>>>>>> could be found: >>>>>>>> >>>>>>>> Value given: >>>>>>>> >>>>>>>> Please revise the specification and try again. >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> None of the TCP networks specified to be included for out-of-band >>>>>>>> communications >>>>>>>> could be found: >>>>>>>> >>>>>>>> Value given: >>>>>>>> >>>>>>>> Please revise the specification and try again. >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> No network interfaces were found for out-of-band communications. We >>>>>>>> require >>>>>>>> at least one available network for out-of-band messaging. >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> No network interfaces were found for out-of-band communications. We >>>>>>>> require >>>>>>>> at least one available network for out-of-band messaging. >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> It looks like orte_init failed for some reason; your parallel process >>>>>>>> is >>>>>>>> likely to abort. There are many reasons that a parallel process can >>>>>>>> fail during orte_init; some of which are due to configuration or >>>>>>>> environment problems. This failure appears to be an internal failure; >>>>>>>> here's some additional information (which may only be relevant to an >>>>>>>> Open MPI developer): >>>>>>>> >>>>>>>> orte_oob_base_select failed >>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> It looks like orte_init failed for some reason; your parallel process >>>>>>>> is >>>>>>>> likely to abort. There are many reasons that a parallel process can >>>>>>>> fail during orte_init; some of which are due to configuration or >>>>>>>> environment problems. 
This failure appears to be an internal failure; >>>>>>>> here's some additional information (which may only be relevant to an >>>>>>>> Open MPI developer): >>>>>>>> >>>>>>>> orte_oob_base_select failed >>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> srun: error: node1-128-02: task 1: Exited with exit code 213 >>>>>>>> srun: Terminating job step 657300.0 >>>>>>>> srun: error: node1-128-01: task 0: Exited with exit code 213 >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>> of factors, including an inability to create a connection back >>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>> route found between them. Please check network connectivity >>>>>>>> (including firewalls and network routing requirements). >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd >>>>>>>> [compiler-2:08792] mca: base: close: component oob closed >>>>>>>> [compiler-2:08792] mca: base: close: unloading component oob >>>>>>>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN >>>>>>>> [compiler-2:08792] mca: base: close: component tcp closed >>>>>>>> [compiler-2:08792] mca: base: close: unloading component tcp >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov < >>>>>>>> tismagi...@mail.ru >: >>>>>>>> Hello! >>>>>>>> >>>>>>>> I have Open MPI v1.8.2rc4r32485 >>>>>>>> >>>>>>>> When i run hello_c, I got this error message >>>>>>>> $mpirun -np 2 hello_c >>>>>>>> >>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>> >>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>> of factors, including an inability to create a connection back >>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>> route found between them. Please check network connectivity >>>>>>>> (including firewalls and network routing requirements). 
>>>>>>>> >>>>>>>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca >>>>>>>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this output: >>>>>>>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose >>>>>>>> 10 -mca rml_base_verbose 10 -np 2 hello_c >>>>>>>> >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated] >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component >>>>>>>> [isolated] set priority to 0 >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh] >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set >>>>>>>> priority to 10 >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm] >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] >>>>>>>> set priority to 75 >>>>>>>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm] >>>>>>>> [compiler-2:08780] mca: base: components_register: registering oob >>>>>>>> components >>>>>>>> [compiler-2:08780] mca: base: components_register: found loaded >>>>>>>> component tcp >>>>>>>> [compiler-2:08780] mca: base: components_register: component tcp >>>>>>>> register function successful >>>>>>>> [compiler-2:08780] mca: base: components_open: opening oob components >>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded component >>>>>>>> tcp >>>>>>>> [compiler-2:08780] mca: base: components_open: component tcp open >>>>>>>> function successful >>>>>>>> [compiler-2:08780] mca:oob:select: checking available component tcp >>>>>>>> [compiler-2:08780] mca:oob:select: Querying component [tcp] >>>>>>>> [compiler-2:08780] oob:tcp: component_available called >>>>>>>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>>>>>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to >>>>>>>> our list of V4 connections >>>>>>>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our >>>>>>>> list of V4 connections >>>>>>>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to >>>>>>>> our list of V4 connections >>>>>>>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our >>>>>>>> list of V4 connections >>>>>>>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to >>>>>>>> our list of V4 connections >>>>>>>> [compiler-2:08780] [[42202,0],0] TCP STARTUP >>>>>>>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0 >>>>>>>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 >>>>>>>> [compiler-2:08780] mca:oob:select: Adding component to end >>>>>>>> [compiler-2:08780] mca:oob:select: Found 1 active transports >>>>>>>> [compiler-2:08780] mca: base: components_register: registering rml >>>>>>>> components >>>>>>>> [compiler-2:08780] mca: base: components_register: found loaded >>>>>>>> component oob >>>>>>>> [compiler-2:08780] mca: base: components_register: component oob has >>>>>>>> no register or open function >>>>>>>> [compiler-2:08780] mca: base: components_open: opening rml components >>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded component >>>>>>>> oob >>>>>>>> 
[compiler-2:08780] mca: base: components_open: component oob open >>>>>>>> function successful >>>>>>>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for >>>>>>>> peer [[WILDCARD],WILDCARD] >>>>>>>> Daemon was launched on node1-130-08 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-03 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-05 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-02 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-01 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-04 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-07 - beginning to initialize >>>>>>>> Daemon was launched on node1-130-06 - beginning to initialize >>>>>>>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03 
>>>>>>>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02 >>>>>>>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01 >>>>>>>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05 >>>>>>>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08 >>>>>>>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07 >>>>>>>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04 >>>>>>>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06 >>>>>>>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for >>>>>>>> commands! >>>>>>>> srun: error: node1-130-03: task 2: Exited with exit code 1 >>>>>>>> srun: Terminating job step 657040.1 >>>>>>>> srun: error: node1-130-02: task 1: Exited with exit code 1 >>>>>>>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>>>>>> WITH SIGNAL 9 *** >>>>>>>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>>>>>> WITH SIGNAL 9 *** >>>>>>>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >>>>>>>> WITH SIGNAL 9 *** >>>>>>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish. >>>>>>>> srun: error: node1-130-01: task 0: Exited with exit code 1 >>>>>>>> srun: error: node1-130-05: task 4: Exited with exit code 1 >>>>>>>> srun: error: node1-130-08: task 7: Exited with exit code 1 >>>>>>>> srun: error: node1-130-07: task 6: Exited with exit code 1 >>>>>>>> srun: error: node1-130-04: task 3: Killed >>>>>>>> srun: error: node1-130-06: task 5: Killed >>>>>>>> -------------------------------------------------------------------------- >>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>> of factors, including an inability to create a connection back >>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>> route found between them. Please check network connectivity >>>>>>>> (including firewalls and network routing requirements). 
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
>>>>>>>> [compiler-2:08780] mca: base: close: component oob closed
>>>>>>>> [compiler-2:08780] mca: base: close: unloading component oob
>>>>>>>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
>>>>>>>> [compiler-2:08780] mca: base: close: component tcp closed
>>>>>>>> [compiler-2:08780] mca: base: close: unloading component tcp
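
P.S. To rule out the compute nodes simply not having a usable ib0 for the daemons to pick up, I can also check which IPv4 addresses are configured there. This is just a quick sanity check under SLURM (assuming the ip utility is installed on the nodes):

$ srun -N 1 ip -4 addr show ib0

If ib0 were down or unnumbered on a compute node, the orted there would have nothing matching to include even once the oob_tcp_if_include parsing is fixed.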