Is there any way for me to run MPI jobs?

Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain <r...@open-mpi.org>:
>Yes, I know - it has been CMR'd.
>
>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>btw, we get same error in v1.8 branch as well.
>>
>>
>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  < r...@open-mpi.org > wrote:
>>>It was not yet fixed - but should be now.
>>>
>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>
>>>>As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still 
>>>>have the problem:
>>>>
>>>>a)
>>>>$ mpirun  -np 1 ./hello_c
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>b)
>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
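One thing worth verifying at this point, since the out-of-band channel is plain
TCP: ib0 needs an IPoIB IPv4 address on the node running mpirun and on every
compute node. A quick check, purely a diagnostic sketch - the srun form assumes
your allocation lets you run arbitrary commands on a compute node:

$ ip -4 addr show ib0
$ srun -N 1 ip -4 addr show ib0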
>>>>
>>>>c)
>>>>
>>>>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 
>>>>5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
>>>>priority to 0
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>>>>priority to 10
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
>>>>priority to 75
>>>>[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>>>[compiler-2:14673] mca: base: components_register: registering oob 
>>>>components
>>>>[compiler-2:14673] mca: base: components_register: found loaded component 
>>>>tcp
>>>>[compiler-2:14673] mca: base: components_register: component tcp register 
>>>>function successful
>>>>[compiler-2:14673] mca: base: components_open: opening oob components
>>>>[compiler-2:14673] mca: base: components_open: found loaded component tcp
>>>>[compiler-2:14673] mca: base: components_open: component tcp open function 
>>>>successful
>>>>[compiler-2:14673] mca:oob:select: checking available component tcp
>>>>[compiler-2:14673] mca:oob:select: Querying component [tcp]
>>>>[compiler-2:14673] oob:tcp: component_available called
>>>>[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>>>>of V4 connections
>>>>[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] TCP STARTUP
>>>>[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>>>>[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>>>>[compiler-2:14673] mca:oob:select: Adding component to end
>>>>[compiler-2:14673] mca:oob:select: Found 1 active transports
>>>>[compiler-2:14673] mca: base: components_register: registering rml 
>>>>components
>>>>[compiler-2:14673] mca: base: components_register: found loaded component 
>>>>oob
>>>>[compiler-2:14673] mca: base: components_register: component oob has no 
>>>>register or open function
>>>>[compiler-2:14673] mca: base: components_open: opening rml components
>>>>[compiler-2:14673] mca: base: components_open: found loaded component oob
>>>>[compiler-2:14673] mca: base: components_open: component oob open function 
>>>>successful
>>>>[compiler-2:14673] orte_rml_base_select: initializing rml component oob
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>[compiler-2:14673] [[49095,0],0] posting recv
>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer 
>>>>[[WILDCARD],WILDCARD]
>>>>Daemon was launched on node1-128-01 - beginning to initialize
>>>>--------------------------------------------------------------------------
>>>>WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>value will be ignored.
>>>>Local host: node1-128-01
>>>>Value: "ib0"
>>>>Message: Invalid specification (missing "/")
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>None of the TCP networks specified to be included for out-of-band 
>>>>communications
>>>>could be found:
>>>>Value given:
>>>>Please revise the specification and try again.
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>No network interfaces were found for out-of-band communications. We require
>>>>at least one available network for out-of-band messaging.
>>>>--------------------------------------------------------------------------
>>>>--------------------------------------------------------------------------
>>>>It looks like orte_init failed for some reason; your parallel process is
>>>>likely to abort. There are many reasons that a parallel process can
>>>>fail during orte_init; some of which are due to configuration or
>>>>environment problems. This failure appears to be an internal failure;
>>>>here's some additional information (which may only be relevant to an
>>>>Open MPI developer):
>>>>orte_oob_base_select failed
>>>>--> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>--------------------------------------------------------------------------
>>>>srun: error: node1-128-01: task 0: Exited with exit code 213
>>>>srun: Terminating job step 661215.0
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>[compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
>>>>[compiler-2:14673] mca: base: close: component oob closed
>>>>[compiler-2:14673] mca: base: close: unloading component oob
>>>>[compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
>>>>[compiler-2:14673] mca: base: close: component tcp closed
>>>>[compiler-2:14673] mca: base: close: unloading component tcp
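The daemon-side warning above ('Invalid specification (missing "/")') suggests
that this snapshot's parser for oob_tcp_if_include only accepts subnet
specifications in CIDR form and rejects bare interface names. Until a build
containing the fix is available, passing the IPoIB subnet explicitly may serve
as an interim workaround; this is only a sketch, and the 10.128.0.0/16 subnet
is an assumption based on the 10.128.0.4 address shown above (substitute the
actual subnet and prefix length of ib0 on the cluster):

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c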
>>>>
>>>>
>>>>Tue, 12 Aug 2014 18:33:24 +0000 от "Jeff Squyres (jsquyres)" < 
>>>>jsquy...@cisco.com >:
>>>>>I filed the following ticket:
>>>>>
>>>>>     https://svn.open-mpi.org/trac/ompi/ticket/4857
>>>>>
>>>>>
>>>>>On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) < jsquy...@cisco.com > wrote:
>>>>>
>>>>>> (please keep the users list CC'ed)
>>>>>>  
>>>>>> We talked about this on the weekly engineering call today. Ralph has an 
>>>>>> idea what is happening -- I need to do a little investigation today and 
>>>>>> file a bug. I'll make sure you're CC'ed on the bug ticket.
>>>>>>  
>>>>>>  
>>>>>>  
>>>>>> On Aug 12, 2014, at 12:27 PM, Timur Ismagilov < tismagi...@mail.ru > 
>>>>>> wrote:
>>>>>>  
>>>>>>> I don't get this error with OMPI 1.9a1r32252 or OMPI 1.8.1 (with --mca 
>>>>>>> oob_tcp_if_include ib0), but with all of the latest nightly snapshots I 
>>>>>>> get this error.
>>>>>>>  
>>>>>>>  
>>>>>>> Tue, 12 Aug 2014 13:08:12 +0000 от "Jeff Squyres (jsquyres)" < 
>>>>>>> jsquy...@cisco.com >:
>>>>>>> Are you running any kind of firewall on the node where mpirun is 
>>>>>>> invoked? Open MPI needs to be able to use arbitrary TCP ports between 
>>>>>>> the servers on which it runs.
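If a firewall is suspected, one way to see which out-of-band TCP knobs this
build exposes - port ranges, interface include/exclude lists, and so on - is
to dump the oob/tcp MCA parameters. Purely a diagnostic sketch:

$ ompi_info --param oob tcp --level 9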
>>>>>>>  
>>>>>>> This second mail seems to imply a bug in OMPI's oob_tcp_if_include 
>>>>>>> param handling, however -- it's supposed to be able to handle an 
>>>>>>> interface name (not just a network specification).
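For reference, the intended behaviour appears to be that oob_tcp_if_include
takes a comma-delimited list of interface names and/or subnets in CIDR
notation, so both of these forms should be accepted; the subnet in the second
command is only illustrative:

$ mpirun --mca oob_tcp_if_include ib0 -np 2 ./hello_c
$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 2 ./hello_c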
>>>>>>>  
>>>>>>> Ralph -- can you have a look?
>>>>>>>  
>>>>>>>  
>>>>>>> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>>>>>> wrote:
>>>>>>>  
>>>>>>>> When I add --mca oob_tcp_if_include ib0 (the InfiniBand interface) to 
>>>>>>>> mpirun (as suggested here: 
>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ), I 
>>>>>>>> get this output:
>>>>>>>>  
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated]
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component 
>>>>>>>> [isolated] set priority to 0
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh]
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set 
>>>>>>>> priority to 10
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm]
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] 
>>>>>>>> set priority to 75
>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm]
>>>>>>>> [compiler-2:08792] mca: base: components_register: registering oob 
>>>>>>>> components
>>>>>>>> [compiler-2:08792] mca: base: components_register: found loaded 
>>>>>>>> component tcp
>>>>>>>> [compiler-2:08792] mca: base: components_register: component tcp 
>>>>>>>> register function successful
>>>>>>>> [compiler-2:08792] mca: base: components_open: opening oob components
>>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded component 
>>>>>>>> tcp
>>>>>>>> [compiler-2:08792] mca: base: components_open: component tcp open 
>>>>>>>> function successful
>>>>>>>> [compiler-2:08792] mca:oob:select: checking available component tcp
>>>>>>>> [compiler-2:08792] mca:oob:select: Querying component [tcp]
>>>>>>>> [compiler-2:08792] oob:tcp: component_available called
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>>>>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our 
>>>>>>>> list of V4 connections
>>>>>>>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>>>>> [compiler-2:08792] [[42190,0],0] TCP STARTUP
>>>>>>>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0
>>>>>>>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883
>>>>>>>> [compiler-2:08792] mca:oob:select: Adding component to end
>>>>>>>> [compiler-2:08792] mca:oob:select: Found 1 active transports
>>>>>>>> [compiler-2:08792] mca: base: components_register: registering rml 
>>>>>>>> components
>>>>>>>> [compiler-2:08792] mca: base: components_register: found loaded 
>>>>>>>> component oob
>>>>>>>> [compiler-2:08792] mca: base: components_register: component oob has 
>>>>>>>> no register or open function
>>>>>>>> [compiler-2:08792] mca: base: components_open: opening rml components
>>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded component 
>>>>>>>> oob
>>>>>>>> [compiler-2:08792] mca: base: components_open: component oob open 
>>>>>>>> function successful
>>>>>>>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> Daemon was launched on node1-128-01 - beginning to initialize
>>>>>>>> Daemon was launched on node1-128-02 - beginning to initialize
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>>>>> value will be ignored.
>>>>>>>>  
>>>>>>>> Local host: node1-128-01
>>>>>>>> Value: "ib0"
>>>>>>>> Message: Invalid specification (missing "/")
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>>>>> value will be ignored.
>>>>>>>>  
>>>>>>>> Local host: node1-128-02
>>>>>>>> Value: "ib0"
>>>>>>>> Message: Invalid specification (missing "/")
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> None of the TCP networks specified to be included for out-of-band 
>>>>>>>> communications
>>>>>>>> could be found:
>>>>>>>>  
>>>>>>>> Value given:
>>>>>>>>  
>>>>>>>> Please revise the specification and try again.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> None of the TCP networks specified to be included for out-of-band 
>>>>>>>> communications
>>>>>>>> could be found:
>>>>>>>>  
>>>>>>>> Value given:
>>>>>>>>  
>>>>>>>> Please revise the specification and try again.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> No network interfaces were found for out-of-band communications. We 
>>>>>>>> require
>>>>>>>> at least one available network for out-of-band messaging.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> No network interfaces were found for out-of-band communications. We 
>>>>>>>> require
>>>>>>>> at least one available network for out-of-band messaging.
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like orte_init failed for some reason; your parallel process 
>>>>>>>> is
>>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>> Open MPI developer):
>>>>>>>>  
>>>>>>>> orte_oob_base_select failed
>>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> It looks like orte_init failed for some reason; your parallel process 
>>>>>>>> is
>>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>>> Open MPI developer):
>>>>>>>>  
>>>>>>>> orte_oob_base_select failed
>>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> srun: error: node1-128-02: task 1: Exited with exit code 213
>>>>>>>> srun: Terminating job step 657300.0
>>>>>>>> srun: error: node1-128-01: task 0: Exited with exit code 213
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>>> of factors, including an inability to create a connection back
>>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>>> route found between them. Please check network connectivity
>>>>>>>> (including firewalls and network routing requirements).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd
>>>>>>>> [compiler-2:08792] mca: base: close: component oob closed
>>>>>>>> [compiler-2:08792] mca: base: close: unloading component oob
>>>>>>>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN
>>>>>>>> [compiler-2:08792] mca: base: close: component tcp closed
>>>>>>>> [compiler-2:08792] mca: base: close: unloading component tcp
>>>>>>>>  
>>>>>>>>  
>>>>>>>>  
>>>>>>>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov < 
>>>>>>>> tismagi...@mail.ru >:
>>>>>>>> Hello!
>>>>>>>>  
>>>>>>>> I have Open MPI v1.8.2rc4r32485
>>>>>>>>  
>>>>>>>> When I run hello_c, I get this error message:
>>>>>>>> $mpirun -np 2 hello_c
>>>>>>>>  
>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>>> of factors, including an inability to create a connection back
>>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>>> route found between them. Please check network connectivity
>>>>>>>> (including firewalls and network routing requirements).
>>>>>>>>  
>>>>>>>> When I run with --debug-daemons --mca plm_base_verbose 5 -mca 
>>>>>>>> oob_base_verbose 10 -mca rml_base_verbose 10, I get this output:
>>>>>>>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 
>>>>>>>> 10 -mca rml_base_verbose 10 -np 2 hello_c
>>>>>>>>  
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated]
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component 
>>>>>>>> [isolated] set priority to 0
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh]
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set 
>>>>>>>> priority to 10
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm]
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] 
>>>>>>>> set priority to 75
>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm]
>>>>>>>> [compiler-2:08780] mca: base: components_register: registering oob 
>>>>>>>> components
>>>>>>>> [compiler-2:08780] mca: base: components_register: found loaded 
>>>>>>>> component tcp
>>>>>>>> [compiler-2:08780] mca: base: components_register: component tcp 
>>>>>>>> register function successful
>>>>>>>> [compiler-2:08780] mca: base: components_open: opening oob components
>>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded component 
>>>>>>>> tcp
>>>>>>>> [compiler-2:08780] mca: base: components_open: component tcp open 
>>>>>>>> function successful
>>>>>>>> [compiler-2:08780] mca:oob:select: checking available component tcp
>>>>>>>> [compiler-2:08780] mca:oob:select: Querying component [tcp]
>>>>>>>> [compiler-2:08780] oob:tcp: component_available called
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to 
>>>>>>>> our list of V4 connections
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our 
>>>>>>>> list of V4 connections
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to 
>>>>>>>> our list of V4 connections
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our 
>>>>>>>> list of V4 connections
>>>>>>>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to 
>>>>>>>> our list of V4 connections
>>>>>>>> [compiler-2:08780] [[42202,0],0] TCP STARTUP
>>>>>>>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0
>>>>>>>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420
>>>>>>>> [compiler-2:08780] mca:oob:select: Adding component to end
>>>>>>>> [compiler-2:08780] mca:oob:select: Found 1 active transports
>>>>>>>> [compiler-2:08780] mca: base: components_register: registering rml 
>>>>>>>> components
>>>>>>>> [compiler-2:08780] mca: base: components_register: found loaded 
>>>>>>>> component oob
>>>>>>>> [compiler-2:08780] mca: base: components_register: component oob has 
>>>>>>>> no register or open function
>>>>>>>> [compiler-2:08780] mca: base: components_open: opening rml components
>>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded component 
>>>>>>>> oob
>>>>>>>> [compiler-2:08780] mca: base: components_open: component oob open 
>>>>>>>> function successful
>>>>>>>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for 
>>>>>>>> peer [[WILDCARD],WILDCARD]
>>>>>>>> Daemon was launched on node1-130-08 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-03 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-05 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-02 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-01 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-04 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-07 - beginning to initialize
>>>>>>>> Daemon was launched on node1-130-06 - beginning to initialize
>>>>>>>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03
>>>>>>>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02
>>>>>>>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01
>>>>>>>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05
>>>>>>>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08
>>>>>>>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07
>>>>>>>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04
>>>>>>>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06
>>>>>>>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for 
>>>>>>>> commands!
>>>>>>>> srun: error: node1-130-03: task 2: Exited with exit code 1
>>>>>>>> srun: Terminating job step 657040.1
>>>>>>>> srun: error: node1-130-02: task 1: Exited with exit code 1
>>>>>>>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 
>>>>>>>> WITH SIGNAL 9 ***
>>>>>>>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 
>>>>>>>> WITH SIGNAL 9 ***
>>>>>>>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 
>>>>>>>> WITH SIGNAL 9 ***
>>>>>>>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>>>>>>> srun: error: node1-130-01: task 0: Exited with exit code 1
>>>>>>>> srun: error: node1-130-05: task 4: Exited with exit code 1
>>>>>>>> srun: error: node1-130-08: task 7: Exited with exit code 1
>>>>>>>> srun: error: node1-130-07: task 6: Exited with exit code 1
>>>>>>>> srun: error: node1-130-04: task 3: Killed
>>>>>>>> srun: error: node1-130-06: task 5: Killed
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>>> of factors, including an inability to create a connection back
>>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>>> route found between them. Please check network connectivity
>>>>>>>> (including firewalls and network routing requirements).
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
>>>>>>>> [compiler-2:08780] mca: base: close: component oob closed
>>>>>>>> [compiler-2:08780] mca: base: close: unloading component oob
>>>>>>>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
>>>>>>>> [compiler-2:08780] mca: base: close: component tcp closed
>>>>>>>> [compiler-2:08780] mca: base: close: unloading component tcp
>>>>>>>>  
>>>>>>>  
>>>>>>>  
>>>>>>> --  
>>>>>>> Jeff Squyres
>>>>>>>   jsquy...@cisco.com
>>>>>>> For corporate legal information go to:   
>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>  
>>>>>>>  
>>>>>>>  
>>>>>>>  
>>>>>>> 
>>>>>>  
>>>>>>  
>>>>>> --  
>>>>>> Jeff Squyres
>>>>>>   jsquy...@cisco.com
>>>>>> For corporate legal information go to:   
>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>  
>>>>>
>>>>>
>>>>>--  
>>>>>Jeff Squyres
>>>>>jsquy...@cisco.com
>>>>>For corporate legal information go to:   
>>>>>http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>>
>>>>
>>>>
>>>>