Yes, I know - it has been CMR'd.

On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> BTW, we get the same error in the v1.8 branch as well.
>
>
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
> It was not yet fixed - but should be now.
>
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
>> Hello!
>>
>> As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still
>> have the problem.
>>
>> a)
>> $ mpirun -np 1 ./hello_c
>>
>> --------------------------------------------------------------------------
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --------------------------------------------------------------------------
>>
>> b)
>> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>> --------------------------------------------------------------------------
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --------------------------------------------------------------------------
>>
>> c)
>> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>>
>> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0
>> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75
>> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>> [compiler-2:14673] mca: base: components_register: registering oob components
>> [compiler-2:14673] mca: base: components_register: found loaded component tcp
>> [compiler-2:14673] mca: base: components_register: component tcp register function successful
>> [compiler-2:14673] mca: base: components_open: opening oob components
>> [compiler-2:14673] mca: base: components_open: found loaded component tcp
>> [compiler-2:14673] mca: base: components_open: component tcp open function successful
>> [compiler-2:14673] mca:oob:select: checking available component tcp
>> [compiler-2:14673] mca:oob:select: Querying component [tcp]
>> [compiler-2:14673] oob:tcp: component_available called
>> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>> [compiler-2:14673] [[49095,0],0] TCP STARTUP
>> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>> [compiler-2:14673] mca:oob:select: Adding component to end
>> [compiler-2:14673] mca:oob:select: Found 1 active transports
>> [compiler-2:14673] mca: base: components_register: registering rml components
>> [compiler-2:14673] mca: base: components_register: found loaded component oob
>> [compiler-2:14673] mca: base: components_register: component oob has no register or open function
>> [compiler-2:14673] mca: base: components_open: opening rml components
>> [compiler-2:14673] mca: base: components_open: found loaded component oob
>> [compiler-2:14673] mca: base: components_open: component oob open function successful
>> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>> [compiler-2:14673] [[49095,0],0] posting recv
>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>> Daemon was launched on node1-128-01 - beginning to initialize
>> --------------------------------------------------------------------------
>> WARNING: An invalid value was given for oob_tcp_if_include. This
>> value will be ignored.
>>
>> Local host: node1-128-01
>> Value: "ib0"
>> Message: Invalid specification (missing "/")
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> None of the TCP networks specified to be included for out-of-band communications
>> could be found:
>>
>> Value given:
>>
>> Please revise the specification and try again.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> No network interfaces were found for out-of-band communications. We require
>> at least one available network for out-of-band messaging.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> orte_oob_base_select failed
>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> srun: error: node1-128-01: task 0: Exited with exit code 213
>> srun: Terminating job step 661215.0
>> --------------------------------------------------------------------------
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --------------------------------------------------------------------------
>> [compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
>> [compiler-2:14673] mca: base: close: component oob closed
>> [compiler-2:14673] mca: base: close: unloading component oob
>> [compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
>> [compiler-2:14673] mca: base: close: component tcp closed
>> [compiler-2:14673] mca: base: close: unloading component tcp
>>
>>
>> Tue, 12 Aug 2014 18:33:24 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
>> I filed the following ticket:
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/4857
>>
>>
>> On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>
>> > (please keep the users list CC'ed)
>> >
>> > We talked about this on the weekly engineering call today. Ralph has an
>> > idea what is happening -- I need to do a little investigation today and
>> > file a bug. I'll make sure you're CC'ed on the bug ticket.
>> >
>> >
>> > On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>> >
>> >> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca
>> >> oob_tcp_if_include ib0), but in all of the latest nightly snapshots I get
>> >> this error.
>> >>
>> >>
>> >> Tue, 12 Aug 2014 13:08:12 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
>> >> Are you running any kind of firewall on the node where mpirun is invoked?
>> >> Open MPI needs to be able to use arbitrary TCP ports between the servers
>> >> on which it runs.
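The "Invalid specification (missing "/")" warning above suggests these nightly builds only parse CIDR-style network specifications for oob_tcp_if_include and reject plain interface names. Until the fix reaches the branch, a hedged workaround sketch (not a confirmed fix) is to pass the IPoIB subnet directly; the 10.128.0.0/16 value below is only an assumption inferred from the 10.128.0.4 address printed in the log, so substitute the real subnet and prefix length for ib0:

# assumed IPoIB subnet for ib0, inferred from "adding 10.128.0.4" in the log above
$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c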
>> >> >> >> This second mail seems to imply a bug in OMPI's oob_tcp_if_include param >> >> handling, however -- it's supposed to be able to handle an interface name >> >> (not just a network specification). >> >> >> >> Ralph -- can you have a look? >> >> >> >> >> >> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote: >> >> >> >>> When i add --mca oob_tcp_if_include ib0 (infiniband interface) to mpirun >> >>> (as it was here: >> >>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) i got >> >>> this output: >> >>> >> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated] >> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] >> >>> set priority to 0 >> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh] >> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set >> >>> priority to 10 >> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm] >> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set >> >>> priority to 75 >> >>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm] >> >>> [compiler-2:08792] mca: base: components_register: registering oob >> >>> components >> >>> [compiler-2:08792] mca: base: components_register: found loaded >> >>> component tcp >> >>> [compiler-2:08792] mca: base: components_register: component tcp >> >>> register function successful >> >>> [compiler-2:08792] mca: base: components_open: opening oob components >> >>> [compiler-2:08792] mca: base: components_open: found loaded component tcp >> >>> [compiler-2:08792] mca: base: components_open: component tcp open >> >>> function successful >> >>> [compiler-2:08792] mca:oob:select: checking available component tcp >> >>> [compiler-2:08792] mca:oob:select: Querying component [tcp] >> >>> [compiler-2:08792] oob:tcp: component_available called >> >>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >> >>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >> >>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >> >>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >> >>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >> >>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our >> >>> list of V4 connections >> >>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >> >>> [compiler-2:08792] [[42190,0],0] TCP STARTUP >> >>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0 >> >>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 >> >>> [compiler-2:08792] mca:oob:select: Adding component to end >> >>> [compiler-2:08792] mca:oob:select: Found 1 active transports >> >>> [compiler-2:08792] mca: base: components_register: registering rml >> >>> components >> >>> [compiler-2:08792] mca: base: components_register: found loaded >> >>> component oob >> >>> [compiler-2:08792] mca: base: components_register: component oob has no >> >>> register or open function >> >>> [compiler-2:08792] mca: base: components_open: opening rml components >> >>> [compiler-2:08792] mca: base: components_open: found loaded component oob >> >>> [compiler-2:08792] mca: base: components_open: component oob open >> >>> function successful >> >>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for >> 
>>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08792] [[42190,0],0] posting recv >> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> Daemon was launched on node1-128-01 - beginning to initialize >> >>> Daemon was launched on node1-128-02 - beginning to initialize >> >>> -------------------------------------------------------------------------- >> >>> WARNING: An invalid value was given for oob_tcp_if_include. This >> >>> value will be ignored. >> >>> >> >>> Local host: node1-128-01 >> >>> Value: "ib0" >> >>> Message: Invalid specification (missing "/") >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> WARNING: An invalid value was given for oob_tcp_if_include. This >> >>> value will be ignored. 
>> >>> >> >>> Local host: node1-128-02 >> >>> Value: "ib0" >> >>> Message: Invalid specification (missing "/") >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> None of the TCP networks specified to be included for out-of-band >> >>> communications >> >>> could be found: >> >>> >> >>> Value given: >> >>> >> >>> Please revise the specification and try again. >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> None of the TCP networks specified to be included for out-of-band >> >>> communications >> >>> could be found: >> >>> >> >>> Value given: >> >>> >> >>> Please revise the specification and try again. >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> No network interfaces were found for out-of-band communications. We >> >>> require >> >>> at least one available network for out-of-band messaging. >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> No network interfaces were found for out-of-band communications. We >> >>> require >> >>> at least one available network for out-of-band messaging. >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> It looks like orte_init failed for some reason; your parallel process is >> >>> likely to abort. There are many reasons that a parallel process can >> >>> fail during orte_init; some of which are due to configuration or >> >>> environment problems. This failure appears to be an internal failure; >> >>> here's some additional information (which may only be relevant to an >> >>> Open MPI developer): >> >>> >> >>> orte_oob_base_select failed >> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >> >>> -------------------------------------------------------------------------- >> >>> -------------------------------------------------------------------------- >> >>> It looks like orte_init failed for some reason; your parallel process is >> >>> likely to abort. There are many reasons that a parallel process can >> >>> fail during orte_init; some of which are due to configuration or >> >>> environment problems. This failure appears to be an internal failure; >> >>> here's some additional information (which may only be relevant to an >> >>> Open MPI developer): >> >>> >> >>> orte_oob_base_select failed >> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >> >>> -------------------------------------------------------------------------- >> >>> srun: error: node1-128-02: task 1: Exited with exit code 213 >> >>> srun: Terminating job step 657300.0 >> >>> srun: error: node1-128-01: task 0: Exited with exit code 213 >> >>> -------------------------------------------------------------------------- >> >>> An ORTE daemon has unexpectedly failed after launch and before >> >>> communicating back to mpirun. This could be caused by a number >> >>> of factors, including an inability to create a connection back >> >>> to mpirun due to a lack of common network interfaces and/or no >> >>> route found between them. 
Please check network connectivity >> >>> (including firewalls and network routing requirements). >> >>> -------------------------------------------------------------------------- >> >>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd >> >>> [compiler-2:08792] mca: base: close: component oob closed >> >>> [compiler-2:08792] mca: base: close: unloading component oob >> >>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN >> >>> [compiler-2:08792] mca: base: close: component tcp closed >> >>> [compiler-2:08792] mca: base: close: unloading component tcp >> >>> >> >>> >> >>> >> >>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov <tismagi...@mail.ru>: >> >>> Hello! >> >>> >> >>> I have Open MPI v1.8.2rc4r32485 >> >>> >> >>> When i run hello_c, I got this error message >> >>> $mpirun -np 2 hello_c >> >>> >> >>> An ORTE daemon has unexpectedly failed after launch and before >> >>> >> >>> communicating back to mpirun. This could be caused by a number >> >>> of factors, including an inability to create a connection back >> >>> to mpirun due to a lack of common network interfaces and/or no >> >>> route found between them. Please check network connectivity >> >>> (including firewalls and network routing requirements). >> >>> >> >>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca >> >>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this output: >> >>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose >> >>> 10 -mca rml_base_verbose 10 -np 2 hello_c >> >>> >> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated] >> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] >> >>> set priority to 0 >> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh] >> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set >> >>> priority to 10 >> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm] >> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set >> >>> priority to 75 >> >>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm] >> >>> [compiler-2:08780] mca: base: components_register: registering oob >> >>> components >> >>> [compiler-2:08780] mca: base: components_register: found loaded >> >>> component tcp >> >>> [compiler-2:08780] mca: base: components_register: component tcp >> >>> register function successful >> >>> [compiler-2:08780] mca: base: components_open: opening oob components >> >>> [compiler-2:08780] mca: base: components_open: found loaded component tcp >> >>> [compiler-2:08780] mca: base: components_open: component tcp open >> >>> function successful >> >>> [compiler-2:08780] mca:oob:select: checking available component tcp >> >>> [compiler-2:08780] mca:oob:select: Querying component [tcp] >> >>> [compiler-2:08780] oob:tcp: component_available called >> >>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >> >>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our >> >>> list of V4 connections >> >>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our >> >>> list of V4 connections >> >>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our >> >>> list of V4 connections >> >>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL 
INDEX 6 FAMILY: V4 >> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our >> >>> list of V4 connections >> >>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our >> >>> list of V4 connections >> >>> [compiler-2:08780] [[42202,0],0] TCP STARTUP >> >>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0 >> >>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 >> >>> [compiler-2:08780] mca:oob:select: Adding component to end >> >>> [compiler-2:08780] mca:oob:select: Found 1 active transports >> >>> [compiler-2:08780] mca: base: components_register: registering rml >> >>> components >> >>> [compiler-2:08780] mca: base: components_register: found loaded >> >>> component oob >> >>> [compiler-2:08780] mca: base: components_register: component oob has no >> >>> register or open function >> >>> [compiler-2:08780] mca: base: components_open: opening rml components >> >>> [compiler-2:08780] mca: base: components_open: found loaded component oob >> >>> [compiler-2:08780] mca: base: components_open: component oob open >> >>> function successful >> >>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for >> >>> peer 
[[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> [compiler-2:08780] [[42202,0],0] posting recv >> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for >> >>> peer [[WILDCARD],WILDCARD] >> >>> Daemon was launched on node1-130-08 - beginning to initialize >> >>> Daemon was launched on node1-130-03 - beginning to initialize >> >>> Daemon was launched on node1-130-05 - beginning to initialize >> >>> Daemon was launched on node1-130-02 - beginning to initialize >> >>> Daemon was launched on node1-130-01 - beginning to initialize >> >>> Daemon was launched on node1-130-04 - beginning to initialize >> >>> Daemon was launched on node1-130-07 - beginning to initialize >> >>> Daemon was launched on node1-130-06 - beginning to initialize >> >>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03 >> >>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02 >> >>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01 >> >>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05 >> >>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08 >> >>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07 >> >>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04 >> >>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for >> >>> commands! >> >>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06 >> >>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for >> >>> commands! >> >>> srun: error: node1-130-03: task 2: Exited with exit code 1 >> >>> srun: Terminating job step 657040.1 >> >>> srun: error: node1-130-02: task 1: Exited with exit code 1 >> >>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >> >>> WITH SIGNAL 9 *** >> >>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >> >>> WITH SIGNAL 9 *** >> >>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 >> >>> WITH SIGNAL 9 *** >> >>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish. >> >>> srun: error: node1-130-01: task 0: Exited with exit code 1 >> >>> srun: error: node1-130-05: task 4: Exited with exit code 1 >> >>> srun: error: node1-130-08: task 7: Exited with exit code 1 >> >>> srun: error: node1-130-07: task 6: Exited with exit code 1 >> >>> srun: error: node1-130-04: task 3: Killed >> >>> srun: error: node1-130-06: task 5: Killed >> >>> -------------------------------------------------------------------------- >> >>> An ORTE daemon has unexpectedly failed after launch and before >> >>> communicating back to mpirun. 
This could be caused by a number
>> >>> of factors, including an inability to create a connection back
>> >>> to mpirun due to a lack of common network interfaces and/or no
>> >>> route found between them. Please check network connectivity
>> >>> (including firewalls and network routing requirements).
>> >>> --------------------------------------------------------------------------
>> >>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
>> >>> [compiler-2:08780] mca: base: close: component oob closed
>> >>> [compiler-2:08780] mca: base: close: unloading component oob
>> >>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
>> >>> [compiler-2:08780] mca: base: close: component tcp closed
>> >>> [compiler-2:08780] mca: base: close: unloading component tcp
>> >>>
>> >>> _______________________________________________
>> >>> users mailing list
>> >>> us...@open-mpi.org
>> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24987.php
>> >>>
>> >>> _______________________________________________
>> >>> users mailing list
>> >>> us...@open-mpi.org
>> >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24988.php
>> >>
>> >> --
>> >> Jeff Squyres
>> >> jsquy...@cisco.com
>> >> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> > --
>> > Jeff Squyres
>> > jsquy...@cisco.com
>> > For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>> >
>> > _______________________________________________
>> > users mailing list
>> > us...@open-mpi.org
>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25001.php
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25086.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25093.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25094.php
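For anyone hitting the same "no common network interfaces and/or no route found" failure, a minimal diagnostic sketch follows. The commands are standard Linux and Open MPI 1.8-era tools; the node names are taken from the logs above, and the assumption (not stated by the thread) is that ib0 is the interface the out-of-band layer should be using:

# run on the mpirun node (compiler-2) and again on a compute node (e.g. node1-128-01)
$ ip -4 addr show                      # confirm ib0 is up and note its IPv4 subnet
$ ping -c 1 node1-128-01               # basic reachability from the mpirun node
$ ompi_info --param oob tcp --level 9  # inspect oob_tcp_if_include and related MCA parameters
$ iptables -S                          # look for firewall rules blocking arbitrary TCP ports (may need root)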