(Please keep the users list CC'ed.)

We talked about this on the weekly engineering call today. Ralph has an idea of what is happening -- I need to do a little investigation today and file a bug. I'll make sure you're CC'ed on the bug ticket.
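In the meantime, if the problem really is in the interface-name parsing, you may be able to work around it by giving oob_tcp_if_include the network in CIDR (a.b.c.d/x) form instead of the interface name -- that is the form the warning in your log is asking for. A rough sketch (the 10.128.0.0/16 value is only a guess based on the 10.128.0.4 address your log shows for ib0; substitute the real network/prefix of your IB interface):

    # interface-name form that the nightly snapshots currently reject
    mpirun --mca oob_tcp_if_include ib0 -np 2 hello_c

    # equivalent network specification in a.b.c.d/x form
    mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 2 hello_c

No guarantee that this sidesteps whatever regressed in the snapshots, but it's a quick thing to try while we sort out the bug.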
On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca oob_tcp_if_include ib0), but in all of the latest nightly snapshots I get this error.
>
> Tue, 12 Aug 2014 13:08:12 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
> Are you running any kind of firewall on the node where mpirun is invoked? Open MPI needs to be able to use arbitrary TCP ports between the servers on which it runs.
>
> This second mail seems to imply a bug in OMPI's oob_tcp_if_include param handling, however -- it's supposed to be able to handle an interface name (not just a network specification).
>
> Ralph -- can you have a look?
>
> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> > When I add --mca oob_tcp_if_include ib0 (the InfiniBand interface) to mpirun (as it was here: http://www.open-mpi.org/community/lists/users/2014/07/24857.php ), I get this output:
> >
> > [compiler-2:08792] mca:base:select:( plm) Querying component [isolated]
> > [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] set priority to 0
> > [compiler-2:08792] mca:base:select:( plm) Querying component [rsh]
> > [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [compiler-2:08792] mca:base:select:( plm) Querying component [slurm]
> > [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set priority to 75
> > [compiler-2:08792] mca:base:select:( plm) Selected component [slurm]
> > [compiler-2:08792] mca: base: components_register: registering oob components
> > [compiler-2:08792] mca: base: components_register: found loaded component tcp
> > [compiler-2:08792] mca: base: components_register: component tcp register function successful
> > [compiler-2:08792] mca: base: components_open: opening oob components
> > [compiler-2:08792] mca: base: components_open: found loaded component tcp
> > [compiler-2:08792] mca: base: components_open: component tcp open function successful
> > [compiler-2:08792] mca:oob:select: checking available component tcp
> > [compiler-2:08792] mca:oob:select: Querying component [tcp]
> > [compiler-2:08792] oob:tcp: component_available called
> > [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> > [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> > [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> > [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> > [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> > [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
> > [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> > [compiler-2:08792] [[42190,0],0] TCP STARTUP
> > [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0
> > [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883
> > [compiler-2:08792] mca:oob:select: Adding component to end
> > [compiler-2:08792] mca:oob:select: Found 1 active transports
> > [compiler-2:08792] mca: base: components_register: registering rml components
> > [compiler-2:08792] mca: base: components_register: found loaded component oob
> > [compiler-2:08792] mca: base: components_register: component oob has no register or open function
> > [compiler-2:08792] mca: base: components_open: opening rml components
> > [compiler-2:08792] mca: base: components_open: found loaded component oob
> > [compiler-2:08792] mca: base: components_open: component oob open function successful
> > [compiler-2:08792] orte_rml_base_select: initializing rml component oob
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08792] [[42190,0],0] posting recv
> > [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> > Daemon was launched on node1-128-01 - beginning to initialize
> > Daemon was launched on node1-128-02 - beginning to initialize
> > --------------------------------------------------------------------------
> > WARNING: An invalid value was given for oob_tcp_if_include. This
> > value will be ignored.
> >
> > Local host: node1-128-01
> > Value: "ib0"
> > Message: Invalid specification (missing "/")
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > WARNING: An invalid value was given for oob_tcp_if_include. This
> > value will be ignored.
> >
> > Local host: node1-128-02
> > Value: "ib0"
> > Message: Invalid specification (missing "/")
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > None of the TCP networks specified to be included for out-of-band communications
> > could be found:
> >
> > Value given:
> >
> > Please revise the specification and try again.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > None of the TCP networks specified to be included for out-of-band communications
> > could be found:
> >
> > Value given:
> >
> > Please revise the specification and try again.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > No network interfaces were found for out-of-band communications. We require
> > at least one available network for out-of-band messaging.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > No network interfaces were found for out-of-band communications. We require
> > at least one available network for out-of-band messaging.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > orte_oob_base_select failed
> > --> Returned value (null) (-43) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems. This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> > orte_oob_base_select failed
> > --> Returned value (null) (-43) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > srun: error: node1-128-02: task 1: Exited with exit code 213
> > srun: Terminating job step 657300.0
> > srun: error: node1-128-01: task 0: Exited with exit code 213
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> > [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd
> > [compiler-2:08792] mca: base: close: component oob closed
> > [compiler-2:08792] mca: base: close: unloading component oob
> > [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN
> > [compiler-2:08792] mca: base: close: component tcp closed
> > [compiler-2:08792] mca: base: close: unloading component tcp
> >
> > Tue, 12 Aug 2014 16:14:58 +0400 from Timur Ismagilov <tismagi...@mail.ru>:
> > Hello!
> >
> > I have Open MPI v1.8.2rc4r32485.
> >
> > When I run hello_c, I get this error message:
> > $ mpirun -np 2 hello_c
> >
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> >
> > When I run with --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10, I get this output:
> > $ mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
> >
> > [compiler-2:08780] mca:base:select:( plm) Querying component [isolated]
> > [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] set priority to 0
> > [compiler-2:08780] mca:base:select:( plm) Querying component [rsh]
> > [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [compiler-2:08780] mca:base:select:( plm) Querying component [slurm]
> > [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set priority to 75
> > [compiler-2:08780] mca:base:select:( plm) Selected component [slurm]
> > [compiler-2:08780] mca: base: components_register: registering oob components
> > [compiler-2:08780] mca: base: components_register: found loaded component tcp
> > [compiler-2:08780] mca: base: components_register: component tcp register function successful
> > [compiler-2:08780] mca: base: components_open: opening oob components
> > [compiler-2:08780] mca: base: components_open: found loaded component tcp
> > [compiler-2:08780] mca: base: components_open: component tcp open function successful
> > [compiler-2:08780] mca:oob:select: checking available component tcp
> > [compiler-2:08780] mca:oob:select: Querying component [tcp]
> > [compiler-2:08780] oob:tcp: component_available called
> > [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> > [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> > [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our list of V4 connections
> > [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> > [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 connections
> > [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> > [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our list of V4 connections
> > [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> > [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
> > [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> > [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our list of V4 connections
> > [compiler-2:08780] [[42202,0],0] TCP STARTUP
> > [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0
> > [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420
> > [compiler-2:08780] mca:oob:select: Adding component to end
> > [compiler-2:08780] mca:oob:select: Found 1 active transports
> > [compiler-2:08780] mca: base: components_register: registering rml components
> > [compiler-2:08780] mca: base: components_register: found loaded component oob
> > [compiler-2:08780] mca: base: components_register: component oob has no register or open function
> > [compiler-2:08780] mca: base: components_open: opening rml components
> > [compiler-2:08780] mca: base: components_open: found loaded component oob
> > [compiler-2:08780] mca: base: components_open: component oob open function successful
> > [compiler-2:08780] orte_rml_base_select: initializing rml component oob
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> > [compiler-2:08780] [[42202,0],0] posting recv
> > [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> > Daemon was launched on node1-130-08 - beginning to initialize
> > Daemon was launched on node1-130-03 - beginning to initialize
> > Daemon was launched on node1-130-05 - beginning to initialize
> > Daemon was launched on node1-130-02 - beginning to initialize
> > Daemon was launched on node1-130-01 - beginning to initialize
> > Daemon was launched on node1-130-04 - beginning to initialize
> > Daemon was launched on node1-130-07 - beginning to initialize
> > Daemon was launched on node1-130-06 - beginning to initialize
> > Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03
> > [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for commands!
> > Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02
> > [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for commands!
> > Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01
> > [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for commands!
> > Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05
> > [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for commands!
> > Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08
> > [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for commands!
> > Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07
> > [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for commands!
> > Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04
> > [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for commands!
> > Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06
> > [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for commands!
> > srun: error: node1-130-03: task 2: Exited with exit code 1
> > srun: Terminating job step 657040.1
> > srun: error: node1-130-02: task 1: Exited with exit code 1
> > slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> > slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> > slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
> > srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
> > srun: error: node1-130-01: task 0: Exited with exit code 1
> > srun: error: node1-130-05: task 4: Exited with exit code 1
> > srun: error: node1-130-08: task 7: Exited with exit code 1
> > srun: error: node1-130-07: task 6: Exited with exit code 1
> > srun: error: node1-130-04: task 3: Killed
> > srun: error: node1-130-06: task 5: Killed
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> > [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
> > [compiler-2:08780] mca: base: close: component oob closed
> > [compiler-2:08780] mca: base: close: unloading component oob
> > [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
> > [compiler-2:08780] mca: base: close: component tcp closed
> > [compiler-2:08780] mca: base: close: unloading component tcp
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24987.php
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/24988.php
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/