When I try to specify the OOB interface with --mca oob_tcp_if_include <one of the interfaces from ifconfig>, I always get this error:

$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
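Judging from the "Invalid specification (missing "/")" warning that shows up in the daemon logs further down in this thread, the nightly snapshot seems to be parsing oob_tcp_if_include as a subnet specification rather than an interface name. So one thing I can try (only a guess on my side, and it may not avoid the underlying bug) is to pass the ib0 subnet in CIDR form instead of the interface name:

# ifconfig truncates InfiniBand hardware addresses, so confirm the ib0 address/prefix with ip
$ ip -4 addr show ib0
# expected (based on the ifconfig dump below): inet 10.128.0.4/16 ...

# pass the ib0 subnet (10.128.0.0/16) instead of the interface name
$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c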
Earlier, with ompi 1.8.1, I could not run MPI jobs without "--mca oob_tcp_if_include ib0"... but now (ompi 1.9a1) with this flag I get the above error. Here is the output of ifconfig:

$ ifconfig
eth1      Link encap:Ethernet  HWaddr 00:15:17:EE:89:E1
          inet addr:10.0.251.53  Bcast:10.0.251.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:26925754883 (25.0 GiB)  TX bytes:137971 (134.7 KiB)
          Memory:b2c00000-b2c20000

eth2      Link encap:Ethernet  HWaddr 00:02:C9:04:73:F8
          inet addr:10.0.0.4  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1823986502132 (1.6 TiB)  TX bytes:11957754120037 (10.8 TiB)

eth2.911  Link encap:Ethernet  HWaddr 00:02:C9:04:73:F8
          inet addr:93.180.7.38  Bcast:93.180.7.63  Mask:255.255.255.224
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:285174723322 (265.5 GiB)  TX bytes:11523163526058 (10.4 TiB)

eth3      Link encap:Ethernet  HWaddr 00:02:C9:04:73:F9
          inet addr:10.2.251.14  Bcast:10.2.251.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
          TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:324195989293 (301.9 GiB)  TX bytes:770299202886 (717.3 GiB)

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.

ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.128.0.4  Bcast:10.128.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
          collisions:0 txqueuelen:1024
          RX bytes:939249464 (895.7 MiB)  TX bytes:886054008 (845.0 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:132750916041 (123.6 GiB)  TX bytes:132750916041 (123.6 GiB)

Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain <r...@open-mpi.org>:
>I think something may be messed up with your installation. I went ahead and
>tested this on a Slurm 2.5.4 cluster, and got the following:
>
>$ time mpirun -np 1 --host bend001 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.086s
>user 0m0.039s
>sys 0m0.046s
>
>$ time mpirun -np 1 --host bend002 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.528s
>user 0m0.021s
>sys 0m0.023s
>
>Which is what I would have expected. With --host set to the local host, no
>daemons are being launched and so the time is quite short (just spent mapping
>and fork/exec). With --host set to a single remote host, you have the time it
>takes Slurm to launch our daemon on the remote host, so you get about half of
>a second.
> >IIRC, you were having some problems with the OOB setup. If you specify the TCP >interface to use, does your time come down? > > >On Aug 26, 2014, at 8:32 AM, Timur Ismagilov < tismagi...@mail.ru > wrote: >>I'm using slurm 2.5.6 >> >>$salloc -N8 --exclusive -J ompi -p test >>$ srun hostname >>node1-128-21 >>node1-128-24 >>node1-128-22 >>node1-128-26 >>node1-128-27 >>node1-128-20 >>node1-128-25 >>node1-128-23 >>$ time mpirun -np 1 --host node1-128-21 ./hello_c >>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI >>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug >>21, 2014 (nightly snapshot tarball), 146) >>real 1m3.932s >>user 0m0.035s >>sys 0m0.072s >> >> >>Tue, 26 Aug 2014 07:03:58 -0700 от Ralph Castain < r...@open-mpi.org >: >>>hmmm....what is your allocation like? do you have a large hostfile, for >>>example? >>> >>>if you add a --host argument that contains just the local host, what is the >>>time for that scenario? >>> >>>On Aug 26, 2014, at 6:27 AM, Timur Ismagilov < tismagi...@mail.ru > wrote: >>>>Hello! >>>>Here is my time results: >>>>$time mpirun -n 1 ./hello_c >>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI >>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug >>>>21, 2014 (nightly snapshot tarball), 146) >>>>real 1m3.985s >>>>user 0m0.031s >>>>sys 0m0.083s >>>> >>>> >>>>Fri, 22 Aug 2014 07:43:03 -0700 от Ralph Castain < r...@open-mpi.org >: >>>>>I'm also puzzled by your timing statement - I can't replicate it: >>>>> >>>>>07:41:43 $ time mpirun -n 1 ./hello_c >>>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 >>>>>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer >>>>>copy, 125) >>>>> >>>>>real 0m0.547s >>>>>user 0m0.043s >>>>>sys 0m0.046s >>>>> >>>>>The entire thing ran in 0.5 seconds >>>>> >>>>> >>>>>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > >>>>>wrote: >>>>>>Hi, >>>>>>The default delimiter is ";" . You can change delimiter with >>>>>>mca_base_env_list_delimiter. >>>>>> >>>>>> >>>>>> >>>>>>On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov < tismagi...@mail.ru > >>>>>> wrote: >>>>>>>Hello! >>>>>>>If i use latest night snapshot: >>>>>>>$ ompi_info -V >>>>>>>Open MPI v1.9a1r32570 >>>>>>>* In programm hello_c initialization takes ~1 min >>>>>>>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less) >>>>>>>* if i use >>>>>>>$mpirun --mca mca_base_env_list >>>>>>>'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 >>>>>>>./hello_c >>>>>>>i got error >>>>>>>config_parser.c:657 MXM ERROR Invalid value for SHM_KCOPY_MODE: >>>>>>>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect] >>>>>>>but with -x all works fine (but with warn) >>>>>>>$mpirun -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c >>>>>>>WARNING: The mechanism by which environment variables are explicitly >>>>>>>.............. >>>>>>>.............. >>>>>>>.............. >>>>>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI >>>>>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, >>>>>>>Aug 21, 2014 (nightly snapshot tarball), 146) >>>>>>>Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain < r...@open-mpi.org >: >>>>>>>>Not sure I understand. The problem has been fixed in both the trunk and >>>>>>>>the 1.8 branch now, so you should be able to work with either of those >>>>>>>>nightly builds. 
>>>>>>>> >>>>>>>>On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > >>>>>>>>wrote: >>>>>>>>>Have i I any opportunity to run mpi jobs? >>>>>>>>> >>>>>>>>> >>>>>>>>>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >: >>>>>>>>>>yes, i know - it is cmr'd >>>>>>>>>> >>>>>>>>>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il >>>>>>>>>>> wrote: >>>>>>>>>>>btw, we get same error in v1.8 branch as well. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain < r...@open-mpi.org >>>>>>>>>>>> wrote: >>>>>>>>>>>>It was not yet fixed - but should be now. >>>>>>>>>>>> >>>>>>>>>>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > >>>>>>>>>>>>wrote: >>>>>>>>>>>>>Hello! >>>>>>>>>>>>> >>>>>>>>>>>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i >>>>>>>>>>>>>still have the problem >>>>>>>>>>>>> >>>>>>>>>>>>>a) >>>>>>>>>>>>>$ mpirun -np 1 ./hello_c >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>of factors, including an inability to create a connection back >>>>>>>>>>>>>to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>route found between them. Please check network connectivity >>>>>>>>>>>>>(including firewalls and network routing requirements). >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>b) >>>>>>>>>>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>of factors, including an inability to create a connection back >>>>>>>>>>>>>to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>route found between them. Please check network connectivity >>>>>>>>>>>>>(including firewalls and network routing requirements). 
>>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>> >>>>>>>>>>>>>c) >>>>>>>>>>>>> >>>>>>>>>>>>>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca >>>>>>>>>>>>>plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose >>>>>>>>>>>>>10 -np 1 ./hello_c >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component >>>>>>>>>>>>>[isolated] >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component >>>>>>>>>>>>>[isolated] set priority to 0 >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh] >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] >>>>>>>>>>>>>set priority to 10 >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component >>>>>>>>>>>>>[slurm] >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component >>>>>>>>>>>>>[slurm] set priority to 75 >>>>>>>>>>>>>[compiler-2:14673] mca:base:select:( plm) Selected component >>>>>>>>>>>>>[slurm] >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: registering oob >>>>>>>>>>>>>components >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: found loaded >>>>>>>>>>>>>component tcp >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: component tcp >>>>>>>>>>>>>register function successful >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: opening oob >>>>>>>>>>>>>components >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: found loaded >>>>>>>>>>>>>component tcp >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: component tcp open >>>>>>>>>>>>>function successful >>>>>>>>>>>>>[compiler-2:14673] mca:oob:select: checking available component tcp >>>>>>>>>>>>>[compiler-2:14673] mca:oob:select: Querying component [tcp] >>>>>>>>>>>>>[compiler-2:14673] oob:tcp: component_available called >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to >>>>>>>>>>>>>our list of V4 connections >>>>>>>>>>>>>[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] TCP STARTUP >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0 >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460 >>>>>>>>>>>>>[compiler-2:14673] mca:oob:select: Adding component to end >>>>>>>>>>>>>[compiler-2:14673] mca:oob:select: Found 1 active transports >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: registering rml >>>>>>>>>>>>>components >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: found loaded >>>>>>>>>>>>>component oob >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_register: component oob >>>>>>>>>>>>>has no register or open function >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: opening rml >>>>>>>>>>>>>components >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: found loaded >>>>>>>>>>>>>component oob >>>>>>>>>>>>>[compiler-2:14673] mca: base: components_open: component oob open >>>>>>>>>>>>>function successful >>>>>>>>>>>>>[compiler-2:14673] orte_rml_base_select: initializing rml 
>>>>>>>>>>>>>component oob >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting recv >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 >>>>>>>>>>>>>for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>Daemon was launched on node1-128-01 - beginning to initialize >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>WARNING: An invalid value was given for oob_tcp_if_include. This >>>>>>>>>>>>>value will be ignored. 
>>>>>>>>>>>>>Local host: node1-128-01 >>>>>>>>>>>>>Value: "ib0" >>>>>>>>>>>>>Message: Invalid specification (missing "/") >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>None of the TCP networks specified to be included for out-of-band >>>>>>>>>>>>>communications >>>>>>>>>>>>>could be found: >>>>>>>>>>>>>Value given: >>>>>>>>>>>>>Please revise the specification and try again. >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>No network interfaces were found for out-of-band communications. >>>>>>>>>>>>>We require >>>>>>>>>>>>>at least one available network for out-of-band messaging. >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>It looks like orte_init failed for some reason; your parallel >>>>>>>>>>>>>process is >>>>>>>>>>>>>likely to abort. There are many reasons that a parallel process can >>>>>>>>>>>>>fail during orte_init; some of which are due to configuration or >>>>>>>>>>>>>environment problems. This failure appears to be an internal >>>>>>>>>>>>>failure; >>>>>>>>>>>>>here's some additional information (which may only be relevant to >>>>>>>>>>>>>an >>>>>>>>>>>>>Open MPI developer): >>>>>>>>>>>>>orte_oob_base_select failed >>>>>>>>>>>>>--> Returned value (null) (-43) instead of ORTE_SUCCESS >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>srun: error: node1-128-01: task 0: Exited with exit code 213 >>>>>>>>>>>>>srun: Terminating job step 661215.0 >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>of factors, including an inability to create a connection back >>>>>>>>>>>>>to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>route found between them. Please check network connectivity >>>>>>>>>>>>>(including firewalls and network routing requirements). >>>>>>>>>>>>>-------------------------------------------------------------------------- >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd >>>>>>>>>>>>>[compiler-2:14673] mca: base: close: component oob closed >>>>>>>>>>>>>[compiler-2:14673] mca: base: close: unloading component oob >>>>>>>>>>>>>[compiler-2:14673] [[49095,0],0] TCP SHUTDOWN >>>>>>>>>>>>>[compiler-2:14673] mca: base: close: component tcp closed >>>>>>>>>>>>>[compiler-2:14673] mca: base: close: unloading component tcp >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>Tue, 12 Aug 2014 18:33:24 +0000 от "Jeff Squyres (jsquyres)" < >>>>>>>>>>>>>jsquy...@cisco.com >: >>>>>>>>>>>>>>I filed the following ticket: >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/4857 >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) < >>>>>>>>>>>>>>jsquy...@cisco.com > wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> (please keep the users list CC'ed) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We talked about this on the weekly engineering call today. 
>>>>>>>>>>>>>>> Ralph has an idea what is happening -- I need to do a little >>>>>>>>>>>>>>> investigation today and file a bug. I'll make sure you're CC'ed >>>>>>>>>>>>>>> on the bug ticket. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Aug 12, 2014, at 12:27 PM, Timur Ismagilov < >>>>>>>>>>>>>>> tismagi...@mail.ru > wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 >>>>>>>>>>>>>>>> (with --mca oob_tcp_if_include ib0), but in all latest night >>>>>>>>>>>>>>>> snapshots i got this error. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Tue, 12 Aug 2014 13:08:12 +0000 от "Jeff Squyres (jsquyres)" < >>>>>>>>>>>>>>>> jsquy...@cisco.com >: >>>>>>>>>>>>>>>> Are you running any kind of firewall on the node where mpirun >>>>>>>>>>>>>>>> is invoked? Open MPI needs to be able to use arbitrary TCP >>>>>>>>>>>>>>>> ports between the servers on which it runs. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This second mail seems to imply a bug in OMPI's >>>>>>>>>>>>>>>> oob_tcp_if_include param handling, however -- it's supposed to >>>>>>>>>>>>>>>> be able to handle an interface name (not just a network >>>>>>>>>>>>>>>> specification). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Ralph -- can you have a look? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov < >>>>>>>>>>>>>>>> tismagi...@mail.ru > wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When i add --mca oob_tcp_if_include ib0 (infiniband >>>>>>>>>>>>>>>>> interface) to mpirun (as it was here: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php >>>>>>>>>>>>>>>>> ) i got this output: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [isolated] >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [isolated] set priority to 0 >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [rsh] >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [rsh] set priority to 10 >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [slurm] >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [slurm] set priority to 75 >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:base:select:( plm) Selected component >>>>>>>>>>>>>>>>> [slurm] >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: >>>>>>>>>>>>>>>>> registering oob components >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: found >>>>>>>>>>>>>>>>> loaded component tcp >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: component >>>>>>>>>>>>>>>>> tcp register function successful >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: opening oob >>>>>>>>>>>>>>>>> components >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded >>>>>>>>>>>>>>>>> component tcp >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: component tcp >>>>>>>>>>>>>>>>> open function successful >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:oob:select: checking available >>>>>>>>>>>>>>>>> component tcp >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:oob:select: Querying component [tcp] >>>>>>>>>>>>>>>>> [compiler-2:08792] oob:tcp: component_available called >>>>>>>>>>>>>>>>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> 
[compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding >>>>>>>>>>>>>>>>> 10.128.0.4 to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] TCP STARTUP >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 >>>>>>>>>>>>>>>>> port 0 >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883 >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:oob:select: Adding component to end >>>>>>>>>>>>>>>>> [compiler-2:08792] mca:oob:select: Found 1 active transports >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: >>>>>>>>>>>>>>>>> registering rml components >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: found >>>>>>>>>>>>>>>>> loaded component oob >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_register: component >>>>>>>>>>>>>>>>> oob has no register or open function >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: opening rml >>>>>>>>>>>>>>>>> components >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: found loaded >>>>>>>>>>>>>>>>> component oob >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: components_open: component oob >>>>>>>>>>>>>>>>> open function successful >>>>>>>>>>>>>>>>> [compiler-2:08792] orte_rml_base_select: initializing rml >>>>>>>>>>>>>>>>> component oob >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 30 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 15 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 32 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 33 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 5 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 10 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 12 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 9 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on 
>>>>>>>>>>>>>>>>> tag 34 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 2 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 21 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 22 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 45 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 46 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 1 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 27 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> Daemon was launched on node1-128-01 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-128-02 - beginning to initialize >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. >>>>>>>>>>>>>>>>> This >>>>>>>>>>>>>>>>> value will be ignored. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Local host: node1-128-01 >>>>>>>>>>>>>>>>> Value: "ib0" >>>>>>>>>>>>>>>>> Message: Invalid specification (missing "/") >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. >>>>>>>>>>>>>>>>> This >>>>>>>>>>>>>>>>> value will be ignored. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Local host: node1-128-02 >>>>>>>>>>>>>>>>> Value: "ib0" >>>>>>>>>>>>>>>>> Message: Invalid specification (missing "/") >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> None of the TCP networks specified to be included for >>>>>>>>>>>>>>>>> out-of-band communications >>>>>>>>>>>>>>>>> could be found: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Value given: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Please revise the specification and try again. >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> None of the TCP networks specified to be included for >>>>>>>>>>>>>>>>> out-of-band communications >>>>>>>>>>>>>>>>> could be found: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Value given: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Please revise the specification and try again. 
>>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> No network interfaces were found for out-of-band >>>>>>>>>>>>>>>>> communications. We require >>>>>>>>>>>>>>>>> at least one available network for out-of-band messaging. >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> No network interfaces were found for out-of-band >>>>>>>>>>>>>>>>> communications. We require >>>>>>>>>>>>>>>>> at least one available network for out-of-band messaging. >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> It looks like orte_init failed for some reason; your parallel >>>>>>>>>>>>>>>>> process is >>>>>>>>>>>>>>>>> likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>> process can >>>>>>>>>>>>>>>>> fail during orte_init; some of which are due to configuration >>>>>>>>>>>>>>>>> or >>>>>>>>>>>>>>>>> environment problems. This failure appears to be an internal >>>>>>>>>>>>>>>>> failure; >>>>>>>>>>>>>>>>> here's some additional information (which may only be >>>>>>>>>>>>>>>>> relevant to an >>>>>>>>>>>>>>>>> Open MPI developer): >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> orte_oob_base_select failed >>>>>>>>>>>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> It looks like orte_init failed for some reason; your parallel >>>>>>>>>>>>>>>>> process is >>>>>>>>>>>>>>>>> likely to abort. There are many reasons that a parallel >>>>>>>>>>>>>>>>> process can >>>>>>>>>>>>>>>>> fail during orte_init; some of which are due to configuration >>>>>>>>>>>>>>>>> or >>>>>>>>>>>>>>>>> environment problems. This failure appears to be an internal >>>>>>>>>>>>>>>>> failure; >>>>>>>>>>>>>>>>> here's some additional information (which may only be >>>>>>>>>>>>>>>>> relevant to an >>>>>>>>>>>>>>>>> Open MPI developer): >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> orte_oob_base_select failed >>>>>>>>>>>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> srun: error: node1-128-02: task 1: Exited with exit code 213 >>>>>>>>>>>>>>>>> srun: Terminating job step 657300.0 >>>>>>>>>>>>>>>>> srun: error: node1-128-01: task 0: Exited with exit code 213 >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>>>>> of factors, including an inability to create a connection back >>>>>>>>>>>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>>>>> route found between them. Please check network connectivity >>>>>>>>>>>>>>>>> (including firewalls and network routing requirements). 
>>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm >>>>>>>>>>>>>>>>> cmd >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: close: component oob closed >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: close: unloading component oob >>>>>>>>>>>>>>>>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: close: component tcp closed >>>>>>>>>>>>>>>>> [compiler-2:08792] mca: base: close: unloading component tcp >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Tue, 12 Aug 2014 16:14:58 +0400 от Timur Ismagilov < >>>>>>>>>>>>>>>>> tismagi...@mail.ru >: >>>>>>>>>>>>>>>>> Hello! >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I have Open MPI v1.8.2rc4r32485 >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When i run hello_c, I got this error message >>>>>>>>>>>>>>>>> $mpirun -np 2 hello_c >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>>>>> of factors, including an inability to create a connection back >>>>>>>>>>>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>>>>> route found between them. Please check network connectivity >>>>>>>>>>>>>>>>> (including firewalls and network routing requirements). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> When i run with --debug-daemons --mca plm_base_verbose 5 -mca >>>>>>>>>>>>>>>>> oob_base_verbose 10 -mca rml_base_verbose 10 i got this >>>>>>>>>>>>>>>>> output: >>>>>>>>>>>>>>>>> $mpirun --debug-daemons --mca plm_base_verbose 5 -mca >>>>>>>>>>>>>>>>> oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [isolated] >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [isolated] set priority to 0 >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [rsh] >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [rsh] set priority to 10 >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Querying component >>>>>>>>>>>>>>>>> [slurm] >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Query of component >>>>>>>>>>>>>>>>> [slurm] set priority to 75 >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:base:select:( plm) Selected component >>>>>>>>>>>>>>>>> [slurm] >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: >>>>>>>>>>>>>>>>> registering oob components >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: found >>>>>>>>>>>>>>>>> loaded component tcp >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: component >>>>>>>>>>>>>>>>> tcp register function successful >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: opening oob >>>>>>>>>>>>>>>>> components >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded >>>>>>>>>>>>>>>>> component tcp >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: component tcp >>>>>>>>>>>>>>>>> open function successful >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:oob:select: checking available >>>>>>>>>>>>>>>>> component tcp >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:oob:select: Querying component [tcp] >>>>>>>>>>>>>>>>> [compiler-2:08780] oob:tcp: 
component_available called >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding >>>>>>>>>>>>>>>>> 10.0.251.53 to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 >>>>>>>>>>>>>>>>> to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding >>>>>>>>>>>>>>>>> 10.2.251.14 to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding >>>>>>>>>>>>>>>>> 10.128.0.4 to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: >>>>>>>>>>>>>>>>> V4 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding >>>>>>>>>>>>>>>>> 93.180.7.38 to our list of V4 connections >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] TCP STARTUP >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 >>>>>>>>>>>>>>>>> port 0 >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420 >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:oob:select: Adding component to end >>>>>>>>>>>>>>>>> [compiler-2:08780] mca:oob:select: Found 1 active transports >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: >>>>>>>>>>>>>>>>> registering rml components >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: found >>>>>>>>>>>>>>>>> loaded component oob >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_register: component >>>>>>>>>>>>>>>>> oob has no register or open function >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: opening rml >>>>>>>>>>>>>>>>> components >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: found loaded >>>>>>>>>>>>>>>>> component oob >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: components_open: component oob >>>>>>>>>>>>>>>>> open function successful >>>>>>>>>>>>>>>>> [compiler-2:08780] orte_rml_base_select: initializing rml >>>>>>>>>>>>>>>>> component oob >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 30 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 15 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 32 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 33 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 5 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv 
>>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 10 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 12 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 9 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 34 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 2 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 21 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 22 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 45 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 46 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 1 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting recv >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] posting persistent recv on >>>>>>>>>>>>>>>>> tag 27 for peer [[WILDCARD],WILDCARD] >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-08 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-03 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-05 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-02 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-01 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-04 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-07 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon was launched on node1-130-06 - beginning to initialize >>>>>>>>>>>>>>>>> Daemon [[42202,0],3] checking in as pid 7178 on host >>>>>>>>>>>>>>>>> node1-130-03 >>>>>>>>>>>>>>>>> [node1-130-03:07178] [[42202,0],3] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],2] checking in as pid 13581 on host >>>>>>>>>>>>>>>>> node1-130-02 >>>>>>>>>>>>>>>>> [node1-130-02:13581] [[42202,0],2] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],1] checking in as pid 17220 on host >>>>>>>>>>>>>>>>> node1-130-01 >>>>>>>>>>>>>>>>> [node1-130-01:17220] [[42202,0],1] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! 
>>>>>>>>>>>>>>>>> Daemon [[42202,0],5] checking in as pid 6663 on host >>>>>>>>>>>>>>>>> node1-130-05 >>>>>>>>>>>>>>>>> [node1-130-05:06663] [[42202,0],5] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],8] checking in as pid 6683 on host >>>>>>>>>>>>>>>>> node1-130-08 >>>>>>>>>>>>>>>>> [node1-130-08:06683] [[42202,0],8] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],7] checking in as pid 7877 on host >>>>>>>>>>>>>>>>> node1-130-07 >>>>>>>>>>>>>>>>> [node1-130-07:07877] [[42202,0],7] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],4] checking in as pid 7735 on host >>>>>>>>>>>>>>>>> node1-130-04 >>>>>>>>>>>>>>>>> [node1-130-04:07735] [[42202,0],4] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> Daemon [[42202,0],6] checking in as pid 8451 on host >>>>>>>>>>>>>>>>> node1-130-06 >>>>>>>>>>>>>>>>> [node1-130-06:08451] [[42202,0],6] orted: up and running - >>>>>>>>>>>>>>>>> waiting for commands! >>>>>>>>>>>>>>>>> srun: error: node1-130-03: task 2: Exited with exit code 1 >>>>>>>>>>>>>>>>> srun: Terminating job step 657040.1 >>>>>>>>>>>>>>>>> srun: error: node1-130-02: task 1: Exited with exit code 1 >>>>>>>>>>>>>>>>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT >>>>>>>>>>>>>>>>> 2014-08-12T12:59:07 WITH SIGNAL 9 *** >>>>>>>>>>>>>>>>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT >>>>>>>>>>>>>>>>> 2014-08-12T12:59:07 WITH SIGNAL 9 *** >>>>>>>>>>>>>>>>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT >>>>>>>>>>>>>>>>> 2014-08-12T12:59:07 WITH SIGNAL 9 *** >>>>>>>>>>>>>>>>> srun: Job step aborted: Waiting up to 2 seconds for job step >>>>>>>>>>>>>>>>> to finish. >>>>>>>>>>>>>>>>> srun: error: node1-130-01: task 0: Exited with exit code 1 >>>>>>>>>>>>>>>>> srun: error: node1-130-05: task 4: Exited with exit code 1 >>>>>>>>>>>>>>>>> srun: error: node1-130-08: task 7: Exited with exit code 1 >>>>>>>>>>>>>>>>> srun: error: node1-130-07: task 6: Exited with exit code 1 >>>>>>>>>>>>>>>>> srun: error: node1-130-04: task 3: Killed >>>>>>>>>>>>>>>>> srun: error: node1-130-06: task 5: Killed >>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> An ORTE daemon has unexpectedly failed after launch and before >>>>>>>>>>>>>>>>> communicating back to mpirun. This could be caused by a number >>>>>>>>>>>>>>>>> of factors, including an inability to create a connection back >>>>>>>>>>>>>>>>> to mpirun due to a lack of common network interfaces and/or no >>>>>>>>>>>>>>>>> route found between them. Please check network connectivity >>>>>>>>>>>>>>>>> (including firewalls and network routing requirements). 
>>>>>>>>>>>>>>>>> -------------------------------------------------------------------------- >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm >>>>>>>>>>>>>>>>> cmd >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: close: component oob closed >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: close: unloading component oob >>>>>>>>>>>>>>>>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: close: component tcp closed >>>>>>>>>>>>>>>>> [compiler-2:08780] mca: base: close: unloading component tcp >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/08/24987.php >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/08/24988.php >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Jeff Squyres >>>>>>>>>>>>>>>> jsquy...@cisco.com >>>>>>>>>>>>>>>> For corporate legal information go to: >>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Jeff Squyres >>>>>>>>>>>>>>> jsquy...@cisco.com >>>>>>>>>>>>>>> For corporate legal information go to: >>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> users mailing list >>>>>>>>>>>>>>> us...@open-mpi.org >>>>>>>>>>>>>>> Subscription: >>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>>> Link to this post: >>>>>>>>>>>>>>> http://www.open-mpi.org/community/lists/users/2014/08/25001.php >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>-- >>>>>>>>>>>>>>Jeff Squyres >>>>>>>>>>>>>>jsquy...@cisco.com >>>>>>>>>>>>>>For corporate legal information go to: >>>>>>>>>>>>>>http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>>users mailing list >>>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>>Link to this post: >>>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25086.php >>>>>>>>>>>> >>>>>>>>>>>>_______________________________________________ >>>>>>>>>>>>users mailing list >>>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>>Link to this post: >>>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25093.php >>>>>>>>>>> >>>>>>>>>>>_______________________________________________ >>>>>>>>>>>users mailing list >>>>>>>>>>>us...@open-mpi.org >>>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>>Link to this post: >>>>>>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25094.php 
>>>>>>>>>>_______________________________________________ >>>>>>>>>>users mailing list >>>>>>>>>>us...@open-mpi.org >>>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>>Link to this post: >>>>>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25095.php >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>_______________________________________________ >>>>>>>>>users mailing list >>>>>>>>>us...@open-mpi.org >>>>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>>>Link to this post: >>>>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25105.php >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>_______________________________________________ >>>>>>>users mailing list >>>>>>>us...@open-mpi.org >>>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>>Link to this post: >>>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25127.php >>>>>> >>>>>> >>>>>> >>>>>>-- >>>>>> >>>>>>Kind Regards, >>>>>> >>>>>>M. _______________________________________________ >>>>>>users mailing list >>>>>>us...@open-mpi.org >>>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>>Link to this post: >>>>>>http://www.open-mpi.org/community/lists/users/2014/08/25128.php >>>>>_______________________________________________ >>>>>users mailing list >>>>>us...@open-mpi.org >>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>Link to this post: >>>>>http://www.open-mpi.org/community/lists/users/2014/08/25129.php >>>> >>>> >>>> >> >> >> >> >>---------------------------------------------------------------------- >> >> >>_______________________________________________ >>users mailing list >>us...@open-mpi.org >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>Link to this post: >>http://www.open-mpi.org/community/lists/users/2014/08/25154.php