How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" to your cmd line
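[Note: combined with the failing command from the report quoted below, the suggested invocation would look something like this (a sketch; the flags are exactly the ones Ralph names above). --leave-session-attached keeps the remote daemons' stderr connected to mpirun, so the oob verbosity from the orteds actually reaches your terminal:

  $ mpirun --leave-session-attached -mca oob_base_verbose 100 \
        --mca oob_tcp_if_include ib0 -np 1 ./hello_c
]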
On Aug 27, 2014, at 4:31 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> When I try to specify the OOB interface with --mca oob_tcp_if_include <one of the interfaces
> from ifconfig>, I always get this error:
>
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca
> oob_tcp_if_include ib0"... but now (OMPI 1.9a1) this flag itself produces the
> above error.
>
> Here is the output of ifconfig:
>
> $ ifconfig
> eth1      Link encap:Ethernet  HWaddr 00:15:17:EE:89:E1
>           inet addr:10.0.251.53  Bcast:10.0.251.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:26925754883 (25.0 GiB)  TX bytes:137971 (134.7 KiB)
>           Memory:b2c00000-b2c20000
>
> eth2      Link encap:Ethernet  HWaddr 00:02:C9:04:73:F8
>           inet addr:10.0.0.4  Bcast:10.0.0.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1823986502132 (1.6 TiB)  TX bytes:11957754120037 (10.8 TiB)
>
> eth2.911  Link encap:Ethernet  HWaddr 00:02:C9:04:73:F8
>           inet addr:93.180.7.38  Bcast:93.180.7.63  Mask:255.255.255.224
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:285174723322 (265.5 GiB)  TX bytes:11523163526058 (10.4 TiB)
>
> eth3      Link encap:Ethernet  HWaddr 00:02:C9:04:73:F9
>           inet addr:10.2.251.14  Bcast:10.2.251.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
>           TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:324195989293 (301.9 GiB)  TX bytes:770299202886 (717.3 GiB)
>
> Ifconfig uses the ioctl access method to get the full address information,
> which limits hardware addresses to 8 bytes.
> Because an InfiniBand address has 20 bytes, only the first 8 bytes are displayed
> correctly.
> Ifconfig is obsolete! For replacement check ip.
> ib0       Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:10.128.0.4  Bcast:10.128.255.255  Mask:255.255.0.0
>           UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>           RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
>           collisions:0 txqueuelen:1024
>           RX bytes:939249464 (895.7 MiB)  TX bytes:886054008 (845.0 MiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:132750916041 (123.6 GiB)  TX bytes:132750916041 (123.6 GiB)
>
>
> Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain <r...@open-mpi.org>:
>
> I think something may be messed up with your installation. I went ahead and
> tested this on a Slurm 2.5.4 cluster, and got the following:
>
> $ time mpirun -np 1 --host bend001 ./hello
> Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
> real 0m0.086s
> user 0m0.039s
> sys 0m0.046s
>
> $ time mpirun -np 1 --host bend002 ./hello
> Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
> real 0m0.528s
> user 0m0.021s
> sys 0m0.023s
>
> Which is what I would have expected. With --host set to the local host, no
> daemons are being launched, so the time is quite short (just the time spent
> mapping and fork/exec'ing). With --host set to a single remote host, you add
> the time it takes Slurm to launch our daemon on the remote host, so you get
> about half a second.
>
> IIRC, you were having some problems with the OOB setup. If you specify the
> TCP interface to use, does your time come down?
>
>
> On Aug 26, 2014, at 8:32 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
>> I'm using Slurm 2.5.6.
>>
>> $ salloc -N8 --exclusive -J ompi -p test
>>
>> $ srun hostname
>> node1-128-21
>> node1-128-24
>> node1-128-22
>> node1-128-26
>> node1-128-27
>> node1-128-20
>> node1-128-25
>> node1-128-23
>>
>> $ time mpirun -np 1 --host node1-128-21 ./hello_c
>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug
>> 21, 2014 (nightly snapshot tarball), 146)
>>
>> real 1m3.932s
>> user 0m0.035s
>> sys 0m0.072s
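[Note: the hello_c used throughout this thread is essentially Open MPI's stock example, shipped as examples/hello_c.c in the source tree (the shipped version also prints the Open MPI version string, as seen in the output above). A minimal equivalent sketch, for reference; the point of these timings is that the measured minute is start-up overhead (ORTE daemon launch and wire-up), not the application's own work:

  /* hello_c.c - minimal stand-in for Open MPI's examples/hello_c.c */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);               /* runtime wire-up happens here */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* this process's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */
      printf("Hello, world, I am %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }

Build and time it the same way as in the thread: $ mpicc hello_c.c -o hello_c && time mpirun -np 1 ./hello_c ]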
>>
>> Tue, 26 Aug 2014 07:03:58 -0700 from Ralph Castain <r...@open-mpi.org>:
>> Hmmm... what is your allocation like? Do you have a large hostfile, for
>> example?
>>
>> If you add a --host argument that contains just the local host, what is the
>> time for that scenario?
>>
>> On Aug 26, 2014, at 6:27 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>
>>> Hello!
>>> Here are my time results:
>>>
>>> $ time mpirun -n 1 ./hello_c
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
>>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug
>>> 21, 2014 (nightly snapshot tarball), 146)
>>>
>>> real 1m3.985s
>>> user 0m0.031s
>>> sys 0m0.083s
>>>
>>>
>>> Fri, 22 Aug 2014 07:43:03 -0700 from Ralph Castain <r...@open-mpi.org>:
>>> I'm also puzzled by your timing statement - I can't replicate it:
>>>
>>> 07:41:43 $ time mpirun -n 1 ./hello_c
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001
>>> Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer
>>> copy, 125)
>>>
>>> real 0m0.547s
>>> user 0m0.043s
>>> sys 0m0.046s
>>>
>>> The entire thing ran in 0.5 seconds.
>>>
>>>
>>> On Aug 22, 2014, at 6:33 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>
>>>> Hi,
>>>> The default delimiter is ";". You can change the delimiter with
>>>> mca_base_env_list_delimiter.
>>>>
>>>>
>>>> On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov <tismagi...@mail.ru>
>>>> wrote:
>>>> Hello!
>>>> If I use the latest nightly snapshot:
>>>> $ ompi_info -V
>>>> Open MPI v1.9a1r32570
>>>>
>>>> In the program hello_c, initialization takes ~1 min.
>>>> In OMPI 1.8.2rc4 and earlier it took ~1 sec (or less).
>>>>
>>>> If I use
>>>> $ mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 ./hello_c
>>>> I get this error:
>>>> config_parser.c:657 MXM ERROR Invalid value for SHM_KCOPY_MODE:
>>>> 'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>>> but with -x everything works fine (though with a warning):
>>>> $ mpirun -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>>> WARNING: The mechanism by which environment variables are explicitly
>>>> ..............
>>>> ..............
>>>> ..............
>>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
>>>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug
>>>> 21, 2014 (nightly snapshot tarball), 146)
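[Note: putting Mike's answer into practice on the command quoted above: with the default ";" delimiter, the two assignments stay separate, so MXM no longer sees "off,OMP_NUM_THREADS=8" as a single value. A sketch, not verified on this cluster; the single quotes keep the shell from treating ";" as a command separator:

  $ mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off;OMP_NUM_THREADS=8' \
        --map-by slot:pe=8 -np 1 ./hello_c
]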
>>>>
>>>> Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain <r...@open-mpi.org>:
>>>> Not sure I understand. The problem has been fixed in both the trunk and
>>>> the 1.8 branch now, so you should be able to work with either of those
>>>> nightly builds.
>>>>
>>>> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>
>>>>> Do I have any way to run MPI jobs in the meantime?
>>>>>
>>>>>
>>>>> Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain <r...@open-mpi.org>:
>>>>> Yes, I know - it is cmr'd.
>>>>>
>>>>> On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>>>>>
>>>>>> BTW, we get the same error in the v1.8 branch as well.
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> It was not yet fixed - but should be now.
>>>>>>
>>>>>> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>
>>>>>>> Hello!
>>>>>>>
>>>>>>> As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I
>>>>>>> still have the problem:
>>>>>>>
>>>>>>> a)
>>>>>>> $ mpirun -np 1 ./hello_c
>>>>>>> --------------------------------------------------------------------------
>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>> of factors, including an inability to create a connection back
>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> route found between them. Please check network connectivity
>>>>>>> (including firewalls and network routing requirements).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> b)
>>>>>>> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>>>> --------------------------------------------------------------------------
>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>> of factors, including an inability to create a connection back
>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> route found between them. Please check network connectivity
>>>>>>> (including firewalls and network routing requirements).
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> c)
>>>>>>> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca
>>>>>>> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10
>>>>>>> -np 1 ./hello_c
>>>>>>>
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>>>>> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>>>>>> [compiler-2:14673] mca: base: components_register: registering oob components
>>>>>>> [compiler-2:14673] mca: base: components_register: found loaded component tcp
>>>>>>> [compiler-2:14673] mca: base: components_register: component tcp register function successful
>>>>>>> [compiler-2:14673] mca: base: components_open: opening oob components
>>>>>>> [compiler-2:14673] mca: base: components_open: found loaded component tcp
>>>>>>> [compiler-2:14673] mca: base: components_open: component tcp open function successful
>>>>>>> [compiler-2:14673] mca:oob:select: checking available component tcp
>>>>>>> [compiler-2:14673] mca:oob:select: Querying component [tcp]
>>>>>>> [compiler-2:14673] oob:tcp: component_available called
>>>>>>> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>>>> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>>>> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>>>> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>>>> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>>>> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>>>>>>> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>>>> [compiler-2:14673] [[49095,0],0] TCP STARTUP
>>>>>>> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>>>>>>> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>>>>>>> [compiler-2:14673] mca:oob:select: Adding component to end
>>>>>>> [compiler-2:14673] mca:oob:select: Found 1 active transports
>>>>>>> [compiler-2:14673] mca: base: components_register: registering rml components
>>>>>>> [compiler-2:14673] mca: base: components_register: found loaded component oob
>>>>>>> [compiler-2:14673] mca: base: components_register: component oob has no register or open function
>>>>>>> [compiler-2:14673] mca: base: components_open: opening rml components
>>>>>>> [compiler-2:14673] mca: base: components_open: found loaded component oob
>>>>>>> [compiler-2:14673] mca: base: components_open: component oob open function successful
>>>>>>> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>>>>>> [compiler-2:14673] [[49095,0],0] posting recv
>>>>>>> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>>>>>> Daemon was launched on node1-128-01 - beginning to initialize
>>>>>>> --------------------------------------------------------------------------
>>>>>>> WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>>>> value will be ignored.
>>>>>>>
>>>>>>> Local host: node1-128-01
>>>>>>> Value: "ib0"
>>>>>>> Message: Invalid specification (missing "/")
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> None of the TCP networks specified to be included for out-of-band communications
>>>>>>> could be found:
>>>>>>>
>>>>>>> Value given:
>>>>>>>
>>>>>>> Please revise the specification and try again.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> No network interfaces were found for out-of-band communications. We require
>>>>>>> at least one available network for out-of-band messaging.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>>> fail during orte_init; some of which are due to configuration or
>>>>>>> environment problems. This failure appears to be an internal failure;
>>>>>>> here's some additional information (which may only be relevant to an
>>>>>>> Open MPI developer):
>>>>>>>
>>>>>>> orte_oob_base_select failed
>>>>>>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>> --------------------------------------------------------------------------
>>>>>>> srun: error: node1-128-01: task 0: Exited with exit code 213
>>>>>>> srun: Terminating job step 661215.0
>>>>>>> --------------------------------------------------------------------------
>>>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> communicating back to mpirun. This could be caused by a number
>>>>>>> of factors, including an inability to create a connection back
>>>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> route found between them. Please check network connectivity
>>>>>>> (including firewalls and network routing requirements).
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
>>>>>>> [compiler-2:14673] mca: base: close: component oob closed
>>>>>>> [compiler-2:14673] mca: base: close: unloading component oob
>>>>>>> [compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
>>>>>>> [compiler-2:14673] mca: base: close: component tcp closed
>>>>>>> [compiler-2:14673] mca: base: close: unloading component tcp
>>>>>>>
>>>>>>>
>>>>>>> Tue, 12 Aug 2014 18:33:24 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
>>>>>>> I filed the following ticket:
>>>>>>>
>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/4857
>>>>>>>
>>>>>>>
>>>>>>> On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>> > (please keep the users list CC'ed)
>>>>>>> >
>>>>>>> > We talked about this on the weekly engineering call today. Ralph has
>>>>>>> > an idea what is happening -- I need to do a little investigation
>>>>>>> > today and file a bug. I'll make sure you're CC'ed on the bug ticket.
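[Note: the warning text above also suggests an interim workaround: since the broken parser complains about a missing "/", giving oob_tcp_if_include a CIDR network specification instead of an interface name may get past it. A sketch, assuming ib0's subnet from the ifconfig output earlier in the thread (10.128.0.4 with mask 255.255.0.0, i.e. 10.128.0.0/16); untested here:

  $ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 1 ./hello_c
]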
>>>>>>> >
>>>>>>> >
>>>>>>> > On Aug 12, 2014, at 12:27 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>> >
>>>>>>> >> I don't have this error in OMPI 1.9a1r32252 or OMPI 1.8.1 (with
>>>>>>> >> --mca oob_tcp_if_include ib0), but in all of the latest nightly
>>>>>>> >> snapshots I get this error.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> Tue, 12 Aug 2014 13:08:12 +0000 from "Jeff Squyres (jsquyres)" <jsquy...@cisco.com>:
>>>>>>> >> Are you running any kind of firewall on the node where mpirun is
>>>>>>> >> invoked? Open MPI needs to be able to use arbitrary TCP ports
>>>>>>> >> between the servers on which it runs.
>>>>>>> >>
>>>>>>> >> This second mail seems to imply a bug in OMPI's oob_tcp_if_include
>>>>>>> >> param handling, however -- it's supposed to be able to handle an
>>>>>>> >> interface name (not just a network specification).
>>>>>>> >>
>>>>>>> >> Ralph -- can you have a look?
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Aug 12, 2014, at 8:41 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>> >>
>>>>>>> >>> When I add --mca oob_tcp_if_include ib0 (the InfiniBand interface) to
>>>>>>> >>> mpirun (as was suggested here:
>>>>>>> >>> http://www.open-mpi.org/community/lists/users/2014/07/24857.php ) I
>>>>>>> >>> get this output:
>>>>>>> >>>
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [isolated]
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [rsh]
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Querying component [slurm]
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>>>>> >>> [compiler-2:08792] mca:base:select:( plm) Selected component [slurm]
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: registering oob components
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: found loaded component tcp
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: component tcp register function successful
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: opening oob components
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: found loaded component tcp
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: component tcp open function successful
>>>>>>> >>> [compiler-2:08792] mca:oob:select: checking available component tcp
>>>>>>> >>> [compiler-2:08792] mca:oob:select: Querying component [tcp]
>>>>>>> >>> [compiler-2:08792] oob:tcp: component_available called
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>>>>>>> >>> [compiler-2:08792] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] TCP STARTUP
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] attempting to bind to IPv4 port 0
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] assigned IPv4 port 53883
>>>>>>> >>> [compiler-2:08792] mca:oob:select: Adding component to end
>>>>>>> >>> [compiler-2:08792] mca:oob:select: Found 1 active transports
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: registering rml components
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: found loaded component oob
>>>>>>> >>> [compiler-2:08792] mca: base: components_register: component oob has no register or open function
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: opening rml components
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: found loaded component oob
>>>>>>> >>> [compiler-2:08792] mca: base: components_open: component oob open function successful
>>>>>>> >>> [compiler-2:08792] orte_rml_base_select: initializing rml component oob
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting recv
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> Daemon was launched on node1-128-01 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-128-02 - beginning to initialize
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>>>> >>> value will be ignored.
>>>>>>> >>>
>>>>>>> >>> Local host: node1-128-01
>>>>>>> >>> Value: "ib0"
>>>>>>> >>> Message: Invalid specification (missing "/")
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> WARNING: An invalid value was given for oob_tcp_if_include. This
>>>>>>> >>> value will be ignored.
>>>>>>> >>>
>>>>>>> >>> Local host: node1-128-02
>>>>>>> >>> Value: "ib0"
>>>>>>> >>> Message: Invalid specification (missing "/")
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> None of the TCP networks specified to be included for out-of-band communications
>>>>>>> >>> could be found:
>>>>>>> >>>
>>>>>>> >>> Value given:
>>>>>>> >>>
>>>>>>> >>> Please revise the specification and try again.
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> None of the TCP networks specified to be included for out-of-band communications
>>>>>>> >>> could be found:
>>>>>>> >>>
>>>>>>> >>> Value given:
>>>>>>> >>>
>>>>>>> >>> Please revise the specification and try again.
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> No network interfaces were found for out-of-band communications. We require
>>>>>>> >>> at least one available network for out-of-band messaging.
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> No network interfaces were found for out-of-band communications. We require
>>>>>>> >>> at least one available network for out-of-band messaging.
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>> >>> likely to abort. There are many reasons that a parallel process can
>>>>>>> >>> fail during orte_init; some of which are due to configuration or
>>>>>>> >>> environment problems.
>>>>>>> >>> This failure appears to be an internal failure;
>>>>>>> >>> here's some additional information (which may only be relevant to an
>>>>>>> >>> Open MPI developer):
>>>>>>> >>>
>>>>>>> >>> orte_oob_base_select failed
>>>>>>> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> It looks like orte_init failed for some reason; your parallel process is
>>>>>>> >>> likely to abort. There are many reasons that a parallel process can
>>>>>>> >>> fail during orte_init; some of which are due to configuration or
>>>>>>> >>> environment problems. This failure appears to be an internal failure;
>>>>>>> >>> here's some additional information (which may only be relevant to an
>>>>>>> >>> Open MPI developer):
>>>>>>> >>>
>>>>>>> >>> orte_oob_base_select failed
>>>>>>> >>> --> Returned value (null) (-43) instead of ORTE_SUCCESS
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> srun: error: node1-128-02: task 1: Exited with exit code 213
>>>>>>> >>> srun: Terminating job step 657300.0
>>>>>>> >>> srun: error: node1-128-01: task 0: Exited with exit code 213
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> >>> communicating back to mpirun. This could be caused by a number
>>>>>>> >>> of factors, including an inability to create a connection back
>>>>>>> >>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> >>> route found between them. Please check network connectivity
>>>>>>> >>> (including firewalls and network routing requirements).
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd
>>>>>>> >>> [compiler-2:08792] mca: base: close: component oob closed
>>>>>>> >>> [compiler-2:08792] mca: base: close: unloading component oob
>>>>>>> >>> [compiler-2:08792] [[42190,0],0] TCP SHUTDOWN
>>>>>>> >>> [compiler-2:08792] mca: base: close: component tcp closed
>>>>>>> >>> [compiler-2:08792] mca: base: close: unloading component tcp
>>>>>>> >>>
>>>>>>> >>>
>>>>>>> >>> Tue, 12 Aug 2014 16:14:58 +0400 from Timur Ismagilov <tismagi...@mail.ru>:
>>>>>>> >>> Hello!
>>>>>>> >>>
>>>>>>> >>> I have Open MPI v1.8.2rc4r32485.
>>>>>>> >>>
>>>>>>> >>> When I run hello_c, I get this error message:
>>>>>>> >>> $ mpirun -np 2 hello_c
>>>>>>> >>>
>>>>>>> >>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> >>> communicating back to mpirun. This could be caused by a number
>>>>>>> >>> of factors, including an inability to create a connection back
>>>>>>> >>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> >>> route found between them. Please check network connectivity
>>>>>>> >>> (including firewalls and network routing requirements).
>>>>>>> >>>
>>>>>>> >>> When I run with --debug-daemons --mca plm_base_verbose 5 -mca
>>>>>>> >>> oob_base_verbose 10 -mca rml_base_verbose 10, I get this output:
>>>>>>> >>> $ mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
>>>>>>> >>>
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [isolated]
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [rsh]
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Querying component [slurm]
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>>>>> >>> [compiler-2:08780] mca:base:select:( plm) Selected component [slurm]
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: registering oob components
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: found loaded component tcp
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: component tcp register function successful
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: opening oob components
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: found loaded component tcp
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: component tcp open function successful
>>>>>>> >>> [compiler-2:08780] mca:oob:select: checking available component tcp
>>>>>>> >>> [compiler-2:08780] mca:oob:select: Querying component [tcp]
>>>>>>> >>> [compiler-2:08780] oob:tcp: component_available called
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our list of V4 connections
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 connections
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our list of V4 connections
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>>>>>>> >>> [compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our list of V4 connections
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] TCP STARTUP
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420
>>>>>>> >>> [compiler-2:08780] mca:oob:select: Adding component to end
>>>>>>> >>> [compiler-2:08780] mca:oob:select: Found 1 active transports
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: registering rml components
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: found loaded component oob
>>>>>>> >>> [compiler-2:08780] mca: base: components_register: component oob has no register or open function
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: opening rml components
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: found loaded component oob
>>>>>>> >>> [compiler-2:08780] mca: base: components_open: component oob open function successful
>>>>>>> >>> [compiler-2:08780] orte_rml_base_select: initializing rml component oob
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting recv
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>>>>>> >>> Daemon was launched on node1-130-08 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-03 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-05 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-02 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-01 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-04 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-07 - beginning to initialize
>>>>>>> >>> Daemon was launched on node1-130-06 - beginning to initialize
>>>>>>> >>> Daemon [[42202,0],3] checking in as pid 7178 on host node1-130-03
>>>>>>> >>> [node1-130-03:07178] [[42202,0],3] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],2] checking in as pid 13581 on host node1-130-02
>>>>>>> >>> [node1-130-02:13581] [[42202,0],2] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],1] checking in as pid 17220 on host node1-130-01
>>>>>>> >>> [node1-130-01:17220] [[42202,0],1] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],5] checking in as pid 6663 on host node1-130-05
>>>>>>> >>> [node1-130-05:06663] [[42202,0],5] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],8] checking in as pid 6683 on host node1-130-08
>>>>>>> >>> [node1-130-08:06683] [[42202,0],8] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],7] checking in as pid 7877 on host node1-130-07
>>>>>>> >>> [node1-130-07:07877] [[42202,0],7] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],4] checking in as pid 7735 on host node1-130-04
>>>>>>> >>> [node1-130-04:07735] [[42202,0],4] orted: up and running - waiting for commands!
>>>>>>> >>> Daemon [[42202,0],6] checking in as pid 8451 on host node1-130-06
>>>>>>> >>> [node1-130-06:08451] [[42202,0],6] orted: up and running - waiting for commands!
>>>>>>> >>> srun: error: node1-130-03: task 2: Exited with exit code 1
>>>>>>> >>> srun: Terminating job step 657040.1
>>>>>>> >>> srun: error: node1-130-02: task 1: Exited with exit code 1
>>>>>>> >>> slurmd[node1-130-04]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
>>>>>>> >>> slurmd[node1-130-07]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
>>>>>>> >>> slurmd[node1-130-06]: *** STEP 657040.1 KILLED AT 2014-08-12T12:59:07 WITH SIGNAL 9 ***
>>>>>>> >>> srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
>>>>>>> >>> srun: error: node1-130-01: task 0: Exited with exit code 1
>>>>>>> >>> srun: error: node1-130-05: task 4: Exited with exit code 1
>>>>>>> >>> srun: error: node1-130-08: task 7: Exited with exit code 1
>>>>>>> >>> srun: error: node1-130-07: task 6: Exited with exit code 1
>>>>>>> >>> srun: error: node1-130-04: task 3: Killed
>>>>>>> >>> srun: error: node1-130-06: task 5: Killed
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> An ORTE daemon has unexpectedly failed after launch and before
>>>>>>> >>> communicating back to mpirun. This could be caused by a number
>>>>>>> >>> of factors, including an inability to create a connection back
>>>>>>> >>> to mpirun due to a lack of common network interfaces and/or no
>>>>>>> >>> route found between them. Please check network connectivity
>>>>>>> >>> (including firewalls and network routing requirements).
>>>>>>> >>> --------------------------------------------------------------------------
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] orted_cmd: received halt_vm cmd
>>>>>>> >>> [compiler-2:08780] mca: base: close: component oob closed
>>>>>>> >>> [compiler-2:08780] mca: base: close: unloading component oob
>>>>>>> >>> [compiler-2:08780] [[42202,0],0] TCP SHUTDOWN
>>>>>>> >>> [compiler-2:08780] mca: base: close: component tcp closed
>>>>>>> >>> [compiler-2:08780] mca: base: close: unloading component tcp
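[Note: since the boilerplate error in every failed run above points at firewalls, one quick sanity check on the mpirun node and a compute node is to look for REJECT/DROP rules; this assumes the nodes use iptables, which is a guess for this cluster:

  $ sudo iptables -L -n | grep -E 'REJECT|DROP'

Open MPI's out-of-band channel needs arbitrary TCP ports to be reachable between the mpirun node and the compute nodes.]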
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25154.php