Re: [OMPI users] OpenIB problems

2007-11-29 Thread Neeraj Chourasia
Hi Guys,

   An alternative for the THREAD_MULTIPLE problem is to pass the --mca
mpi_leave_pinned 1 option to mpirun. This ensures a single RDMA operation
instead of splitting the data into chunks of the maximum RDMA size (default 1 MB).

If your data size is small, say below 1 MB, the program will run fine with
THREAD_MULTIPLE. The problem appears when the data size grows and Open MPI
starts splitting it.

I think the program works even with bigger sizes when the interconnect is TCP,
but it fails on IB. So on IB you can run your program if you set the MCA
parameter mpi_leave_pinned to 1.

Cheers
Neeraj



On Thu, 29 Nov 2007 Brock Palen wrote :
>Jeff thanks for all the replies,
>
>Hate to admit but at the moment we can't log onto the switch.
>
>But the ibcheckerrors command returns nothing out of bounds, and I
>think that command also checks the switch ports.
>
>Thanks, we will do some tests
>
>Brock Palen
>Center for Advanced Computing
>bro...@umich.edu
>(734)936-1985
>
>
>On Nov 27, 2007, at 4:50 PM, Jeff Squyres wrote:
>
> > Sorry for jumping in late; the holiday and other travel prevented me
> > from getting to all my mail recently...  :-\
> >
> > Have you checked the counters on the subnet manager to see if any
> > other errors are occurring?  It might be good to clear all the
> > counters, run the job, and see if the counters are increasing faster
> > than they should (i.e., any particular counter should advance very
> > very slowly -- perhaps 1 per day or so).
> >
> > I'll ask around the kernel-level guys (i.e., Roland) to see what else
> > could cause this kind of error.
> >
> >
> >
> > On Nov 27, 2007, at 3:35 PM, Brock Palen wrote:
> >
> >> Ok i will open a case with cisco,
> >>
> >>
> >> Brock Palen
> >> Center for Advanced Computing
> >> bro...@umich.edu
> >> (734)936-1985
> >>
> >>
> >> On Nov 27, 2007, at 4:19 PM, Andrew Friedley wrote:
> >>
> >>>
> >>>
> >>> Brock Palen wrote:
> >> What would be a place to look?  Should this just be the default then
> >> for OMPI?  ompi_info shows the default as 10 seconds?  Is that right,
> >> 'seconds'?
> > The other IB guys can probably answer better than I can -- I'm not an
> > expert in this part of IB (or really any part I guess :).  Not sure why
> > a larger value isn't the default.  No, it's not seconds -- check the
> > description of the MCA parameter:
> >
> > 4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>  You sure?
>  ompi_info --param btl openib
> 
>  MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
>    InfiniBand transmit timeout, in seconds
>  (must be >= 1)
> >>>
> >>> Yeah:
> >>>
> >>> MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
> >>>  InfiniBand transmit timeout, plugged into formula:
> >>>  4.096 microseconds * (2^btl_openib_ib_timeout)
> >>>  (must be >= 0 and <= 31)
> >>>
> >>> Reading earlier in the thread you said OMPI v1.2.0; I got this from a
> >>> trunk checkout that's around 3 weeks old.  A quick check shows this
> >>> description was changed between 1.2.0 and 1.2.1.  However the use of
> >>> this parameter hasn't changed -- it's simply passed along to IB verbs
> >>> when creating a queue pair (aka a connection).
> >>>
> >>> Andrew
> >>> ___
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>>
> >>>
> >>
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > --
> > Jeff Squyres
> > Cisco Systems
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running

2007-11-29 Thread Keshetti Mahesh
> There is work starting literally right about now to allow Open MPI to
> use the RDMA CM and/or the IBCM for creating OpenFabrics connections
> (IB or iWARP).

When is this expected to be completed?

-Mahesh


Re: [OMPI users] ./configure error on windows while installing openmpi-1.2.4(latest)

2007-11-29 Thread geetha r
Hi George,
Thanks for your reply. I passed the --disable-mpi-f77 option to
the configure script, but now configure fails for the following reason.

configure: error: Cannot support Fortran MPI_ADDRESS_KIND!

  Can you please let me know how to get rid of this problem (i.e., what option
to pass)?

Thanks,
Geetha


On 11/28/07, George Bosilca  wrote:
>
> If your F77 compiler does not support arrays of LOGICAL variables (which
> seems to be the case if you look in the config.log file), then you're
> left with only one option: remove the F77 support from the
> compilation. This means adding the --disable-mpi-f77 option to
> ./configure.
>
>   Thanks,
> george.
>
> On Nov 28, 2007, at 9:24 AM, geetha r wrote:
>
> > Hi,
> >Subject: "Need exact command line for ./configure
> > {optionslist} "  to build OPENMPI-1.2.4 on windows."
> >
> >
> > While the configure script is checking the FORTRAN 77 compiler, I am
> > getting the following error, so the Open MPI build is unsuccessful on
> > Windows (with the configure script):
> >
> >  checking for correct handling of FORTRAN logical arrays... no
> > configure: error: Error determining if arrays of logical values work
> > properly.
> >
> >
> > I want to build openmpi-1.2.4 (which was downloaded from MINGW) on a
> > Windows 2000 machine.
> >
> > Can somebody give me the proper build command I can use to build OpenMPI
> > on a Windows 2000 machine?
> >
> > i.e
> >
> >  ./configure  ...(options list)
> >
> > Can somebody please tell me the exact options to pass in the options list.
> >
> > I am using Cygwin to build OpenMPI on Windows.
> >
> > PS:
> > I am attaching the output files.
> >
> > config.log -> actual log file.
> > config.out -> output of the ./configure run
> > make.out -> fails because the configure build was unsuccessful on Windows.
> > make.install -> fails because the configure build was unsuccessful on Windows.
> >
> >
> > PS: I am using g77, g++, and gcc, all from the MINGW package.
> >
> > I have downloaded and added g95 as well, but that does not solve my
> > problem.
> >
> > Thanks,
> > Geetha
> >
> >
> >
> > <make.install.zip> <make.out.zip> <config.out.zip>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


[OMPI users] configure: error: Cannot support Fortran MPI_ADDRESS_KIND!

2007-11-29 Thread geetha r
Hi Terry,
   Thanks for your reply. The ARRAY of LOGICAL problem is gone when I
used the --disable-mpi-f77 option, but now I am getting the following error:

configure: error: Cannot support Fortran MPI_ADDRESS_KIND!


The option string I am using is as follows:

./configure --disable-mpi-f77 --with-devel-headers

Thanks,
geetha.

On 11/29/07, Terry Frankcombe  wrote:
>
> On Wed, 2007-11-28 at 13:20 -0500, George Bosilca wrote:
> > If your F77 compiler does not support arrays of LOGICAL variables (which
> > seems to be the case if you look in the config.log file), then you're
> > left with only one option: remove the F77 support from the
> > compilation. This means adding the --disable-mpi-f77 option to
> > ./configure.
>
> It's a lot weirder than that.
>
> configure: WARNING: *** Fortran 77 REAL*8 does not have expected size!
> configure: WARNING: *** Expected 8, got 8
> configure: WARNING: *** Disabling MPI support for Fortran 77 REAL*8
>
> Somehow, 8/=8
>
> :-\
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Newbie: Using hostfile

2007-11-29 Thread Madireddy Samuel Vijaykumar
A non-MPI application does run without any issues. Could you elaborate on
what you mean by doing mpirun "hostname"? Do you mean I should just do
'mpirun lynx' in my case?

On Nov 28, 2007 9:57 PM, Jeff Squyres  wrote:
> Well, that's odd.
>
> What happens if you try to mpirun "hostname" (i.e., a non-MPI
> application)?  Does it run, or does it hang?
>
>
>
> On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote:
>
> > I have been using clusters for some tests. My localhost is "lynx",
> > and I have "puma" and "tiger" which make up the cluster. All have
> > passwordless ssh enabled. Now if I have the following in my
> > hostfile (one per line, in this order)
> >
> > lynx
> > puma
> > tiger
> >
> > My tests(from lynx) run over the cluster without any issues.
> >
> > But if I move or remove lynx, using either (one per line, in the same
> > order)
> >
> > puma
> > lynx
> > tiger
> >
> > or
> >
> > puma
> > tiger
> >
> > my test (from lynx) just does not get anywhere. It just hangs and
> > does not proceed at all. Is this an issue with the way my script handles
> > the cluster nodes, or is there a required format for the hostfile? Thanks.
> >
> > --
> > Sam aka Vijju
> > :)~
> > Linux: Open, True and Cool
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
Sam aka Vijju
:)~
Linux: Open, True and Cool


Re: [OMPI users] Run a process double

2007-11-29 Thread Reuti

Hi,

On 29.11.2007, at 00:02, Henry Adolfo Lambis Miranda wrote:


This is my first post to the mailing list.
I have installed openmp 1.2.4 on an x86_64 AMD dual-processor machine with
SuSE Linux.
In principle, the installation was successful, with ifort 10.X.
But when I run any code (mpirun -np 2 a.out), instead of sharing the
calculation between the two processors, the system duplicates the
executable and sends one copy to each processor.


This seems to be fine. What were you expecting? With OpenMP you will
see threads, and with Open MPI processes.
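
To illustrate the difference (a minimal sketch, not code from this thread):
with "mpirun -np 2 a.out" each of the two copies is a separate process, and
the program itself decides, by rank, which share of the work each copy does:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies total?  */

    /* each process handles its own share of the calculation, chosen by rank */
    printf("process %d of %d working on its part\n", rank, size);

    MPI_Finalize();
    return 0;
}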


-- Reuti




I don't know what the h$%& is going on...



regards..

Henry

--
Henry Adolfo Lambis Miranda,Chem.Eng.
Molecular Simulation Group  I & II
Rovira i Virgili University.
http://www.etseq.urv.es/ms
Av. Països Catalans, 26
C.P. 43007. Tarragona, Catalunya
Espanya.


"No podr?s quedarte en casa, hermano.
No podr?s encender, apagar y olvidarte
() Porque la revoluci?n no ser? televisada".
Gil Scott-Heron (The Revolution Will Not Be Televised, 1974)

Success is a rather repugnant thing. Its false resemblance to merit
deceives men. -- Victor Hugo (1802-1885), French novelist.


The military is a plant that must be tended carefully so that it does not
bear fruit. -- Jacques Tati.


"La libertad viene en paquetes peque?os, usualmente TCP/IP"

Colombian Reality bite:
http://www.youtube.com/watch?v=jn3vM_5kIgM

http://en.wikipedia.org/wiki/Cartagena,_Colombia

http://www.youtube.com/watch?v=cvxMWSsrwg0

http://www.youtube.com/watch?v=eVmYf5U6x3k










___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] Newbie: Using hostfile

2007-11-29 Thread Jeff Squyres

On Nov 29, 2007, at 2:09 AM, Madireddy Samuel Vijaykumar wrote:


A non-MPI application does run without any issues. Could you elaborate on
what you mean by doing mpirun "hostname"? Do you mean I should just do
'mpirun lynx' in my case?


No, I mean

   mpirun --hostfile <your hostfile> hostname

This should run the "hostname" command on each of your nodes.  If
running "hostname" doesn't work after changing the order, then
something is very wrong.  If it *does* work, it implies that something
is faulty in the MPI startup (which is more complicated than
starting up non-MPI applications).
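
For example (the hostfile name "myhosts" below is just a placeholder for
whatever your hostfile is called), you should see one line of output per
host, in no particular order:

   $ mpirun --hostfile myhosts hostname
   puma
   lynx
   tiger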




On Nov 28, 2007 9:57 PM, Jeff Squyres  wrote:

Well, that's odd.

What happens if you try to mpirun "hostname" (i.e., a non-MPI
application)?  Does it run, or does it hang?



On Nov 23, 2007, at 6:00 AM, Madireddy Samuel Vijaykumar wrote:


I have been using clusters for some tests. My localhost is "lynx",
and I have "puma" and "tiger" which make up the cluster. All have
passwordless ssh enabled. Now if I have the following in my
hostfile (one per line, in this order)

lynx
puma
tiger

My tests(from lynx) run over the cluster without any issues.

But if I move or remove lynx, using either (one per line, in the same
order)

puma
lynx
tiger

or

puma
tiger

my test (from lynx) just does not get anywhere. It just hangs and
does not proceed at all. Is this an issue with the way my script handles
the cluster nodes, or is there a required format for the hostfile? Thanks.

--
Sam aka Vijju
:)~
Linux: Open, True and Cool
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--
Sam aka Vijju
:)~
Linux: Open, True and Cool
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems



Re: [OMPI users] [openMPI-infiniband] openMPI in IB network when openSM with LASH is running

2007-11-29 Thread Jeff Squyres

On Nov 29, 2007, at 12:08 AM, Keshetti Mahesh wrote:


There is work starting literally right about now to allow Open MPI to
use the RDMA CM and/or the IBCM for creating OpenFabrics connections
(IB or iWARP).


When is this expected to be completed?



It is not planned to be released until the v1.3 series.

See

http://www.open-mpi.org/community/lists/users/2007/11/4535.php
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Ralph H Castain
Hi Bob

I'm afraid the person most familiar with the oob subsystem recently left the
project, so we are somewhat hampered at the moment. I don't recognize the
"Software caused connection abort" error message - it doesn't appear to be
one of ours (at least, I couldn't find it anywhere in our code base, though
I can't swear it isn't there in some dark corner), and I don't find it in my
own sys/errno.h file.

With those caveats, all I can say is that something appears to be blocking
the connection from your remote node back to the head node. Are you sure
both nodes are available on IPv4 (since you disabled IPv6)? Can you try
ssh'ing to the remote node and doing a ping to the head node using the IPv4
interface?
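
For example (host names and addresses taken from the log quoted below; use
whichever IPv4 interface you actually intend the job to use):

  ssh max15
  ping -c 3 192.168.2.14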

Do you have another method you could use to check and see if max14 will
accept connections from max15? If I interpret the error message correctly,
it looks like something in the connect handshake is being aborted. We try a
couple of times, but then give up and try other interfaces - since no other
interface is available, you get that other error message and we abort.

Sorry I can't be more help - like I said, this is now a weak spot in our
coverage that needs to be rebuilt.

Ralph



On 11/28/07 2:41 PM, "Bob Soliday"  wrote:

> I am new to openmpi and have a problem that I cannot seem to solve.
> I am trying to run the hello_c example and I can't get it to work.
> I compiled openmpi with:
> 
> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
> --with-openib
> 
> The hostname file contains the local host and one other node. When I
> run it I get:
> 
> 
> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --
> debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2
> hello_c
> [max14:31465] [0,0,0] accepting connections via event library
> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
> [max14:31466] [0,0,1] accepting connections via event library
> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting
> port 55152 to: 192.168.2.14:38852
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
> sending ack, 0
> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14
> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
> [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14
> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> Daemon [0,0,1] checking in as pid 31466 on host max14
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
> 192.168.1.14:38852 failed: Software caused connection abort (103)
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
> 192.168.1.14:38852 failed: Software caused connection abort (103)
> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
> 192.168.1.14:38852 failed, connecting over all interfaces failed!
> [max15:28222] OOB: Connection to HNP lost
> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/
> pls_base_orted_cmds.c at line 275
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
> at line 1166
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
> line 90
> [max14:31465] ERROR: A daemon on node max15 failed to start as expected.
> [max14:31465] ERROR: There may be more information available from
> [max14:31465] ERROR: the remote shell (see above).
> [max14:31465] ERROR: The daemon exited unexpectedly with status 1.
> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [max14:31466] [0,0,1] orted_recv_pls: received exit
> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed
> connection
> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6
> state 4
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/
> pls_base_orted_cmds.c at line 188
> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
> at line 1198
> --
> mpirun was unable to cleanly terminate the daemons for this job.
> Returned value Timeout instead of ORTE_SUCCESS.
> --
> 
> 
> 
> I can see that the orted daemon program is starting on both computers
> but I

Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c

On Linux 2.6 I have read that after a failed connect() system call, the next
call to connect() can immediately return ECONNABORTED without actually trying
to connect; the call after that will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect()
again. The hello_c example program is now working.

I don't think this has solved the underlying cause of why connect() is
failing in the first place, but at least now I can move on to the next step. My
best guess at the moment is that it is using eth0 initially when I want it
to use eth1. This fails, and then when it moves on to eth1 I run into the
"can't call connect() right after it just failed" bug.

--Bob


Ralph H Castain wrote:

Hi Bob

I'm afraid the person most familiar with the oob subsystem recently left the
project, so we are somewhat hampered at the moment. I don't recognize the
"Software caused connection abort" error message - it doesn't appear to be
one of ours (at least, I couldn't find it anywhere in our code base, though
I can't swear it isn't there in some dark corner), and I don't find it in my
own sys/errno.h file.

With those caveats, all I can say is that something appears to be blocking
the connection from your remote node back to the head node. Are you sure
both nodes are available on IPv4 (since you disabled IPv6)? Can you try
ssh'ing to the remote node and doing a ping to the head node using the IPv4
interface?

Do you have another method you could use to check and see if max14 will
accept connections from max15? If I interpret the error message correctly,
it looks like something in the connect handshake is being aborted. We try a
couple of times, but then give up and try other interfaces - since no other
interface is available, you get that other error message and we abort.

Sorry I can't be more help - like I said, this is now a weak spot in our
coverage that needs to be rebuilt.

Ralph
 



On 11/28/07 2:41 PM, "Bob Soliday"  wrote:


I am new to openmpi and have a problem that I cannot seem to solve.
I am trying to run the hello_c example and I can't get it to work.
I compiled openmpi with:

./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
--with-openib

The hostname file contains the local host and one other node. When I
run it I get:


[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --
debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2
hello_c
[max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting
port 55152 to: 192.168.2.14:38852
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14
nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14
nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
Daemon [0,0,1] checking in as pid 31466 on host max14
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/
pls_base_orted_cmds.c at line 275
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c
at line 1166
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at
line 90
[max14:31465] ERROR: A daemon on node max15 failed to start as expected.
[max14:31465] ERROR: There may be more information available from
[max14:31465] ERROR: the remote shell (see above).
[max14:31465] ERROR: The daemon exited unexpectedly with status 1.
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received exit
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0]-[

Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Jeff Squyres (jsquyres)
Interesting.  Would you mind sharing your patch? 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Bob Soliday
Sent: Thursday, November 29, 2007 11:35 AM
To: Ralph H Castain
Cc: Open MPI Users 
Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

I solved the problem by making a change to
orte/mca/oob/tcp/oob_tcp_peer.c

On Linux 2.6 I have read that after a failed connect system call the
next call to connect can immediately return ECONNABORTED and not try to
actually connect, the next call to connect will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
connect again. The hello_c example script is now working.

I don't think this has solved the underlying cause as to way connect is
failing in the first place but at least now I move on to the next step.
My best guess at the moment is that it is using eth0 initially when I
want it to use eth1. This fails and then when it moves on to eth1 I run
into the "can't call connect after it just failed bug".

--Bob


Ralph H Castain wrote:
> Hi Bob
> 
> I'm afraid the person most familiar with the oob subsystem recently 
> left the project, so we are somewhat hampered at the moment. I don't 
> recognize the "Software caused connection abort" error message - it 
> doesn't appear to be one of ours (at least, I couldn't find it 
> anywhere in our code base, though I can't swear it isn't there in some

> dark corner), and I don't find it in my own sys/errno.h file.
> 
> With those caveats, all I can say is that something appears to be 
> blocking the connection from your remote node back to the head node. 
> Are you sure both nodes are available on IPv4 (since you disabled 
> IPv6)? Can you try ssh'ing to the remote node and doing a ping to the 
> head node using the IPv4 interface?
> 
> Do you have another method you could use to check and see if max14 
> will accept connections from max15? If I interpret the error message 
> correctly, it looks like something in the connect handshake is being 
> aborted. We try a couple of times, but then give up and try other 
> interfaces - since no other interface is available, you get that other
error message and we abort.
> 
> Sorry I can't be more help - like I said, this is now a weak spot in 
> our coverage that needs to be rebuilt.
> 
> Ralph
>  
> 
> 
> On 11/28/07 2:41 PM, "Bob Soliday"  wrote:
> 
>> I am new to openmpi and have a problem that I cannot seem to solve.
>> I am trying to run the hello_c example and I can't get it to work.
>> I compiled openmpi with:
>>
>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6

>> --with-openib
>>
>> The hostname file contains the local host and one other node. When I 
>> run it I get:
>>
>>
>> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun 
>> -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 
>> hello_c [max14:31465] [0,0,0] accepting connections via event library

>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe 
>> [max14:31466] [0,0,1] accepting connections via event library 
>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe 
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466] 
>> [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 
>> to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0] 
>> mca_oob_tcp_peer_complete_connect:
>> sending ack, 0
>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255 
>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 
>> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] 
>> [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 
>> sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466] 
>> [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0]

>> mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0] 
>> mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on 
>> host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 
>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222] 
>> [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
>> 192.168.1.14:38852 failed: Software caused connection abort (103) 
>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect 
>> to
>> 192.168.1.14:38852 failed: Software caused connection abort (103) 
>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect 
>> to
>> 192.168.1.14:38852 failed, connecting over all interfaces failed!
>> [max15:28222] OOB: Connection to HNP lost [max14:31466] [0,0,1] 
>> orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1] 
>> orted_recv_pls: received kill_local_procs [max14:31466] 
>> [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15 [max14:31465] [0,0,0] 
>> ORTE_ERROR_LOG: Timeout in file base/ pls_base_orted_cmds.c at line 
>> 275 [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file 
>> pls_rs

Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

Jeff Squyres (jsquyres) wrote:
Interesting.  Would you mind sharing your patch? 


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Bob Soliday
Sent: Thursday, November 29, 2007 11:35 AM
To: Ralph H Castain
Cc: Open MPI Users 
Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

I solved the problem by making a change to
orte/mca/oob/tcp/oob_tcp_peer.c

On Linux 2.6 I have read that after a failed connect system call the
next call to connect can immediately return ECONNABORTED and not try to
actually connect, the next call to connect will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
connect again. The hello_c example script is now working.

I don't think this has solved the underlying cause as to way connect is
failing in the first place but at least now I move on to the next step.
My best guess at the moment is that it is using eth0 initially when I
want it to use eth1. This fails and then when it moves on to eth1 I run
into the "can't call connect after it just failed bug".

--Bob




I changed oob_tcp_peer.c at line 289 from:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* non-blocking so wait for completion */
  if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
opal_event_add(&peer->peer_send_event, 0);
return ORTE_SUCCESS;
  }
  opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
  "connect to %s:%d failed: %s (%d)",
  ORTE_NAME_ARGS(orte_process_info.my_name),
  ORTE_NAME_ARGS(&(peer->peer_name)),
  inet_ntoa(inaddr.sin_addr),
  ntohs(inaddr.sin_port),
  strerror(opal_socket_errno),
  opal_socket_errno);
  continue;
}


to:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* on Linux a failed connect() can leave the next connect() attempt
     returning ECONNABORTED immediately; retry once in that case */
  if (opal_socket_errno == ECONNABORTED) {
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
opal_event_add(&peer->peer_send_event, 0);
return ORTE_SUCCESS;
  }
  opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
  "connect to %s:%d failed: %s (%d)",
  ORTE_NAME_ARGS(orte_process_info.my_name),
  ORTE_NAME_ARGS(&(peer->peer_name)),
  inet_ntoa(inaddr.sin_addr),
  ntohs(inaddr.sin_port),
  strerror(opal_socket_errno),
  opal_socket_errno);
  continue;
}
  } else {
if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
  opal_event_add(&peer->peer_send_event, 0);
  return ORTE_SUCCESS;
}
opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
"connect to %s:%d failed: %s (%d)",
ORTE_NAME_ARGS(orte_process_info.my_name),
ORTE_NAME_ARGS(&(peer->peer_name)),
inet_ntoa(inaddr.sin_addr),
ntohs(inaddr.sin_port),
strerror(opal_socket_errno),
opal_socket_errno);
continue;
  }
}



[OMPI users] Reinitialize MPI_COMM_WORLD

2007-11-29 Thread Rajesh Sudarsan
Hi,

I have a simple MPI program that uses MPI_Comm_spawn to create additional
child processes. Using MPI_Intercomm_merge, I merge the child and the parent
communicators, resulting in a single expanded user-defined intracommunicator.
I know MPI_COMM_WORLD is a constant which is statically initialized during
the MPI_Init call, but is there a way to update the value of MPI_COMM_WORLD
at runtime to reflect this expanded set of processes? Is it possible to
somehow reinitialize MPI_COMM_WORLD using the ompi_comm_init() function?
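
For reference, a minimal sketch of the spawn-and-merge pattern described above
(the "./worker" executable and the count of 2 children are placeholders):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children, everyone;
    MPI_Init(&argc, &argv);

    /* parent side: spawn 2 children, then merge the resulting
       intercommunicator into one big intracommunicator */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(children, 0, &everyone);

    /* 'everyone' now spans parents and children, but MPI_COMM_WORLD itself
       is unchanged; the children would call MPI_Comm_get_parent() and
       MPI_Intercomm_merge(..., 1, ...) on their side */

    MPI_Finalize();
    return 0;
}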

Regards,
Rajesh


Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Ralph Castain
If you wanted it to use eth1, your other option would be to simply tell it
to do so using the mca param. I believe it is something like -mca
oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0

You may only need the latter since you only have the two interfaces.
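
For example (a sketch only; the interface name and the rest of the command
line should match your own setup, e.g. the hello_c run from earlier in this
thread):

  mpirun -mca oob_tcp_if_include eth1 -machinefile hostfile -np 2 hello_c
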
Ralph



On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)"  wrote:

> Interesting.  Would you mind sharing your patch?
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users 
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
> 
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
> 
> On Linux 2.6 I have read that after a failed connect system call the
> next call to connect can immediately return ECONNABORTED and not try to
> actually connect, the next call to connect will then work. So I changed
> mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
> connect again. The hello_c example script is now working.
> 
> I don't think this has solved the underlying cause as to way connect is
> failing in the first place but at least now I move on to the next step.
> My best guess at the moment is that it is using eth0 initially when I
> want it to use eth1. This fails and then when it moves on to eth1 I run
> into the "can't call connect after it just failed bug".
> 
> --Bob
> 
> 
> Ralph H Castain wrote:
>> Hi Bob
>> 
>> I'm afraid the person most familiar with the oob subsystem recently
>> left the project, so we are somewhat hampered at the moment. I don't
>> recognize the "Software caused connection abort" error message - it
>> doesn't appear to be one of ours (at least, I couldn't find it
>> anywhere in our code base, though I can't swear it isn't there in some
> 
>> dark corner), and I don't find it in my own sys/errno.h file.
>> 
>> With those caveats, all I can say is that something appears to be
>> blocking the connection from your remote node back to the head node.
>> Are you sure both nodes are available on IPv4 (since you disabled
>> IPv6)? Can you try ssh'ing to the remote node and doing a ping to the
>> head node using the IPv4 interface?
>> 
>> Do you have another method you could use to check and see if max14
>> will accept connections from max15? If I interpret the error message
>> correctly, it looks like something in the connect handshake is being
>> aborted. We try a couple of times, but then give up and try other
>> interfaces - since no other interface is available, you get that other
> error message and we abort.
>> 
>> Sorry I can't be more help - like I said, this is now a weak spot in
>> our coverage that needs to be rebuilt.
>> 
>> Ralph
>>  
>> 
>> 
>> On 11/28/07 2:41 PM, "Bob Soliday"  wrote:
>> 
>>> I am new to openmpi and have a problem that I cannot seem to solve.
>>> I am trying to run the hello_c example and I can't get it to work.
>>> I compiled openmpi with:
>>> 
>>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
> 
>>> --with-openib
>>> 
>>> The hostname file contains the local host and one other node. When I
>>> run it I get:
>>> 
>>> 
>>> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun
>>> -- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2
>>> hello_c [max14:31465] [0,0,0] accepting connections via event library
> 
>>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1] accepting connections via event library
>>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466]
>>> [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152
>>> to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0]
>>> mca_oob_tcp_peer_complete_connect:
>>> sending ack, 0
>>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
>>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14
>>> nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466]
>>> [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1
>>> sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466]
>>> [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0]
> 
>>> mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0]
>>> mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on
>>> host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222]
>>> [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
>>> 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
>>> to
>>> 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
>>> to
>>> 192.168.1.14:38852 failed, connecting ove

Re: [OMPI users] Reinitialize MPI_COMM_WORLD

2007-11-29 Thread Edgar Gabriel
No, unfortunately there is no way to do that. In fact, each set of child
processes which you spawn has its own MPI_COMM_WORLD. MPI_COMM_WORLD is
static and there is no way to change it at runtime...


Edgar

Rajesh Sudarsan wrote:

Hi,

I have simple MPI program that uses MPI_comm_spawn to create additional 
child processes. 
Using  MPI_Intercomm_merge, I merge the child and the parent communicator resulting in a single expanded user 
defined intracommunicator. I know MPI_COMM_WORLD is a constant which is 
statically initialized during MPI_Init call. But 
is there a way to update the value of MPI_COMM_WORLD at runtime 
to reflect this expanded set of processes? Is it possible to some how 
reinitialize MPI_COMM_WORLD using ompi_comm_init() function?


Regards,
Rajesh




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

Thanks, this works. I have now removed my change to oob_tcp_peer.c.

--Bob Soliday

Ralph Castain wrote:

If you wanted it to use eth1, your other option would be to simply tell it
to do so using the mca param. I believe it is something like -mca
oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0

You may only need the latter since you only have the two interfaces.
Ralph



On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)"  wrote:


Interesting.  Would you mind sharing your patch?

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Bob Soliday
Sent: Thursday, November 29, 2007 11:35 AM
To: Ralph H Castain
Cc: Open MPI Users 
Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

I solved the problem by making a change to
orte/mca/oob/tcp/oob_tcp_peer.c

On Linux 2.6 I have read that after a failed connect system call the
next call to connect can immediately return ECONNABORTED and not try to
actually connect, the next call to connect will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
connect again. The hello_c example script is now working.

I don't think this has solved the underlying cause as to way connect is
failing in the first place but at least now I move on to the next step.
My best guess at the moment is that it is using eth0 initially when I
want it to use eth1. This fails and then when it moves on to eth1 I run
into the "can't call connect after it just failed bug".

--Bob


Ralph H Castain wrote:

Hi Bob

I'm afraid the person most familiar with the oob subsystem recently
left the project, so we are somewhat hampered at the moment. I don't
recognize the "Software caused connection abort" error message - it
doesn't appear to be one of ours (at least, I couldn't find it
anywhere in our code base, though I can't swear it isn't there in some
dark corner), and I don't find it in my own sys/errno.h file.

With those caveats, all I can say is that something appears to be
blocking the connection from your remote node back to the head node.
Are you sure both nodes are available on IPv4 (since you disabled
IPv6)? Can you try ssh'ing to the remote node and doing a ping to the
head node using the IPv4 interface?

Do you have another method you could use to check and see if max14
will accept connections from max15? If I interpret the error message
correctly, it looks like something in the connect handshake is being
aborted. We try a couple of times, but then give up and try other
interfaces - since no other interface is available, you get that other

error message and we abort.

Sorry I can't be more help - like I said, this is now a weak spot in
our coverage that needs to be rebuilt.

Ralph
 



On 11/28/07 2:41 PM, "Bob Soliday"  wrote:


I am new to openmpi and have a problem that I cannot seem to solve.
I am trying to run the hello_c example and I can't get it to work.
I compiled openmpi with:

./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
--with-openib

The hostname file contains the local host and one other node. When I
run it I get:


[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun
-- debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2
hello_c [max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2 [max14:31466]
[0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152
to: 192.168.2.14:38852 [max14:31466] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect:
sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14
nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466]
[0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1
sndbuf 262142 rcvbuf 262142 flags 0802 [max14:31466]
[0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max14:31466] [0,0,1]-[0,0,0]
mca_oob_tcp_send: tag 2 [max14:31466] [0,0,1]-[0,0,0]
mca_oob_tcp_recv: tag 2 Daemon [0,0,1] checking in as pid 31466 on
host max14 [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2 [max15:28222]
[0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
to
192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost [max14:31466] [0,0,1]
orted_recv_pls: received message from [0,0,0] [max14:31466] [0,0,1]
orted_recv_pls: received kill_local_procs [max14:3146