This seems to fix the problem when using your example on my cluster - please 
let me know if it solves things for you

Attachment: oob.diff
Description: Binary data



> On Jul 14, 2015, at 6:08 AM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> 
> wrote:
> 
> I will happily test any patch you send me to fix this problem.
> 
> Thanks,
> 
> Martin
> 
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: July 13, 2015 22:55
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between 
> two different machines
> 
> I see the problem - it's a race condition, actually. I'll try to provide a 
> patch for you to test, if you don't mind.
> 
> 
>> On Jul 13, 2015, at 3:03 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> 
>> wrote:
>> 
>> Thanks Ralph for this quick response.
>> 
>> In the two attachments you will find the output I got when running the 
>> following commands:
>> 
>> [audet@fn1 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleserver 2>&1 | tee server_out.txt
>> 
>> [audet@linux15 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleclient '2444427264.0;tcp://172.17.15.20:56377+2444427265.0;tcp://172.17.15.20:34776:300' 2>&1 | tee client_out.txt
>> 
>> Martin
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] On Behalf Of Ralph Castain 
>> [r...@open-mpi.org]
>> Sent: Monday, July 13, 2015 5:29 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail   
>> between two different machines
>> 
>> Try running it with "-mca oob_base_verbose 100" on both client and server - 
>> it will tell us why the connection was refused.
>> 
>> 
>>> On Jul 13, 2015, at 2:14 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> 
>>> wrote:
>>> 
>>> Hi OMPI_Developers,
>>> 
>>> It seems that I am unable to establish MPI communication between two 
>>> independently started MPI programs using the simplest client/server call 
>>> sequence I can imagine (see the two attached files) when the client and 
>>> server processes are started on different machines. Note that I have no 
>>> problem when the client and server programs run on the same machine.
>>> 
>>> For example if I do the following on the server machine (running on fn1):
>>> 
>>> [audet@fn1 mpi]$ mpicc -Wall simpleserver.c -o simpleserver
>>> [audet@fn1 mpi]$ mpiexec -n 1 ./simpleserver
>>> Server port = '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>>> 
>>> The server prints its port (created with MPI_Open_port()) and waits for a 
>>> connection by calling MPI_Comm_accept().
>>> 
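>>> For reference, the accept side reduces to roughly the following call 
>>> sequence (only a minimal sketch along the lines of the attached 
>>> simpleserver.c; the use of MPI_COMM_SELF and the exact output formatting 
>>> are just illustrative):
>>> 
>>>   #include <stdio.h>
>>>   #include <mpi.h>
>>> 
>>>   int main(int argc, char **argv)
>>>   {
>>>       char port_name[MPI_MAX_PORT_NAME];
>>>       MPI_Comm client;
>>> 
>>>       MPI_Init(&argc, &argv);
>>> 
>>>       /* Open a port and advertise it on stdout. */
>>>       MPI_Open_port(MPI_INFO_NULL, port_name);
>>>       printf("Server port = '%s'\n", port_name);
>>>       fflush(stdout);
>>> 
>>>       /* Block until a client calls MPI_Comm_connect() with this port. */
>>>       MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
>>> 
>>>       MPI_Comm_disconnect(&client);
>>>       MPI_Close_port(port_name);
>>>       MPI_Finalize();
>>>       return 0;
>>>   }
>>> 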
>>> Now on the client machine (running on linux15) if I compile the client and 
>>> run it with the above port address on the command line, I get:
>>> 
>>> [audet@linux15 mpi]$ mpicc -Wall simpleclient.c -o simpleclient
>>> [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient 
>>> '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>>> trying to connect...
>>> ------------------------------------------------------------
>>> A process or daemon was unable to complete a TCP connection to 
>>> another process:
>>> Local host:    linux15
>>> Remote host:   linux15
>>> This is usually caused by a firewall on the remote host. Please check 
>>> that any firewall (e.g., iptables) has been disabled and try again.
>>> ------------------------------------------------------------
>>> [linux15:24193] [[13075,0],0]-[[46606,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 16
>>> 
>>> And then I have to stop the client program by pressing ^C (and also the 
>>> server, which doesn't seem affected).
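>>> 
>>> The client side is just the mirror image, taking the port string printed 
>>> by the server as its first command-line argument (again only a minimal 
>>> sketch along the lines of the attached simpleclient.c, not the exact file):
>>> 
>>>   #include <stdio.h>
>>>   #include <mpi.h>
>>> 
>>>   int main(int argc, char **argv)
>>>   {
>>>       MPI_Comm server;
>>> 
>>>       MPI_Init(&argc, &argv);
>>> 
>>>       /* argv[1] is the string printed by MPI_Open_port() on the server. */
>>>       printf("trying to connect...\n");
>>>       fflush(stdout);
>>>       MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
>>>       printf("connected\n");
>>> 
>>>       MPI_Comm_disconnect(&server);
>>>       MPI_Finalize();
>>>       return 0;
>>>   }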
>>> 
>>> What's wrong?
>>> 
>>> And I am almost sure there is no firewall running on linux15.
>>> 
>>> This is not the first MPI client/server application I have developed (with 
>>> both Open MPI and mpich). These simple MPI client/server programs work well 
>>> with mpich (version 3.1.3).
>>> 
>>> This problem happens with both Open MPI 1.8.3 and 1.8.6.
>>> 
>>> linux15 and fn1 both run Fedora Core 12 Linux (64-bit) and are connected 
>>> by Gigabit Ethernet (the normal network).
>>> 
>>> And again, if the client and server run on the same machine (either fn1 or 
>>> linux15), no such problem happens.
>>> 
>>> Thanks in advance,
>>> 
>>> Martin Audet
>>> 
>>> <simpleserver.c><simpleclient.c>
>> 
>> <server_out.txt><client_out.txt>
> 
