This seems to fix the problem when I run your example on my cluster. Please let me know if it solves things for you.
oob.diff
Description: Binary data
> On Jul 14, 2015, at 6:08 AM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
>
> I will happily test any patch you send me to fix this problem.
>
> Thanks,
>
> Martin
>
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: July 13, 2015 22:55
> To: Open MPI Users
> Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
>
> I see the problem - it's a race condition, actually. I'll try to provide a
> patch for you to test, if you don't mind.
>
>
>> On Jul 13, 2015, at 3:03 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
>>
>> Thanks Ralph for this quick response.
>>
>> In the two attachments you will find the output I got when running the
>> following commands:
>>
>> [audet@fn1 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleserver 2>&1 | tee server_out.txt
>>
>> [audet@linux15 mpi]$ mpiexec --mca oob_base_verbose 100 -n 1 ./simpleclient '2444427264.0;tcp://172.17.15.20:56377+2444427265.0;tcp://172.17.15.20:34776:300' 2>&1 | tee client_out.txt
>>
>> Martin
>> <server_out.txt><client_out.txt>
>> ________________________________________
>> From: users [users-boun...@open-mpi.org] On Behalf Of Ralph Castain [r...@open-mpi.org]
>> Sent: Monday, July 13, 2015 5:29 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] MPI_Comm_accept() / MPI_Comm_connect() fail between two different machines
>>
>> Try running it with "-mca oob_base_verbose 100" on both client and server -
>> it will tell us why the connection was refused.
>>
>>
>>> On Jul 13, 2015, at 2:14 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
>>>
>>> Hi OMPI_Developers,
>>>
>>> It seems that I am unable to establish an MPI communication between two
>>> independently started MPI programs using the simplest client/server call
>>> sequence I can imagine (see the two attached files) when the client and
>>> server processes are started on different machines. Note that I have no
>>> problems when the client and server programs run on the same machine.
>>>
>>> For example, if I do the following on the server machine (running on fn1):
>>>
>>> [audet@fn1 mpi]$ mpicc -Wall simpleserver.c -o simpleserver
>>> [audet@fn1 mpi]$ mpiexec -n 1 ./simpleserver
>>> Server port = '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>>>
>>> The server prints its port (created with MPI_Open_port()) and waits for a
>>> connection by calling MPI_Comm_accept().
>>>
>>> Now on the client machine (running on linux15), if I compile the client and
>>> run it with the above port address on the command line, I get:
>>>
>>> [audet@linux15 mpi]$ mpicc -Wall simpleclient.c -o simpleclient
>>> [audet@linux15 mpi]$ mpiexec -n 1 ./simpleclient '3054370816.0;tcp://172.17.15.20:54458+3054370817.0;tcp://172.17.15.20:58943:300'
>>> trying to connect...
>>> ------------------------------------------------------------
>>> A process or daemon was unable to complete a TCP connection to
>>> another process:
>>>   Local host:    linux15
>>>   Remote host:   linux15
>>> This is usually caused by a firewall on the remote host. Please check
>>> that any firewall (e.g., iptables) has been disabled and try again.
>>> ------------------------------------------------------------
>>> [linux15:24193] [[13075,0],0]-[[46606,0],0] mca_oob_tcp_peer_send_handler: invalid connection state (6) on socket 16
>>>
>>> And then I have to stop the client program by pressing ^C (and also the
>>> server, which doesn't seem affected).
>>>
>>> What's wrong?
>>>
>>> And I am almost sure there is no firewall running on linux15.
>>>
>>> It is not the first MPI client/server application I am developing (with
>>> both OpenMPI and mpich).
>>> These simple MPI client/server programs work well with mpich (version
>>> 3.1.3).
>>>
>>> This problem happens with both OpenMPI 1.8.3 and 1.8.6.
>>>
>>> linux15 and fn1 both run Fedora Core 12 Linux (64-bit) and are
>>> connected by Gigabit Ethernet (the normal network).
>>>
>>> And again, if client and server run on the same machine (either fn1 or
>>> linux15), no such problem happens.
>>>
>>> Thanks in advance,
>>>
>>> Martin Audet
>>> <simpleserver.c><simpleclient.c>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2015/07/27271.php
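For anyone hitting the same error text, one way to rule out the firewall explanation before looking at the race condition is to check from the client machine whether the TCP endpoint advertised in the server's port string is reachable at all. The C sketch below is only an illustration under stated assumptions: the helper names first_tcp_endpoint() and tcp_probe() are hypothetical and not part of Open MPI, and the parser assumes the port string keeps the "tcp://<IPv4 address>:<port>" shape shown in this thread, which is an internal Open MPI detail rather than a documented interface.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical helper (not an Open MPI API): copy the first
 * "tcp://host:port" endpoint found in s into host and port.
 * Assumes the shape seen in this thread, e.g.
 * "...;tcp://172.17.15.20:54458+...". Returns 0 on success, -1 otherwise. */
int first_tcp_endpoint(const char *s, char *host, size_t hostlen,
                       char *port, size_t portlen)
{
    assert(s != NULL && host != NULL && port != NULL);

    const char *p = strstr(s, "tcp://");
    if (p == NULL)
        return -1;
    p += strlen("tcp://");

    /* The host runs up to the next ':' (IPv4 or plain hostname only). */
    const char *colon = strchr(p, ':');
    if (colon == NULL || colon == p)
        return -1;
    size_t hlen = (size_t)(colon - p);
    if (hlen >= hostlen)
        return -1;
    memcpy(host, p, hlen);
    host[hlen] = '\0';

    /* The port is the run of decimal digits after that ':'. */
    size_t plen = strspn(colon + 1, "0123456789");
    if (plen == 0 || plen >= portlen)
        return -1;
    memcpy(port, colon + 1, plen);
    port[plen] = '\0';
    return 0;
}

/* Hypothetical helper: attempt a blocking TCP connect to host:port.
 * Returns 0 if some resolved address accepted the connection, -1 otherwise. */
int tcp_probe(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int rc = -1;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            rc = 0;
        close(fd);
        if (rc == 0)
            break;
    }
    freeaddrinfo(res);
    return rc;
}
```

A small main() on the client machine could pass the port string printed by the server to first_tcp_endpoint() and then call tcp_probe() on the result while the server is still waiting. A successful probe only shows that the address is reachable at the TCP level, so it can rule out a firewall but says nothing about the race condition itself.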