Thank you all for your replies.
I have now tested the code with various setups and versions. First of
all, the tcp btl seems to work fine (I had the patience to check ~10 runs);
the openib btl is the problem. I have also compiled with the Intel compiler,
and the story is the same as with gcc.
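(For completeness: the standard way to pin the transport is the --mca btl
switch, e.g.

mpirun -np 1 --mca btl tcp,self ./mpi-receiver

for a tcp-only run, or --mca btl openib,self,sm to go through openib; that is
roughly how I switched between the two here.)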
I have then tested many OpenMPI versions from 1.7.5 to 1.10.0 using
bisection ;) Versions up to and including 1.8.3 worked fine (at least well
past the 5 iterations at which the hang appears - I checked around 10), so
the problem was likely introduced in version 1.8.4. Actually, version 1.8.4
was the only one to spit out an interesting warning on the receiver side at
the moment it hung:
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
event_base_loop can run on each event_base at once.
which may or may not be of importance in this particular case ;)
So to summarize, the problem appeared in the openib btl in version 1.8.4.
Does anybody have any more ideas?
Thanks!
Marcin
On 09/16/2015 05:59 PM, Burns, Andrew J CTR USARMY RDECOM ARL (US) wrote:
Have you attempted using 2 cores per process? I have noticed that
MPI_Comm_accept sometimes behaves strangely in single-core configurations.
I have a program that makes use of Comm_accept/connect, and I also call
MPI_Intercomm_merge. So, you may want to look into that call as well.
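Something along these lines - a rough, simplified sketch, not my actual code,
and the port handling is assumed to match yours:

/* accept a connection and merge the resulting intercommunicator
   into a single intracommunicator */
MPI_Comm inter, merged;
MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
MPI_Intercomm_merge(inter, 0, &merged);   /* high = 0 on the accepting side */
/* ... communicate over 'merged' ... */
MPI_Comm_free(&merged);
MPI_Comm_disconnect(&inter);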
-Andrew Burns
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jalel Chergui
Sent: Wednesday, September 16, 2015 11:49 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] bug in MPI_Comm_accept?
With openmpi-1.7.5, the sender segfaults.
Sorry, I cannot see the problem in the code. Perhaps other people out there may help.
Jalel
On 16/09/2015 16:40, marcin.krotkiewski wrote:
I have removed the MPI_Barrier, to no avail. The same thing happens. With
verbosity added, before the receiver hangs I get the following message:
[node2:03928] mca: bml: Using openib btl to [[12620,1],0] on node node3
So it is somewhere in the openib btl module.
Marcin
On 09/16/2015 04:34 PM, Jalel Chergui wrote:
Right; in any case MPI_Finalize is necessary at the end of the receiver. The
other issue is the Barrier, which is probably invoked after the sender has
already exited, hence changing the size of the intercommunicator. Can you
comment out that line in both files?
Jalel
On 16/09/2015 16:22, Marcin Krotkiewski wrote:
But where would I put it? If I put it inside the while(1) loop, then
MPI_Comm_accept cannot be called a second time. If I put it outside of the
loop, it will never be called.
On 09/16/2015 04:18 PM, Jalel Chergui wrote:
Can you check with an MPI_Finalize in the receiver?
Jalel
On 16/09/2015 16:06, marcin.krotkiewski wrote:
I have run into a freeze / potential bug when using MPI_Comm_accept in a simple
client/server implementation. I have attached the two simplest programs I could
produce (a condensed sketch follows the list below):
1. mpi-receiver.c opens a port using MPI_Open_port and saves the port name to a
file
2. mpi-receiver enters an infinite loop and waits for connections using
MPI_Comm_accept
3. mpi-sender.c connects to that port using MPI_Comm_connect, sends one
MPI_UNSIGNED_LONG, calls barrier, and disconnects using MPI_Comm_disconnect
4. mpi-receiver reads the MPI_UNSIGNED_LONG, prints it, calls barrier,
disconnects using MPI_Comm_disconnect, and goes back to step 2 - the infinite loop
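For readers without the attachments, here is a condensed sketch of the two
programs (error checking omitted; the file name "port.txt" and other minor
details are assumptions, the attached sources are authoritative):

/* mpi-receiver.c - condensed sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);

    FILE *f = fopen("port.txt", "w");       /* publish the port name */
    fprintf(f, "%s\n", port);
    fclose(f);

    while (1) {                              /* wait for connections forever */
        MPI_Comm client;
        unsigned long val;
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
        MPI_Recv(&val, 1, MPI_UNSIGNED_LONG, 0, 0, client, MPI_STATUS_IGNORE);
        printf("received %lu\n", val);
        MPI_Barrier(client);
        MPI_Comm_disconnect(&client);
    }
    /* never reached: MPI_Close_port / MPI_Finalize would go here */
}

/* mpi-sender.c - condensed sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm server;
    unsigned long val = 42;                  /* arbitrary payload */
    MPI_Init(&argc, &argv);

    FILE *f = fopen("port.txt", "r");        /* read the published port name */
    fscanf(f, "%s", port);
    fclose(f);

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    MPI_Send(&val, 1, MPI_UNSIGNED_LONG, 0, 0, server);
    MPI_Barrier(server);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}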
Everything works fine, but only exactly 5 times. After that the receiver hangs
in MPI_Recv, right after returning from MPI_Comm_accept. That is 100%
repeatable. I have tried with Intel MPI - no such problem.
I execute the programs using OpenMPI 1.10 as follows:
mpirun -np 1 --mca mpi_leave_pinned 0 ./mpi-receiver
Do you have any clues as to what could be the reason? Am I doing something
wrong, or is it some problem with the internal state of OpenMPI?
Thanks a lot!
Marcin
--
*------------------------------------------------------------------------*
Jalel CHERGUI, LIMSI-CNRS, Bât. 508 - BP 133, 91403 Orsay cedex, FRANCE
Tel: (33 1) 69 85 81 27 ; Fax: (33 1) 69 85 80 88
Email: jalel.cher...@limsi.fr ; Web: perso.limsi.fr/chergui
*------------------------------------------------------------------------*