My bad for the confusion; I misread you and miswrote my reply.

I will investigate this again. Strictly speaking, the clients can only start after the server has first written the port info to a file. If you start a client right after the server starts, it might use incorrect/outdated info and cause the whole test to hang. I will start reproducing the hang.
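To make that concrete, here is a minimal sketch of what I mean (this is not the test itself; the file name "server_port.txt" and the trailing-newline "write finished" convention are made up for illustration): the client polls until the server has written a complete port line, and only then calls MPI_Comm_connect().

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME + 2];
    char *nl = NULL;
    MPI_Comm intercomm;
    FILE *f;

    MPI_Init(&argc, &argv);

    /* poll until the port file exists and holds a complete line;
     * the trailing newline marks the end of the server's write */
    while (nl == NULL) {
        f = fopen("server_port.txt", "r");
        if (f != NULL) {
            if (fgets(port, sizeof(port), f) != NULL)
                nl = strchr(port, '\n');
            fclose(f);
        }
        if (nl == NULL)
            sleep(1);   /* not (fully) written yet, retry */
    }
    *nl = '\0';

    /* the port info is now known to be complete and current */
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    /* ... */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}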
Cheers,

Gilles

On Tuesday, July 19, 2016, M. D. <matus.dobro...@gmail.com> wrote:
> Yes, I understand it, but I think this is exactly the situation you are
> talking about. In my opinion, the test is doing exactly what you said:
> when a new player is willing to join, the other players must invoke
> MPI_Comm_accept().
> All *other* players must invoke MPI_Comm_accept(). Only the last client
> (in this case the last player that wants to join) does not invoke
> MPI_Comm_accept(), because this client invokes only MPI_Comm_connect().
> It is connecting to the communicator in which all the other players are
> already involved, and therefore this last client doesn't have to invoke
> MPI_Comm_accept().
>
> Am I still missing something in my reasoning?
>
> Matus
>
> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>
>> Here is what the client is doing:
>>
>> printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank);
>>
>> for (i = rank; i < num_clients; i++)
>> {
>>     /* client performs a collective accept */
>>     CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm));
>>
>>     printf("CLIENT: connected to server on port\n");
>>     [...]
>> }
>>
>> 2) has rank 1 /* and 3) has rank 2 */, so unless you run 2) with
>> num_clients=2, MPI_Comm_accept() is never called, hence my analysis of
>> the crash/hang.
>>
>> I understand what you are trying to achieve. Keep in mind that
>> MPI_Comm_accept() is a collective call, so when a new player is willing
>> to join, the other players must invoke MPI_Comm_accept(), and it is up
>> to you to make sure that happens.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 7/19/2016 5:48 PM, M. D. wrote:
>>
>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>
>>> MPI_Comm_accept must be called by all the tasks of the local
>>> communicator.
>>>
>> Yes, that's how I understand it. In the source code of the test, all the
>> tasks call MPI_Comm_accept - the server and also the relevant clients.
>>
>>> so if you
>>>
>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
>>> and an exited task (2).
>>>
>> This is not true in my opinion, because of the above-mentioned fact that
>> MPI_Comm_accept is called by all the tasks of the local communicator.
>>
>>> /*
>>> strictly speaking, there is a race condition: if 2) has exited, then
>>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>> if 2) has not yet exited, then the test will hang, because 2) does not
>>> invoke MPI_Comm_accept.
>>> */
>>>
>> Task 2) does not exit, because of the blocking call of MPI_Comm_accept.
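For reference, here is the join protocol the two posts above are debating, condensed into a single sketch: the newcomer alone calls MPI_Comm_connect(), while the server and every player that has already joined call MPI_Comm_accept() collectively over their current (merged) intracomm; both sides then merge the new intercommunicator. This is illustrative, not the exact test code: cleanup of intermediate communicators is omitted, and the port exchange is a bare file read (a real client should poll, as in the earlier sketch).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int num_clients = atoi(argv[1]);
    int is_server = atoi(argv[2]);
    int rank = 0, size, i;
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intracomm, intercomm, merged;
    FILE *f;

    MPI_Init(&argc, &argv);
    intracomm = MPI_COMM_WORLD;   /* a singleton: just this task */

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        f = fopen("server_port.txt", "w");   /* publish the port */
        fprintf(f, "%s\n", port);
        fclose(f);
    } else {
        f = fopen("server_port.txt", "r");   /* a real client should poll */
        fgets(port, sizeof(port), f);
        port[strcspn(port, "\n")] = '\0';
        fclose(f);
        /* the newcomer: connect, then merge to become part of the game */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
        MPI_Intercomm_merge(intercomm, 1, &merged);   /* high: ranked last */
        intracomm = merged;
        MPI_Comm_rank(intracomm, &rank);
    }

    /* the server (rank 0) accepts num_clients times; a client that came
     * in with rank r still has to accept the num_clients - r later
     * arrivals, because accept is collective over everyone in the game */
    for (i = rank; i < num_clients; i++) {
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &intercomm);
        MPI_Intercomm_merge(intercomm, 0, &merged);   /* low: keep ranks */
        intracomm = merged;
    }

    MPI_Comm_size(intracomm, &size);
    MPI_Comm_rank(intracomm, &rank);
    printf("task %d of %d: everyone has joined\n", rank, size);

    MPI_Finalize();
    return 0;
}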
>>> There are different ways of seeing things:
>>>
>>> 1) this is an incorrect usage of the test; the number of clients should
>>> be the same everywhere
>>>
>>> 2) task 2) should not exit (because it did not call
>>> MPI_Comm_disconnect()), and the test should hang when starting task 3),
>>> because task 2) does not call MPI_Comm_accept()
>>>
>> ad 1) I am sorry, but maybe I do not understand what you mean. In my
>> previous post I wrote that the number of clients is the same in every
>> mpirun instance.
>> ad 2) it is the same as above.
>>
>>> I do not know how you want to spawn your tasks.
>>>
>>> If 2) and 3) do not need to communicate with each other (they only
>>> communicate with 1)), then you can simply MPI_Comm_accept(MPI_COMM_WORLD)
>>> in 1).
>>>
>>> If 2) and 3) need to communicate with each other, it would be much easier
>>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1), so there is
>>> only one intercommunicator with all the tasks.
>>>
>> My aim is that all the tasks need to communicate with each other. I am
>> implementing a distributed application: a game with several players
>> communicating with each other via MPI. It should work as follows: first,
>> a player creates a game and waits for other players to connect to it. The
>> other players, on different computers (in the same network), can join this
>> game. When they are connected, they should be able to play the game
>> together.
>> I hope it is clear what my idea is. If it is not, just ask me, please.
>>
>>> The current test program grows the intercomm incrementally, which does
>>> require extra steps for synchronization.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>> Cheers,
>>
>> Matus
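For comparison, a minimal sketch of the spawn-once alternative suggested above: the first player spawns all the others in a single MPI_Comm_spawn() call, so one intercommunicator (and a single merge) covers every task from the start. The program name "./player" and the hard-coded player count are illustrative; placing the spawned players on other machines would additionally require a hostfile or an MPI_Info with host information.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int num_players = 4, rank, size;
    MPI_Comm parent, intercomm, game;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* the first player: spawn the others in one collective call */
        MPI_Comm_spawn("./player", MPI_ARGV_NULL, num_players - 1,
                       MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm,
                       MPI_ERRCODES_IGNORE);
        MPI_Intercomm_merge(intercomm, 0, &game);   /* ranked first */
    } else {
        /* spawned players: merge with the parent's side */
        MPI_Intercomm_merge(parent, 1, &game);      /* ranked after */
    }

    MPI_Comm_rank(game, &rank);
    MPI_Comm_size(game, &size);
    printf("player %d of %d\n", rank, size);        /* all can talk now */

    MPI_Comm_disconnect(&game);
    MPI_Finalize();
    return 0;
}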
>>> On 7/19/2016 4:37 PM, M. D. wrote:
>>>
>>> Hi,
>>> thank you for your interest in this topic.
>>>
>>> So, I normally run the test as follows.
>>> First, I run the "server" (the second parameter is 1):
>>>
>>> mpirun -np 1 ./singleton_client_server number_of_clients 1
>>>
>>> Then I run the corresponding number of "clients" via the following command:
>>>
>>> mpirun -np 1 ./singleton_client_server number_of_clients 0
>>>
>>> So, for example, with 3 clients I do:
>>>
>>> mpirun -np 1 ./singleton_client_server 3 1
>>> mpirun -np 1 ./singleton_client_server 3 0
>>> mpirun -np 1 ./singleton_client_server 3 0
>>> mpirun -np 1 ./singleton_client_server 3 0
>>>
>>> It means you are right: there should be the same number of clients in
>>> each mpirun instance.
>>>
>>> The test does not involve MPI_Comm_disconnect(), but the problem in the
>>> test occurs earlier: some of the clients (in most cases actually the
>>> last client) sometimes cannot connect to the server, and therefore all
>>> the clients and the server hang (waiting for the connections with the
>>> last client(s)).
>>>
>>> So, the behaviour of accept/connect is a bit confusing for me. If I
>>> understand your post correctly, the problem is not in the timeout value,
>>> is it?
>>>
>>> Cheers,
>>>
>>> Matus
>>>
>>> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>
>>>> How do you run the test?
>>>>
>>>> You should have the same number of clients in each mpirun instance; the
>>>> following simple shell script starts the test as I think it is supposed
>>>> to be run.
>>>>
>>>> Note the test itself is arguable, since MPI_Comm_disconnect() is never
>>>> invoked (and you will observe some related dpm_base_disconnect_init
>>>> errors).
>>>>
>>>> #!/bin/sh
>>>>
>>>> clients=3
>>>>
>>>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
>>>> for i in $(seq $clients); do
>>>>     sleep 1
>>>>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
>>>> done
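On the missing MPI_Comm_disconnect() noted above, a sketch of the teardown each task could perform before finalizing; the helper name and the idea of keeping every intercommunicator around are illustrative. Unlike MPI_Comm_free(), MPI_Comm_disconnect() is collective and waits for pending communication to complete, which is what lets the runtime shut the connections down cleanly.

#include <mpi.h>

/* hypothetical helper: disconnect from every communicator obtained via
 * accept/connect, plus the merged game communicator, then finalize */
void leave_game(MPI_Comm *game, MPI_Comm *intercomms, int n)
{
    int i;
    for (i = 0; i < n; i++)
        MPI_Comm_disconnect(&intercomms[i]);   /* collective per comm */
    MPI_Comm_disconnect(game);
    MPI_Finalize();
}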
>>>> Ralph,
>>>>
>>>> this test fails with master.
>>>>
>>>> When it runs as the "server" (the second parameter is 1),
>>>> MPI_Comm_accept() fails with a timeout.
>>>>
>>>> In ompi/dpm/dpm.c there is a hard-coded 60-second timeout:
>>>>
>>>> OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>>>>
>>>> but this is not the timeout that is triggered ... the eviction_cbfunc
>>>> timeout function is invoked, and it was set when opal_hotel_init() was
>>>> invoked in orte/orted/pmix/pmix_server.c.
>>>>
>>>> The default timeout there is 2 seconds, but in this case (the user
>>>> invokes MPI_Comm_accept), I guess the timeout should be infinite or 60
>>>> seconds (the hard-coded value described above).
>>>>
>>>> Sadly, if I set a higher timeout value (mpirun --mca
>>>> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return
>>>> when the client invokes MPI_Comm_connect().
>>>>
>>>> Could you please have a look at this?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 7/15/2016 9:20 PM, M. D. wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have a problem with a basic client-server application. I tried to run
>>>> the C program from
>>>> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
>>>> I saw this program mentioned in many discussions on your website, so I
>>>> expected that it would work properly, but after more testing I found
>>>> out that there is probably an error somewhere around connect/accept. I
>>>> have read many discussions and threads on your website, but I have not
>>>> found a problem similar to the one I am facing. It seems that nobody
>>>> has had a problem like mine. When I run this app with one server and
>>>> several clients (3, 4, 5, 6, ...), the app sometimes hangs. It hangs
>>>> when the second or a later client wants to connect to the server (it
>>>> varies: sometimes the third client hangs, sometimes the fourth,
>>>> sometimes the second, and so on).
>>>> It means the app starts to hang where the server waits in accept and
>>>> the client waits in connect, and it is not possible to continue,
>>>> because this client cannot connect to the server. It is strange,
>>>> because I observed this behaviour only in some cases... Sometimes it
>>>> works without any problems, sometimes it does not. The behaviour is
>>>> unpredictable and not stable.
>>>>
>>>> I have installed Open MPI 1.10.2 on my Fedora 19. I have the same
>>>> problem with the Java alternative of this application; it also hangs
>>>> sometimes... I need this app in Java, but first it must work properly
>>>> in the C implementation. Because of this strange behaviour, I assume
>>>> there may be an error inside the Open MPI implementation of the
>>>> connect/accept methods. I also tried it with another version of Open
>>>> MPI, 1.8.1. However, the problem did not disappear.
>>>>
>>>> Could you help me with what can cause the problem? Maybe I did not
>>>> understand something about Open MPI (or connect/accept) and the problem
>>>> is on my side... I will appreciate any help, support, or interest in
>>>> this topic.
>>>>
>>>> Best regards,
>>>> Matus Dobrotka