My bad for the confusion,

I misread your message and my reply was wrong.

I will investigate this again.

Strictly speaking, the clients can only start after the server has first
written the port info to a file.
If you start the clients right after the server starts, they might use
incorrect/outdated info and cause the whole test to hang.
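
For illustration only (this is not part of the test, and the filename below is
just an assumption), a minimal sketch of that ordering constraint: the server
opens a port and only then publishes it, and a client must wait for the file to
exist before reading it.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define PORT_FILE "server_port.txt"   /* hypothetical filename */

    /* server: open a port, then publish it for the clients */
    static void publish_port(char port[MPI_MAX_PORT_NAME])
    {
        MPI_Open_port(MPI_INFO_NULL, port);
        FILE *f = fopen(PORT_FILE, "w");
        fprintf(f, "%s\n", port);
        fclose(f);
    }

    /* client: the file may not exist yet, so poll for it instead of
       reading stale or missing info */
    static void wait_for_port(char port[MPI_MAX_PORT_NAME])
    {
        FILE *f;
        while ((f = fopen(PORT_FILE, "r")) == NULL)
            sleep(1);
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';   /* strip trailing newline */
    }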

I will start reproducing the hang.

Cheers,

Gilles

On Tuesday, July 19, 2016, M. D. <matus.dobro...@gmail.com> wrote:

> Yes, I understand that, but I think this is exactly the situation you are
> talking about. In my opinion, the test does exactly what you said - when a
> new player is willing to join, all *other* players must invoke
> MPI_Comm_accept(). Only the last client (in this case the last player which
> wants to join) does not invoke MPI_Comm_accept(), because this client
> invokes only MPI_Comm_connect(). It is connecting to the communicator in
> which all the other players are already involved, and therefore this last
> client doesn't have to invoke MPI_Comm_accept().
>
> Am I still missing something in my reasoning?
>
> Matus
>
> 2016-07-19 10:55 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>
>> here is what the client is doing:
>>
>>     printf("CLIENT: after merging, new comm: size=%d rank=%d\n", size, rank);
>>
>>     for (i = rank; i < num_clients; i++)
>>     {
>>       /* client performs a collective accept */
>>       CHK(MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm));
>>
>>       printf("CLIENT: connected to server on port\n");
>>       [...]
>>
>>     }
>>
>> 2) has rank 1 (and 3) has rank 2), so unless you run 2) with
>> num_clients=2, MPI_Comm_accept() is never called, hence my analysis of the
>> crash/hang.
>>
>>
>> I understand what you are trying to achieve; keep in mind that
>> MPI_Comm_accept() is a collective call, so when a new player is willing to
>> join, all the other players must invoke MPI_Comm_accept(), and it is up to
>> you to make sure that happens.
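>>
>> For illustration only (this sketch is not from the test), one way to make
>> sure that happens: rank 0 of the current game decides that a new player is
>> waiting and broadcasts that decision, so every current player enters the
>> collective MPI_Comm_accept() together. "intracomm" is assumed to hold the
>> players already in the game.
>>
>>     #include <mpi.h>
>>
>>     static void maybe_accept_new_player(int new_player_waiting, const char *port,
>>                                         MPI_Comm intracomm, MPI_Comm *intercomm)
>>     {
>>         /* rank 0 decides (e.g. it watches a lobby); the broadcast tells all
>>            the other players, so the accept below is entered by everyone */
>>         MPI_Bcast(&new_player_waiting, 1, MPI_INT, 0, intracomm);
>>         if (new_player_waiting)
>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, intercomm);
>>     }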
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>> On 7/19/2016 5:48 PM, M. D. wrote:
>>
>>
>>
>> 2016-07-19 10:06 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>
>>> MPI_Comm_accept must be called by all the tasks of the local
>>> communicator.
>>>
>> Yes, that's how I understand it. In the source code of the test, all the
>> tasks call MPI_Comm_accept - the server and also the relevant clients.
>>
>>> so if you
>>>
>>> 1) mpirun -np 1 ./singleton_client_server 2 1
>>>
>>> 2) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> 3) mpirun -np 1 ./singleton_client_server 2 0
>>>
>>> then 3) starts after 2) has exited, so on 1), intracomm is made of 1)
>>> and an exited task (2)
>>>
>> This is not true in my opinion, because of the above-mentioned fact that
>> MPI_Comm_accept is called by all the tasks of the local communicator.
>>
>>> /*
>>>
>>> strictly speaking, there is a race condition: if 2) has exited, then
>>> MPI_Comm_accept will crash when 1) informs 2) that 3) has joined.
>>>
>>> if 2) has not yet exited, then the test will hang because 2) does not
>>> invoke MPI_Comm_accept
>>>
>>> */
>>>
>> Task 2) does not exit, because of the blocking call of MPI_Comm_accept.
>>
>>>
>>>
>>
>>> there are different ways of seeing things:
>>>
>>> 1) this is an incorrect usage of the test; the number of clients should
>>> be the same everywhere
>>>
>>> 2) task 2) should not exit (because it did not call
>>> MPI_Comm_disconnect()), and the test should hang when starting task 3)
>>> because task 2) does not call MPI_Comm_accept()
>>>
>>>
>> ad 1) I am sorry, but maybe I do not understand what you mean - in my
>> previous post I wrote that the number of clients is the same in every
>> mpirun instance.
>> ad 2) the same as above
>>
>>> I do not know how you want to spawn your tasks.
>>>
>>> if 2) and 3) do not need to communicate with each other (they only
>>> communicate with 1)), then you can simply MPI_Comm_accept() over
>>> MPI_COMM_WORLD in 1)
>>>
>>> if 2) and 3) need to communicate with each other, it would be much easier
>>> to MPI_Comm_spawn or MPI_Comm_spawn_multiple only once in 1), so there is
>>> only one intercommunicator with all the tasks.
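>>>
>>> For illustration only (not from the test; "./player" and num_clients are
>>> assumptions), a minimal sketch of that single-spawn alternative:
>>>
>>>     #include <mpi.h>
>>>
>>>     /* in 1): spawn all the other players at once, then merge, so a single
>>>        intracommunicator contains every task */
>>>     static void spawn_all_players(int num_clients, MPI_Comm *everyone)
>>>     {
>>>         MPI_Comm intercomm;
>>>         MPI_Comm_spawn("./player", MPI_ARGV_NULL, num_clients, MPI_INFO_NULL,
>>>                        0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>>>         MPI_Intercomm_merge(intercomm, 0, everyone);
>>>     }
>>>
>>>     /* in each spawned player: retrieve the parent intercomm and merge */
>>>     static void join_as_spawned_player(MPI_Comm *everyone)
>>>     {
>>>         MPI_Comm parent;
>>>         MPI_Comm_get_parent(&parent);
>>>         MPI_Intercomm_merge(parent, 1, everyone);
>>>     }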
>>>
>> My aim is that all the tasks need to communicate with each other. I am
>> implementing a distributed application - a game with multiple players
>> communicating with each other via MPI. It should work as follows - the
>> first player creates a game and waits for other players to connect to it.
>> On different computers (in the same network), the other players can join
>> this game. When they are connected, they should be able to play this game
>> together.
>> I hope it is clear what my idea is. If it is not, just ask me, please.
>>
>>>
>>> The current test program grows the intercomm incrementally, which does
>>> require extra steps for synchronization.
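>>>
>>> (For illustration only, not taken from the test: roughly what that
>>> incremental growth looks like on the accepting side, reusing
>>> server_port_name and num_clients from the snippet above. Every pass of the
>>> loop must be executed by the server and by every client accepted in the
>>> earlier passes, which is exactly the extra synchronization mentioned.)
>>>
>>>     #include <mpi.h>
>>>
>>>     static MPI_Comm accept_all_clients(const char *server_port_name, int num_clients)
>>>     {
>>>         MPI_Comm intracomm = MPI_COMM_SELF, intercomm, merged;
>>>         int i;
>>>         for (i = 0; i < num_clients; i++)
>>>         {
>>>             /* collective over the current intracomm: the server plus the
>>>                clients accepted so far */
>>>             MPI_Comm_accept(server_port_name, MPI_INFO_NULL, 0, intracomm, &intercomm);
>>>             MPI_Intercomm_merge(intercomm, 0, &merged);
>>>             MPI_Comm_free(&intercomm);
>>>             intracomm = merged;   /* now also contains the new client */
>>>         }
>>>         return intracomm;
>>>     }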
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>> Cheers,
>>
>> Matus
>>
>>> On 7/19/2016 4:37 PM, M. D. wrote:
>>>
>>> Hi,
>>> thank you for your interest in this topic.
>>>
>>> So, I normally run the test as follows:
>>> Firstly, I run "server" (second parameter is 1):
>>> *mpirun -np 1 ./singleton_client_server number_of_clients 1*
>>>
>>> Secondly, I run the corresponding number of "clients" via the following command:
>>> *mpirun -np 1 ./singleton_client_server number_of_clients 0*
>>>
>>> So, for example with 3 clients I do:
>>> mpirun -np 1 ./singleton_client_server 3 1
>>> mpirun -np 1 ./singleton_client_server 3 0
>>> mpirun -np 1 ./singleton_client_server 3 0
>>> mpirun -np 1 ./singleton_client_server 3 0
>>>
>>> It means you are right - there should be the same number of clients in
>>> each mpirun instance.
>>>
>>> The test does not involve MPI_Comm_disconnect(), but the problem in the
>>> test occurs earlier, because some of the clients (in most cases actually
>>> the last client) sometimes cannot connect to the server, and therefore all
>>> the clients together with the server hang (waiting for the connection with
>>> the last client(s)).
>>>
>>> So, the behaviour of the accept/connect methods is a bit confusing to me.
>>> If I understand your post correctly, the problem is not in the timeout
>>> value, is it?
>>>
>>> Cheers,
>>>
>>> Matus
>>>
>>> 2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>>
>>>> How do you run the test?
>>>>
>>>> You should have the same number of clients in each mpirun instance; the
>>>> following simple shell script starts the test as I think it is supposed
>>>> to run.
>>>>
>>>> Note the test itself is arguable since MPI_Comm_disconnect() is never
>>>> invoked
>>>>
>>>> (and you will observe some related dpm_base_disconnect_init errors)
>>>>
>>>>
>>>> #!/bin/sh
>>>>
>>>> clients=3
>>>>
>>>> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
>>>>
>>>> for i in $(seq $clients); do
>>>>     sleep 1
>>>>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
>>>> done
>>>>
>>>>
>>>> Ralph,
>>>>
>>>>
>>>> this test fails with master.
>>>>
>>>> when running the "server" (second parameter is 1), MPI_Comm_accept()
>>>> fails with a timeout.
>>>>
>>>> In ompi/dpm/dpm.c, there is a hard-coded 60 second timeout
>>>>
>>>> OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>>>>
>>>> but this is not the timeout that is triggered ...
>>>>
>>>> the eviction_cbfunc timeout function is invoked, and it has been set
>>>> when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c
>>>>
>>>>
>>>> The default timeout is 2 seconds, but in this case (the user invokes
>>>> MPI_Comm_accept), I guess the timeout should be infinite or 60 seconds
>>>> (the hard-coded value described above)
>>>>
>>>> Sadly, if I set a higher timeout value (mpirun --mca
>>>> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when
>>>> the client invokes MPI_Comm_connect()
>>>>
>>>>
>>>> Could you please have a look at this?
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Gilles
>>>>
>>>> On 7/15/2016 9:20 PM, M. D. wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have a problem with a basic client-server application. I tried to run
>>>> the C program from this website:
>>>> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
>>>> I saw this program mentioned in many discussions on your website, so I
>>>> expected that it should work properly, but after more testing I found out
>>>> that there is probably an error somewhere in the connect/accept methods. I
>>>> have read many discussions and threads on your website, but I have not
>>>> found a problem similar to the one I am facing. It seems that nobody has
>>>> had a problem like mine. When I run this app with one server and several
>>>> clients (3, 4, 5, 6, ...), the app sometimes hangs. It hangs when the
>>>> second or a later client wants to connect to the server (it varies -
>>>> sometimes the third client hangs, sometimes the fourth, sometimes the
>>>> second, and so on).
>>>> So the app starts to hang where the server waits in accept and the client
>>>> waits in connect, and it is not possible to continue, because this client
>>>> cannot connect to the server. It is strange, because I observed this
>>>> behaviour only in some cases... Sometimes it works without any problems,
>>>> sometimes it does not. The behaviour is unpredictable and not stable.
>>>>
>>>> I have installed Open MPI 1.10.2 on my Fedora 19. I have the same
>>>> problem with the Java version of this application - it also hangs
>>>> sometimes... I need this app in Java, but first it must work properly in
>>>> the C implementation. Because of this strange behaviour, I assume there
>>>> may be an error inside the Open MPI implementation of the connect/accept
>>>> methods. I also tried it with another version of Open MPI, 1.8.1;
>>>> however, the problem did not disappear.
>>>>
>>>> Could you help me figure out what can cause the problem? Maybe I did not
>>>> understand something about Open MPI (or connect/accept) and the problem is
>>>> on my side... I will appreciate any help, support, or interest in this
>>>> topic.
>>>>
>>>> Best regards,
>>>> Matus Dobrotka
>>>>
>>>>
>>
>