MPI_Comm_accept() must be called by all the tasks of the local communicator (it is collective over the communicator passed to it).

So if you run

1) mpirun -np 1 ./singleton_client_server 2 1

2) mpirun -np 1 ./singleton_client_server 2 0

3) mpirun -np 1 ./singleton_client_server 2 0

then 3) starts after 2) has exited, so on 1) the intracomm is made of 1) and an exited task (2).

/*

strictly speaking, there is a race condition: if 2) has already exited, then MPI_Comm_accept() will crash when 1) informs 2) that 3) has joined;

if 2) has not yet exited, then the test will hang because 2) does not invoke MPI_Comm_accept()

*/
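
To make this concrete, here is a condensed sketch of the accept/connect/merge dance. This is not the actual singleton_client_server.c source; in particular, how the port string reaches the clients is my assumption (I use MPI_Publish_name()/MPI_Lookup_name() with a made-up service name, which may require ompi-server when the jobs come from separate mpirun instances):

/*
 * Condensed sketch, NOT the actual singleton_client_server.c source.
 * The point: every task already in the merged intracomm -- the server AND
 * the previously connected clients -- must call MPI_Comm_accept() again for
 * each new client, so a client that exits early (or never re-enters accept)
 * breaks the next join.
 */
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int nclients  = atoi(argv[1]);      /* same value on every task       */
    int is_server = atoi(argv[2]);      /* 1 on the server, 0 on clients  */
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm intra, inter;
    int joined;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &intra);   /* a singleton: just this task */

    if (is_server) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("sketch-service", MPI_INFO_NULL, port);  /* made-up name */
    } else {
        MPI_Lookup_name("sketch-service", MPI_INFO_NULL, port);
        /* collective with all tasks already in the server-side intracomm */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Comm_free(&intra);
        MPI_Intercomm_merge(inter, 1 /* high */, &intra);
        MPI_Comm_free(&inter);
    }

    /* everyone connected so far keeps accepting until all clients joined */
    MPI_Comm_size(intra, &joined);          /* 1 + number of clients so far */
    for (int i = joined - 1; i < nclients; i++) {
        MPI_Comm merged;
        /* collective over 'intra'; root 0 (the server) supplies the port */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, intra, &inter);
        MPI_Intercomm_merge(inter, 0 /* low */, &merged);
        MPI_Comm_free(&inter);
        MPI_Comm_free(&intra);
        intra = merged;
    }

    if (is_server) {
        MPI_Unpublish_name("sketch-service", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    }
    MPI_Comm_free(&intra);   /* a cleaner version would MPI_Comm_disconnect() */
    MPI_Finalize();
    return 0;
}

The thing to notice is the final loop: it runs on the server and on every client that has already been merged in, so task 2) is expected to stay alive and keep calling MPI_Comm_accept() until the last client has joined.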


There are different ways of seeing this:

1) this is an incorrect usage of the test: the number of clients should be the same everywhere

2) task 2) should not exit (because it did not call MPI_Comm_disconnect()), and the test should hang when starting task 3) because task 2) does not call MPI_Comm_accept()


I do not know how you want to spawn your tasks.

If 2) and 3) do not need to communicate with each other (they only communicate with 1)), then you can simply MPI_Comm_accept() over MPI_COMM_WORLD in 1).

If 2) and 3) need to communicate with each other, it would be much easier to MPI_Comm_spawn() or MPI_Comm_spawn_multiple() only once in 1), so there is only one inter-communicator containing all the tasks; see the sketch below.
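
Here is a rough sketch of that spawn-once option; the "./singleton_client" executable name and the hard-coded client count are placeholders of mine, not names taken from the actual test:

/*
 * Spawn-once sketch (placeholders: "./singleton_client", nclients).
 * One MPI_Comm_spawn() returns a single inter-communicator whose remote
 * group is all the clients; the clients already share their own
 * MPI_COMM_WORLD, and a matching MPI_Intercomm_merge() on both sides
 * gives one big intracomm if that is more convenient.
 */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm clients, everyone;
    int nclients = 3;
    int errcodes[3];

    MPI_Init(&argc, &argv);

    /* 1) is the only parent, so spawn over MPI_COMM_SELF */
    MPI_Comm_spawn("./singleton_client", MPI_ARGV_NULL, nclients,
                   MPI_INFO_NULL, 0, MPI_COMM_SELF, &clients, errcodes);

    /* optional: one intracomm with the server and all the clients;
     * the clients make the matching call on MPI_Comm_get_parent() */
    MPI_Intercomm_merge(clients, 0 /* low */, &everyone);

    /* ... real work over 'everyone' or 'clients' ... */

    MPI_Comm_free(&everyone);
    MPI_Comm_disconnect(&clients);
    MPI_Finalize();
    return 0;
}

On the client side, MPI_Comm_get_parent() returns the same inter-communicator.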


The current test program grows the intercomm incrementally (roughly the pattern of the first sketch above), which does require extra synchronization steps.


Cheers,


Gilles

On 7/19/2016 4:37 PM, M. D. wrote:
Hi,
thank you for your interest in this topic.

So, I normally run the test as follows:
Firstly, I run the "server" (second parameter is 1):
mpirun -np 1 ./singleton_client_server number_of_clients 1

Secondly, I run the corresponding number of "clients" via the following command:
mpirun -np 1 ./singleton_client_server number_of_clients 0

So, for example with 3 clients I do:
mpirun -np 1 ./singleton_client_server 3 1
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0
mpirun -np 1 ./singleton_client_server 3 0

It means you are right - there should be the same number of clients in each mpirun instance.

The test does not involve MPI_Comm_disconnect(), but the problem appears earlier: some of the clients (in most cases the last one) sometimes cannot connect to the server, and therefore all the clients and the server hang, waiting for the connection with the last client(s).

So the behaviour of the accept/connect methods is a bit confusing to me.
If I understand your post correctly, the problem is not in the timeout value, is it?

Cheers,

Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:

    How do you run the test?

    You should have the same number of clients in each mpirun instance; the
    following simple shell script starts the test as I think it is supposed
    to be run.

    Note the test itself is arguable since MPI_Comm_disconnect() is never
    invoked (you will observe some related dpm_base_disconnect_init errors).


    #!/bin/sh

    clients=3

    screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
    for i in $(seq $clients); do
        sleep 1
        screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
    done


    Ralph,


    this test fails with master.

    when the "server" (second parameter is 1), MPI_Comm_accept() fails
    with a timeout.

    In ompi/dpm/dpm.c, there is a hard-coded 60-second timeout:

    OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);

    but this is not the timeout that is triggered ...

    The eviction_cbfunc timeout function is invoked instead; it was set when
    opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c.


    The default timeout is 2 seconds, but in this case (the user invokes
    MPI_Comm_accept()), I guess the timeout should be infinite or 60 seconds
    (the hard-coded value described above).

    Sadly, if I set a higher timeout value (mpirun --mca
    orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return when
    the client invokes MPI_Comm_connect().


    Could you please have a look at this?


    Cheers,


    Gilles


    On 7/15/2016 9:20 PM, M. D. wrote:
    Hello,

    I have a problem with a basic client-server application. I tried to run
    the C program from this website:
    https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
    I saw this program mentioned in many discussions on your website, so I
    expected it to work properly, but after more testing I found out that
    there is probably an error somewhere in the connect/accept methods. I have
    read many discussions and threads on your website, but I have not found a
    problem similar to the one I am facing; it seems that nobody has had a
    problem like mine. When I run this app with one server and several clients
    (3, 4, 5, 6, ...), the app sometimes hangs. It hangs when the second or a
    later client wants to connect to the server (it varies: sometimes the
    third client hangs, sometimes the fourth, sometimes the second, and so on).
    So the app starts to hang where the server waits in accept and the client
    waits in connect, and it is not possible to continue because this client
    cannot connect to the server. It is strange, because I observed this
    behaviour only in some cases... Sometimes it works without any problems,
    sometimes it does not. The behaviour is unpredictable and not stable.

    I have installed Open MPI 1.10.2 on my Fedora 19. I have the same problem
    with the Java alternative of this application; it also hangs sometimes...
    I need this app in Java, but first it must work properly in the C
    implementation. Because of this strange behaviour I assume that there may
    be an error inside the Open MPI implementation of the connect/accept
    methods. I also tried it with another version of Open MPI, 1.8.1, but the
    problem did not disappear.

    Could you help me find out what can cause the problem? Maybe I did not
    understand something about Open MPI (or connect/accept) and the problem is
    on my side... I will appreciate any help, support, or interest in this
    topic.

    Best regards,
    Matus Dobrotka








