Hi,

thank you for your interest in this topic. I normally run the test as follows. First, I start the "server" (second parameter is 1):

    mpirun -np 1 ./singleton_client_server number_of_clients 1

Then I start the corresponding number of "clients" with the following command (second parameter is 0):

    mpirun -np 1 ./singleton_client_server number_of_clients 0

So, for example, with 3 clients I run:

    mpirun -np 1 ./singleton_client_server 3 1
    mpirun -np 1 ./singleton_client_server 3 0
    mpirun -np 1 ./singleton_client_server 3 0
    mpirun -np 1 ./singleton_client_server 3 0

That means you are right - the same number of clients is passed to every mpirun instance.
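For context, the rendezvous the test exercises looks roughly like the sketch below. This is only my paraphrase, not the exact code of singleton_client_server.c; in particular, the service name "singleton_test" and the MPI_Publish_name()/MPI_Lookup_name() exchange are assumptions about how the port name reaches the clients:

    /* minimal sketch of the connect/accept rendezvous (assumed structure,
     * not the exact test code) */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) return 1;
        int  nclients  = atoi(argv[1]);
        int  is_server = atoi(argv[2]);
        char port[MPI_MAX_PORT_NAME];

        MPI_Init(&argc, &argv);
        if (is_server) {
            MPI_Open_port(MPI_INFO_NULL, port);
            /* assumption: the port is published so clients can look it up */
            MPI_Publish_name("singleton_test", MPI_INFO_NULL, port);
            for (int i = 0; i < nclients; i++) {
                MPI_Comm client;
                /* the server sits here when the hang occurs */
                MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
            }
            MPI_Unpublish_name("singleton_test", MPI_INFO_NULL, port);
            MPI_Close_port(port);
        } else {
            MPI_Comm server;
            MPI_Lookup_name("singleton_test", MPI_INFO_NULL, port);
            /* a client that cannot connect sits here */
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
        }
        MPI_Finalize();
        return 0;
    }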
The test does not involve MPI_Comm_disconnect(), but the problem shows up earlier than that: some of the clients (in most cases the last one) sometimes cannot connect to the server, so the server and all the other clients hang, waiting for the connection with the missing client(s). The behaviour of the accept/connect methods is therefore a bit confusing to me. If I understand your post correctly, the problem is not in the timeout value, is it?
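By the way, that "everybody hangs" symptom would make sense if the test follows the usual pattern of merging each accepted client into a growing intracommunicator. A hedged sketch of that structure (my paraphrase of such a loop, not the test's actual code):

    #include <mpi.h>

    /* grow one intracommunicator by accepting clients one at a time.
     * MPI_Comm_accept() is collective over 'everyone', so the server and
     * every already-merged client block until the next client connects -
     * a single client that cannot connect stalls the whole group. */
    static MPI_Comm accept_all_clients(const char *port, int nclients)
    {
        MPI_Comm everyone = MPI_COMM_SELF;
        for (int i = 0; i < nclients; i++) {
            MPI_Comm inter, merged;
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, everyone, &inter);
            MPI_Intercomm_merge(inter, 0, &merged); /* 0: order this group first */
            MPI_Comm_free(&inter);
            if (everyone != MPI_COMM_SELF)
                MPI_Comm_free(&everyone);
            everyone = merged;
        }
        return everyone;
    }

The already-connected clients would run the matching accept/merge iterations on their side, which is exactly why they all end up waiting for the last client.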
Cheers,
Matus

2016-07-19 6:28 GMT+02:00 Gilles Gouaillardet <gil...@rist.or.jp>:

> How do you run the test?
>
> You should have the same number of clients in each mpirun instance; the
> following simple shell script starts the test the way I think it is
> supposed to be run.
>
> Note the test itself is arguable, since MPI_Comm_disconnect() is never
> invoked (and you will observe some related dpm_base_disconnect_init
> errors).
>
> #!/bin/sh
>
> clients=3
>
> screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
> for i in $(seq $clients); do
>     sleep 1
>     screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
> done
>
>
> Ralph,
>
> this test fails with master. When running the "server" (second parameter
> is 1), MPI_Comm_accept() fails with a timeout.
>
> In ompi/dpm/dpm.c there is a hard-coded 60-second timeout:
>
>     OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
>
> but this is not the timeout that is triggered. The eviction_cbfunc
> timeout function is invoked instead; it was set when opal_hotel_init()
> was invoked in orte/orted/pmix/pmix_server.c. Its default timeout is
> 2 seconds, but in this case (the user invokes MPI_Comm_accept()) I guess
> the timeout should be infinite, or the 60-second hard-coded value
> described above.
>
> Sadly, if I set a higher timeout value (mpirun --mca
> orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return
> when the client invokes MPI_Comm_connect().
>
> Could you please have a look at this?
>
> Cheers,
>
> Gilles
>
> On 7/15/2016 9:20 PM, M. D. wrote:
>
> Hello,
>
> I have a problem with a basic client-server application. I tried to run
> the C program from
> https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
> I saw this program mentioned in many discussions on your website, so I
> expected it to work properly, but after more testing I found out that
> there is probably an error somewhere in the connect/accept methods. I
> have read many discussions and threads on your website, but I have not
> found a problem similar to the one I am facing. When I run this app with
> one server and several clients (3, 4, 5, 6, ...), the app sometimes
> hangs. It hangs when the second or a later client wants to connect to
> the server (sometimes the third client hangs, sometimes the fourth,
> sometimes the second, and so on).
> So the app starts to hang while the server waits in accept and the
> client waits in connect, and it cannot continue because this client
> cannot connect to the server. It is strange, because I observe this
> behaviour only in some cases: sometimes it works without any problems,
> sometimes it does not. The behaviour is unpredictable and unstable.
>
> I have installed Open MPI 1.10.2 on my Fedora 19. I have the same
> problem with the Java version of this application - it also hangs
> sometimes. I need this app in Java, but first it must work properly in
> the C implementation. Because of this strange behaviour, I assume there
> may be an error in the Open MPI implementation of the connect/accept
> methods. I tried it also with another version of Open MPI - 1.8.1 - but
> the problem did not disappear.
>
> Could you help me find what causes the problem? Maybe I did not
> understand something about Open MPI (or connect/accept) and the problem
> is on my side. I will appreciate any help, support, or interest in this
> topic.
>
> Best regards,
> Matus Dobrotka