How do you run the test?
You should have the same number of clients in each mpirun instance. The
following simple shell script starts the test the way I think it is
supposed to be run.
Note that the test itself is questionable, since MPI_Comm_disconnect() is
never invoked (and you will observe some related
dpm_base_disconnect_init errors).
#!/bin/sh
clients=3
screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 1 2>&1 | tee /tmp/server.$clients"
for i in $(seq $clients); do
    sleep 1
    screen -d -m sh -c "mpirun -np 1 ./singleton_client_server $clients 0 2>&1 | tee /tmp/client.$clients.$i"
done
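Once the script above has run, the tee'd logs can be checked for the dpm_base_disconnect_init errors mentioned earlier. This is only a minimal sketch: the function name count_disconnect_errors is made up for illustration, and the log paths in the usage comment are the ones the script writes to.

```shell
#!/bin/sh
# count_disconnect_errors FILE...
# Print the number of log lines that mention dpm_base_disconnect_init.
# (Hypothetical helper; pass at least one file name.)
count_disconnect_errors() {
    # grep -c exits non-zero when the count is 0; that is not an error here.
    cat "$@" 2>/dev/null | grep -c dpm_base_disconnect_init || true
}

# Usage, after the test has run:
#   count_disconnect_errors /tmp/server.* /tmp/client.*
```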
Ralph,
this test fails with master.
When the program runs as the "server" (second parameter is 1),
MPI_Comm_accept() fails with a timeout.
In ompi/dpm/dpm.c, there is a hard-coded 60-second timeout:
OPAL_PMIX_EXCHANGE(rc, &info, &pdat, 60);
but this is not the timeout that is triggered...
Instead, the eviction_cbfunc timeout function is invoked; it was set
when opal_hotel_init() was invoked in orte/orted/pmix/pmix_server.c.
The default timeout is 2 seconds, but in this case (the user invokes
MPI_Comm_accept()), I guess the timeout should be infinite or 60 seconds
(the hard-coded value described above).
Sadly, if I set a higher timeout value (mpirun --mca
orte_pmix_server_max_wait 180 ...), MPI_Comm_accept() does not return
when the client invokes MPI_Comm_connect().
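For reference, here is a sketch of how the server command line with the larger timeout could be assembled. The parameter name and the value 180 are taken from the text above, and the remaining arguments mirror the launch script; the helper name server_cmd is hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# server_cmd CLIENTS MAX_WAIT
# Print (do not run) the mpirun command for the server side with
# orte_pmix_server_max_wait raised to MAX_WAIT seconds.
# (Hypothetical helper for illustration.)
server_cmd() {
    clients=$1
    max_wait=$2
    echo "mpirun --mca orte_pmix_server_max_wait $max_wait -np 1 ./singleton_client_server $clients 1"
}

server_cmd 3 180
```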
Could you please have a look at this?
Cheers,
Gilles
On 7/15/2016 9:20 PM, M. D. wrote:
Hello,
I have a problem with a basic client-server application. I tried to
run the C program from this website:
https://github.com/hpc/cce-mpi-openmpi-1.7.1/blob/master/orte/test/mpi/singleton_client_server.c
I saw this program mentioned in many discussions on your website, so I
expected it to work properly, but after more testing I found that
there is probably an error somewhere in the connect/accept method. I
have read many discussions and threads on your website, but I have not
found a problem similar to the one I am facing. When I run this app
with one server and several clients (3, 4, 5, 6, ...), the app
sometimes hangs. It hangs when the second or a later client tries to
connect to the server (it varies: sometimes the third client hangs,
sometimes the fourth, sometimes the second, and so on).
So the app hangs at the point where the server waits in accept and the
client waits in connect. It cannot continue, because this client
cannot connect to the server. It is strange, because I observed this
behaviour only in some cases: sometimes it works without any problems,
sometimes it does not. The behaviour is unpredictable.
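One way to quantify such intermittent hangs is to run a command under a time limit and classify the outcome. This is a minimal sketch, assuming GNU coreutils timeout is available (it exits with status 124 when the limit is hit); the helper name run_limited and the limit in the usage comment are hypothetical.

```shell
#!/bin/sh
# run_limited SECONDS CMD [ARGS...]
# Run CMD under a time limit and print "ok", "hang", or "fail(STATUS)".
# Relies on GNU coreutils timeout, which exits with 124 on expiry.
# (Hypothetical helper for illustration.)
run_limited() {
    limit=$1
    shift
    if timeout "$limit" "$@"; then
        status=0
    else
        status=$?
    fi
    if [ "$status" -eq 124 ]; then
        echo "hang"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "fail($status)"
    fi
}

# Usage for one client (120 seconds is an arbitrary limit):
#   run_limited 120 mpirun -np 1 ./singleton_client_server 3 0
```

Repeating the whole server-plus-clients run many times and counting the "hang" results would show how often the problem appears, which is more useful in a report than "sometimes".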
I have installed Open MPI 1.10.2 on Fedora 19. I have the same problem
with the Java version of this application; it also hangs sometimes. I
need this app in Java, but first it must work properly in the C
implementation. Because of this strange behaviour I suspect there may
be an error inside the Open MPI implementation of the connect/accept
methods. I also tried another version of Open MPI, 1.8.1, but the
problem did not disappear.
Could you help me find what is causing the problem? Maybe I have
misunderstood something about Open MPI (or connect/server) and the
problem is on my side. I would appreciate any help, support, or
interest in this topic.
Best regards,
Matus Dobrotka