Good Open MPI gurus,
I've further reduced the size of the experiment that reproduces the
problem. My array of requests now has just 10 entries, and by the time
the server gets stuck in MPI_Waitany() and three of the clients are
stuck in MPI_Recv(), the array still holds three uncompleted Isend()
requests and three uncompleted Irecv() requests.
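If it would help with further diagnosis, I can also poll the stuck entries
non-destructively around the hang with something along these lines (just a
sketch, assuming MPI_Request_get_status() is usable in this build; reqs and
N_REQS stand in for my actual 10-entry array and its size):

    /* Sketch only: non-destructive poll of the request array around the hang.
     * MPI_Request_get_status() reports completion without freeing the request. */
    for (int i = 0; i < N_REQS; i++) {
        if (reqs[i] == MPI_REQUEST_NULL)
            continue;                     /* already completed and collected */
        int flag = 0;
        MPI_Status st;
        MPI_Request_get_status(reqs[i], &flag, &st);
        fprintf(stderr, "slot %d: %s\n", i,
                flag ? "complete but not yet collected" : "still pending");
    }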
I've upgraded to Open MPI 1.2.4, but this made no difference.
Are there any internal logging or debugging facilities in Open MPI that
would allow me to further track the calls that eventually result in the
error in mca_btl_tcp_frag_recv()?
Thanks,
Daniel
Daniel Rozenbaum wrote:
Here's some more info on the problem I've been struggling with; my
apologies for the lengthy posts, but I'm a little desperate here :-)
I was able to reduce the size of the experiment that reproduces the
problem, both in terms of input data size and the number of slots in
the cluster. The cluster now consists of 6 slots (5 clients), with two
of the clients running on the same node as the server and the other
three on another node. This allowed me to follow Brian's advice: I ran
the server and all the clients under gdb and confirmed that none of the
processes terminates (normally or abnormally) when the server reports
the "readv failed" errors.
I then followed Jeff's advice and added a debug loop just before the
server's call to MPI_Waitany(), identifying the entries in the requests
array that are not MPI_REQUEST_NULL, and then tracing those requests
back. What I found is described below, after a rough sketch of that loop.
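The debug loop itself is essentially the following (a lightly simplified
sketch, not my exact code; nreq is the size of the array and req_desc[]
stands in for my bookkeeping of what each slot was last used for):

    /* Simplified version of the debug loop that runs just before MPI_Waitany(). */
    int pending = 0;
    for (int i = 0; i < nreq; i++) {
        if (requests[i] != MPI_REQUEST_NULL) {
            fprintf(stderr, "[server] slot %3d still pending: %s\n",
                    i, req_desc[i]);  /* e.g. "Isend to client 3" */
            pending++;
        }
    }
    fprintf(stderr, "[server] %d of %d requests not yet MPI_REQUEST_NULL\n",
            pending, nreq);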
At some point during the run, the server calls MPI_Waitany() on an
array of 96 requests and gets stuck in it forever; the only thing that
happens thereafter is that the server reports a couple of "readv
failed" errors:
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
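(If I'm interpreting the errno correctly, and assuming these are Linux
hosts, errno 110 is ETIMEDOUT, i.e. "Connection timed out"; a trivial
standalone program like this confirms the mapping:)

    /* Quick standalone check of what errno 110 means (Linux: ETIMEDOUT). */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    int main(void)
    {
        printf("errno 110 -> %s\n", strerror(110));  /* "Connection timed out" on Linux */
        printf("ETIMEDOUT  == %d\n", ETIMEDOUT);
        return 0;
    }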
According to my debug prints, just before that last call to
MPI_Waitany() the requests[] array contains 38 entries that are not
MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(),
half to Irecv(). For example, entries 4,14,24,34,44,54,64,74,84,94 are
used for Isend()'s from the server to client #3 (of 5), and entries
5,15,...,95 are used for Irecv()'s from the same client.
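Just to restate that indexing pattern in code form (an illustration only,
not my literal declarations):

    /* Illustration of the slot pattern above for client #3:
     * round r = 0..9 uses slot 4 + 10*r for the Isend (4,14,...,94)
     * and slot 5 + 10*r for the matching Irecv (5,15,...,95). */
    static inline int client3_send_slot(int r) { return 4 + 10 * r; }
    static inline int client3_recv_slot(int r) { return 5 + 10 * r; }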
As an example, I traced what's going on with requests[4]. As I
mentioned, it corresponds to a call to MPI_Isend() initiated by the
server to client #3 (of 5). By the time the server gets stuck in
Waitany(), this client has already correctly processed the first
Isend() from the server in requests[4], returned its response in
requests[5], and the server received that response properly. After
receiving the response, the server Isend()'s the next task to this
client in requests[4], and this is correctly reflected in "requests[4]
!= MPI_REQUEST_NULL" just before the last call to Waitany(), but for
some reason this send doesn't seem to go any further.
All the other requests[] entries corresponding to Isend()'s initiated by
the server to the same client (14, 24, ..., 94) are likewise not
MPI_REQUEST_NULL, and none of those sends is going any further either.
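To make the request life-cycle I'm describing concrete, the server's
dispatch loop has essentially this shape (a condensed sketch, not the real
code: tags, buffer management and termination handling are omitted, and
helpers such as is_recv_slot(), client_of(), client_rank(),
paired_send_slot(), next_task_for(), task_len() and handle_response() are
stand-ins for my application logic):

    /* Condensed sketch of the server's dispatch cycle (placeholders, not real code). */
    for (;;) {
        int idx;
        MPI_Status status;
        MPI_Waitany(nreq, requests, &idx, &status);  /* <-- where it now hangs forever */
        if (idx == MPI_UNDEFINED)
            break;                                   /* every slot is MPI_REQUEST_NULL */

        if (is_recv_slot(idx)) {
            int client = client_of(idx);
            handle_response(client, recv_buf[idx]);

            /* Hand the next task to that client: Isend into the paired send slot,
             * then re-post the Irecv for its next response. */
            MPI_Isend(next_task_for(client), task_len(client), MPI_BYTE,
                      client_rank(client), TASK_TAG, MPI_COMM_WORLD,
                      &requests[paired_send_slot(idx)]);
            MPI_Irecv(recv_buf[idx], MAX_RESP_LEN, MPI_BYTE,
                      client_rank(client), RESP_TAG, MPI_COMM_WORLD,
                      &requests[idx]);
        }
        /* Completed send slots simply become MPI_REQUEST_NULL; nothing to do here. */
    }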
One thing that might be important is that the messages the server sends
to the clients in this experiment are quite large, ranging from
hundreds of Kbytes to several Mbytes, the largest being around 9
Mbytes. The largest messages are sent at the beginning of the run,
though, and those are processed correctly.
Also, I ran the same experiment on another cluster that uses slightly
different hardware and network infrastructure, and could not reproduce
the problem.
Hope at least some of the above makes some sense. Any additional advice
would be greatly appreciated!
Many thanks,
Daniel