[OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv

2007-09-12 Thread Daniel Rozenbaum
Hello,

I'm working on an MPI application for which I recently started using Open MPI
instead of LAM/MPI. With both Open MPI and LAM/MPI it mostly runs fine, but
there are a number of cases in which the application terminates abnormally
when using LAM/MPI, and hangs when using Open MPI. I haven't been able to
reduce the test case that reproduces the problem, so each run takes about an
hour before the application hangs. It hangs right before it's supposed to end
properly. The master and all the slave processes show up in "top" consuming
100% CPU. The application just hangs there like that until I interrupt it.

Here's the command line:

orterun --prefix /path/to/openmpi -mca btl tcp,self -x PATH -x LD_LIBRARY_PATH
--hostfile hostfile1 /path/to/app_executable <app params>

hostfile1:

host1 slots=3
host2 slots=4
host3 slots=4
host4 slots=4
host5 slots=4
host6 slots=4
host7 slots=4
host8 slots=4
host9 slots=4
host10 slots=4
host11 slots=4
host12 slots=4
host13 slots=4
host14 slots=4

Each host is a dual-CPU dual-core Intel box running Red Hat Enterprise Server 4.


I caught the following error messages on the app's stderr during the run:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: 
readv failed with errno=110
[host8][0,1,29][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed with errno=113

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: 
readv failed with errno=110


Excerpts from strace output, and ompi_info are attached below.
Any advice would be greatly appreciated!
Thanks in advance,
Daniel


strace on the orterun process:

poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=5, events=POLLIN}, 
{fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, 
events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, 
events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, 
events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, 
events=POLLIN}, {fd=0, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, 
events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, 
events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, 
events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, 
events=POLLIN}, {fd=34, events=POLLIN}, {fd=33, events=POLLIN}, {fd=32, 
events=POLLIN}, {fd=35, events=POLLIN}, ...], 71, 1000) = 0
rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGUSR1, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
sched_yield()   = 0
rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGTERM, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGUSR1, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigaction(SIGUSR2, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], 
SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=5, events=POLL



strace on the master process:

rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2a972cae70, [CHLD], SA_RESTORER|SA_RESTART, 
0x3fdf80c4f0}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2a972cae70, [CHLD], SA_RESTORER|SA_RESTART, 
0x3fdf80c4f0}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, 
{fd=8, events=POLLIN}, {fd=14, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, 
events=POLLIN}, {fd=13, events=POLLIN}, {fd=16, events=POLLIN}, {fd=15, 
events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, 
events=POLLIN}, {fd=23, events=POLLIN}, {fd=67, events=POLLIN}, {fd=25, 
events=POLLIN}, {fd=66, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, 
events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, 
events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, 
events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, 
events=POLLIN}, {

Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv

2007-09-17 Thread Daniel Rozenbaum

Jeff, thanks a lot for taking the time,

I looked into this some more, and it could very well be a side effect
of a problem in my code, maybe a memory violation that messes things up;
I'm going to valgrind this thing and see what comes up. Most of the time
the app runs just fine, so I'm not sure whether it could also be a problem
in the MPI messaging logic in my code; it could be, though.


What seems to be happening is this: the code of the server is written in
such a manner that the server knows how many "responses" it's supposed
to receive from all the clients, so once all the calculation tasks have
been distributed, the server enters a loop in which it calls
MPI_Waitany on an array of handles until it has received all the results
it expects. However, from my debug prints it looks like all the clients
think they've sent all the results they could, and they're now all
sitting in MPI_Probe, waiting for the server to send out the next
instruction (which is supposed to contain a message indicating the end
of the run). So the server is stuck in MPI_Waitany() while all the
clients are stuck in MPI_Probe().
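
For reference, here is a rough, self-contained sketch of that pattern; everything
in it (TAG_TASK, TAG_RESULT, TAG_DONE, MSG_BYTES, the two-requests-per-client
layout) is made up for illustration and is not taken from the actual application.
It exercises the same Isend()/Irecv()/Waitany() path on rank 0 and the same
Probe()/Recv() path on the other ranks:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG_TASK   1
#define TAG_RESULT 2
#define TAG_DONE   3
#define MSG_BYTES  (1 << 20)                  /* 1 MB payload per "task" */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* "server" */
        int nclients  = size - 1;
        int remaining = nclients;             /* one result expected per client */
        char *out = calloc(nclients, MSG_BYTES);
        char *in  = calloc(nclients, MSG_BYTES);
        MPI_Request *req = malloc(2 * nclients * sizeof(MPI_Request));
        int c, idx;

        for (c = 0; c < nclients; c++) {
            /* even slots hold sends, odd slots the matching receives */
            MPI_Isend(out + c * MSG_BYTES, MSG_BYTES, MPI_BYTE, c + 1,
                      TAG_TASK, MPI_COMM_WORLD, &req[2 * c]);
            MPI_Irecv(in + c * MSG_BYTES, MSG_BYTES, MPI_BYTE, c + 1,
                      TAG_RESULT, MPI_COMM_WORLD, &req[2 * c + 1]);
        }
        while (remaining > 0) {               /* the loop the real server hangs in */
            MPI_Waitany(2 * nclients, req, &idx, MPI_STATUS_IGNORE);
            if (idx % 2 == 1)                 /* a result came back */
                remaining--;
        }
        MPI_Waitall(2 * nclients, req, MPI_STATUSES_IGNORE);
        for (c = 0; c < nclients; c++)        /* end-of-run message */
            MPI_Send(NULL, 0, MPI_BYTE, c + 1, TAG_DONE, MPI_COMM_WORLD);
        printf("server: all results received\n");
        free(out); free(in); free(req);
    } else {                                  /* "client" */
        for (;;) {
            MPI_Status st;
            int count;
            char *buf;
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);   /* clients sit here */
            if (st.MPI_TAG == TAG_DONE) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, TAG_DONE,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                break;
            }
            MPI_Get_elements(&st, MPI_BYTE, &count);
            buf = malloc(count);
            MPI_Recv(buf, count, MPI_BYTE, 0, st.MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ...compute... then send a result of the same size back */
            MPI_Send(buf, count, MPI_BYTE, 0, TAG_RESULT, MPI_COMM_WORLD);
            free(buf);
        }
    }
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with an orterun line like the one above, a
program of this shape finishes normally when everything is healthy; in the real
application, the equivalent of the Waitany() loop is where the server hangs.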



I was wondering if you could comment on the "readv failed" messages I'm 
seeing in the server's stderr:


[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed with errno=110


I'm seeing a few of these over the course of the server's run, with errno=110
("Connection timed out", according to the "perl -e 'die$!=errno'" method
I found in the Open MPI FAQ), and I've also seen errno=113 ("No route to
host"). Could this mean there's an occasional infrastructure problem? That
would be strange, as it would then seem that this particular run somehow
triggers it. Could these messages also mean that some messages got lost due
to these errors, and that's why the server thinks it still has some results
to receive while the clients think they've sent everything out?
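
(As a side note, the same errno decoding can be done with a couple of lines of
C instead of the perl trick; on Linux, errno 110 is ETIMEDOUT and 113 is
EHOSTUNREACH:)

#include <stdio.h>
#include <string.h>

/* Decode the errno values reported by mca_btl_tcp_frag_recv. */
int main(void)
{
    printf("errno 110: %s\n", strerror(110));  /* Connection timed out */
    printf("errno 113: %s\n", strerror(113));  /* No route to host     */
    return 0;
}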


Many thanks,
Daniel



Jeff Squyres wrote:
It sounds like we have a missed corner case of the OMPI run-time not
cleaning up properly.  I know one case like this came up recently (if an
app calls exit() without calling MPI_FINALIZE, OMPI v1.2.x hangs) and
Ralph is working on it.


This could well be what is happening here...?

Do you know how your process is exiting?  If a process dies via  
signal, OMPI *should* be seeing that and cleaning up the whole job  
properly.




On Sep 12, 2007, at 10:50 PM, Daniel Rozenbaum wrote:

  

Hello,

I'm working on an MPI application for which I recently started  
using Open MPI instead of LAM/MPI. Both with Open MPI and LAM/MPI  
it mostly runs ok, but there're a number of cases under which the  
application terminates abnormally when using LAM/MPI, and hangs  
when using Open MPI. I haven't been able to reduce the example  
reproducing the problem, so every time it takes about an hour of  
running time before the application hangs. It hangs right before  
it's supposed to end properly. The master and all the slave  
processes are showing in "top" consuming 100% CPU. The application  
just hangs there like that until I interrupt it.


Here's the command line:

orterun --prefix /path/to/openmpi -mca btl tcp,self -x PATH -x
LD_LIBRARY_PATH --hostfile hostfile1 /path/to/app_executable <app params>


hostfile1:

host1 slots=3
host2 slots=4
host3 slots=4
host4 slots=4
host5 slots=4
host6 slots=4
host7 slots=4
host8 slots=4
host9 slots=4
host10 slots=4
host11 slots=4
host12 slots=4
host13 slots=4
host14 slots=4

Each host is a dual-CPU dual-core Intel box running Red Hat  
Enterprise Server 4.



I caught the following error messages on app's stderr during the run:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]  
mca_btl_tcp_frag_recv: readv failed with errno=110
[host8][0,1,29][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]  
mca_btl_tcp_frag_recv: readv failed with errno=113


[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]  
mca_btl_tcp_frag_recv: readv failed with errno=110



Excerpts from strace output, and ompi_info are attached below.
Any advice would be greatly appreciated!
Thanks in advance,
Daniel




ompi_info --all:


                Open MPI: 1.2.3
   Open MPI SVN revision: r15136
                Open RTE: 1.2.3
   Open RTE SVN revision: r15136
                    OPAL: 1.2.3
       OPAL SVN revision: r15136
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.3)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
         MCA installdirs: env (MCA v1.

Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv

2007-09-19 Thread Daniel Rozenbaum




I'm now running the same experiment under valgrind. It's probably
going to run for a few days, but interestingly what I'm seeing now is
that while running under valgrind's memcheck, the app has been
reporting much more of these "recv failed" errors, and not only on the
server node:

[host1][0,1,0]
[host4][0,1,13]
[host5][0,1,18]
[host8][0,1,30]
[host10][0,1,36]
[host12][0,1,46]

Whereas in the original run I got 3 such messages, in the valgrind'ed run I've
gotten about 45 so far, and the app still has about 75% of the work left.

I'm checking while all this is happening, and all the client processes
are still running, none exited early.

I've been analyzing the debug output from my original experiment, and it
does look like the server never receives any new messages from two of
the clients after the "recv failed" messages appear. If my analysis is
correct, these two clients ran on the same host. It might then be the case
that the messages with the next tasks to execute, which the server
attempted to send to these two clients, never reached them, or were
never sent. Interestingly, though, there were two additional clients
on the same host, and those seem to have kept working all along, until
the app got stuck.

Once this valgrind experiment is over, I'll proceed to your other
suggestion about adding a debug loop on the server side that checks which of
the requests the app is waiting for are still not MPI_REQUEST_NULL.
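
For reference, such a debug loop might look roughly like this (placeholder
names, not the app's actual variables), called just before the MPI_Waitany():

#include <mpi.h>
#include <stdio.h>

/* Sketch of the suggested check: report which slots of the request array
 * are still pending (i.e. not MPI_REQUEST_NULL) before blocking in
 * MPI_Waitany().  "requests" and "nreqs" are placeholder names. */
static void dump_pending_requests(MPI_Request *requests, int nreqs)
{
    int i, pending = 0;
    for (i = 0; i < nreqs; i++) {
        if (requests[i] != MPI_REQUEST_NULL) {
            fprintf(stderr, "requests[%d] still pending\n", i);
            pending++;
        }
    }
    fprintf(stderr, "%d of %d requests pending before MPI_Waitany()\n",
            pending, nreqs);
}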

Many thanks,
Daniel


Jeff Squyres wrote:

  On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:

  
  
What seems to be happening is this: the code of the server is  
written in
such a manner that the server knows how many "responses" it's supposed
to receive from all the clients, so when all the calculation tasks  
have
been distributed, the server enters a loop inside which it calls
MPI_Waitany on an array of handles until it receives all the  
results it
expects. However, from my debug prints it looks like all the clients
think they've sent all the results they could, and they're now all
sitting in MPI_Probe, waiting for the server to send out the next
instruction (which is supposed to contain a message indicating the end
of the run). So, the server is stuck in MPI_Waitany() while all the
clients are stuck in MPI_Probe().

  
  
On the server side, try putting in a debug loop and see if any of the  
requests that your app is waiting for are not MPI_REQUEST_NULL (it's  
not a value of 0 -- you'll need to compare against  
MPI_REQUEST_NULL).  If there are any, see if you can trace backwards  
to see what request it is.

  
  
I was wondering if you could comment on the "readv failed" messages  
I'm
seeing in the server's stderr:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110

I'm seeing a few of these along the server's run, with errno=110
("Connection timed out" according to the "perl -e 'die$!=errno'"  
method
I found in OpenMPI FAQs), and I've also seen errno=113 ("No route to
host"). Could this mean there's an occasional infrastructure  
problem? It
would be strange, as it would then seem that this particular run  
somehow
triggers it?.. Could these messages also mean that some messages got
lost due to these errors, and that's why the server thinks it still  
has
some results to receive while the clients think they've sent  
everything out?

  
  
That is all possible.  Sorry I missed that message in your original  
message -- it's basically a message saying that MPI_COMM_WORLD rank 0  
got a timeout from one of the peers that it shouldn't have.

You're sure that none of your processes are exiting early, right?   
You said they were all waiting in MPI_Probe, but I just wanted to  
double check that they're all still running.

Unfortunately, our error message is not very clear about which host  
it lost the connection with -- after you see that message, do you see  
incoming communications from all the slaves, or only some of them?

  






Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv

2007-09-27 Thread Daniel Rozenbaum




Here's some more info on the problem I've been struggling with; my
apologies for the lengthy posts, but I'm a little desperate here :-)

I was able to reduce the size of the experiment that reproduces the
problem, both in terms of input data size and the number of slots in
the cluster. The cluster now consists of 6 slots (5 clients), with two
of the clients running on the same node as the server and three others
on another node. This allowed me to follow Brian's
advice and run the server and all the clients under gdb and make
sure none of the processes terminates (normally or abnormally) when the
server reports the "readv failed" errors; this is indeed the case.

I then followed Jeff's
advice and added a debug loop just prior to the server calling
MPI_Waitany(), identifying the entries in the requests array which are
not
MPI_REQUEST_NULL, and then tracing back these
requests. What I found was the following:

At some point during the run, the server calls MPI_Waitany() on an
array of requests consisting of 96 elements, and gets stuck in it
forever; the only thing that happens at some point thereafter is that
the server reports a couple of "readv failed" errors:

[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110

According to my debug prints, just before that last call to
MPI_Waitany() the array requests[] contains 38 entries which are not
MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(),
half to Irecv(). Specifically, for example, entries
4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to
client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the
same client.

I traced back what's going on, for instance, with requests[4]. As I
mentioned, it corresponds to a call to MPI_Isend() initiated by the
server to client #3 (of 5). By the time the server gets stuck in
Waitany(), this client has already correctly processed the first
Isend() from the master in requests[4], returned its response in
requests[5], and the server received this response properly. After
receiving this response, the server Isend()'s the next task to this
client in requests[4], and this is correctly reflected in "requests[4]
!= MPI_REQUEST_NULL" just before the last call to Waitany(), but for
some reason this send doesn't seem to go any further.

Looking at all other requests[] corresponding to Isend()'s initiated by
the server to the same client (14,24,...,94), they're all also not
MPI_REQUEST_NULL, and are not going any further either.

One thing that might be important is that the messages the server is
sending to the clients in my experiment are quite large, ranging from
hundreds of kilobytes to several megabytes, the largest being around 9
MB. The largest messages occur at the beginning of the run, though, and
are processed correctly.

Also, I ran the same experiment on another cluster that uses slightly
different
hardware and network infrastructure, and could not reproduce the
problem.

Hope at least some of the above makes some sense. Any additional advice
would be greatly appreciated!
Many thanks,
Daniel


Daniel Rozenbaum wrote:

  
  
  I'm now running the same experiment under valgrind. It's probably
going to run for a few days, but interestingly what I'm seeing now is
that while running under valgrind's memcheck, the app has been
reporting much more of these "recv failed" errors, and not only on the
server node:
  
[host1][0,1,0]
[host4][0,1,13]
[host5][0,1,18]
[host8][0,1,30]
[host10][0,1,36]
[host12][0,1,46]
  
If in the original run I got 3 such messages, in the valgrind'ed run I
got about 45 so far, and the app still has about 75% of the work left.
  
I'm checking while all this is happening, and all the client processes
are still running, none exited early.
  
I've been analyzing the debug output in my original experiment, and it
does look like the server never receives any new messages from two of
the clients after the "recv failed" messages appear. If my analysis is
correct, these two clients ran on the same host. It might be the case
then that the messages with the next tasks to execute that the server
attempted to send to these two clients never reached them, or were
never sent. Interesting though that there were two additional clients
on the same host, and those seem to have kept working all along, until
the app got stuck.
  
Once this valgrind experiment is over, I'll proceed to your other
suggestion about the debug loop on the server side checking for any of
the requests the app is waiting for being MPI_REQUEST_NULL.
  
Many thanks,
Daniel
  
  
Jeff Squyres wrote:
  
On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:

  

  What seems to be happeni

Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv

2007-09-28 Thread Daniel Rozenbaum




Good Open MPI gurus,

I've further reduced the size of the experiment that reproduces the
problem. My array of requests now has just 10 entries, and by the time
the server gets stuck in MPI_Waitany() and three of the clients are
stuck in MPI_Recv(), the array has three uncompleted Isend()'s and
three uncompleted Irecv()'s.

I've upgraded to Open MPI 1.2.4, but this made no difference.

Are there any internal logging or debugging facilities in Open MPI that
would allow me to further track the calls that eventually result in the
error in mca_btl_tcp_frag_recv()?
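
(For reference, one generic facility is the BTL verbosity MCA parameter,
assuming this build exposes it; "ompi_info --param btl tcp" lists the
parameters the TCP BTL supports. Something along these lines should make the
BTL framework log more of what it is doing to stderr, though how much detail
the 1.2.x TCP BTL actually emits at that level is not guaranteed:

orterun --prefix /path/to/openmpi -mca btl tcp,self -mca btl_base_verbose 100 \
    -x PATH -x LD_LIBRARY_PATH --hostfile hostfile1 /path/to/app_executable <app params>
)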

Thanks,
Daniel


Daniel Rozenbaum wrote:

  
  Here's some more info on the problem I've been struggling with;
my
apologies for the lengthy posts, but I'm a little desperate here :-)
  
I was able to reduce the size of the experiment that reproduces the
problem, both in terms of input data size and the number of slots in
the cluster. The cluster now consists of 6 slots (5 clients), with two
of the clients running on the same node as the server and three others
on another node. This allowed me to follow Brian's
advice and run the server and all the clients under gdb and make
sure none of the processes terminates (normally or abnormally) when the
server reports the "readv failed" errors; this is indeed the case.
  
I then followed Jeff's
advice and added a debug loop just prior to the server calling
MPI_Waitany(), identifying the entries in the requests array which are
not
MPI_REQUEST_NULL, and then tracing back these
requests. What I found was the following:
  
At some point during the run, the server calls MPI_Waitany() on an
array of requests consisting of 96 elements, and gets stuck in it
forever; the only thing that happens at some point thereafter is that
the server reports a couple of "readv failed" errors:
  
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
[host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv]
mca_btl_tcp_frag_recv: readv failed with errno=110
  
  According to my debug prints, just before that last call to
MPI_Waitany() the array requests[] contains 38 entries which are not
MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(),
half to Irecv(). Specifically, for example, entries
4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to
client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the
same client.
  
I traced back what's going on, for instance, with requests[4]. As I
mentioned, it corresponds to a call to MPI_Isend() initiated by the
server to client #3 (of 5). By the time the server gets stuck in
Waitany(), this client has already correctly processed the first
Isend() from master in requests[4], returned its response in
requests[5], and the server received this response properly. After
receiving this response, the server Isend()'s the next task to this
client in requests[4], and this is correctly reflected in "requests[4]
!= MPI_REQUESTS_NULL" just before the last call to Waitany(), but for
some reason this send doesn't seem to go any further.
  
Looking at all other requests[] corresponding to Isend()'s initiated by
the server to the same client (14,24,...,94), they're all also not
MPI_REQUEST_NULL, and are not going any further either.
  
One thing that might be important is that the messages the server is
sending to the clients in my experiment are quite large, ranging from
hundreds of Kbytes to several Mbytes, the largest being around 9
Mbytes. The largest messages take place at the beginning of the run and
are processed correctly though.
  
Also, I ran the same experiment on another cluster that uses slightly
different
hardware and network infrastructure, and could not reproduce the
problem.
  
Hope at least some of the above makes some sense. Any additional advice
would be greatly appreciated!
Many thanks,
Daniel
  





[OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck

2007-10-03 Thread Daniel Rozenbaum




Hi again,

I'm trying to debug the problem I posted about several times recently; I
thought I'd try asking a more focused question:

I have the following sequence in the client code:

MPI_Status stat;
ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
assert(ret == MPI_SUCCESS);
ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
assert(ret == MPI_SUCCESS);
char *buffer = malloc(count);
assert(buffer != NULL);
ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
assert(ret == MPI_SUCCESS);
fprintf(stderr, "MPI_Recv done\n");
  

Each MPI_ call in the lines above is surrounded by debug prints
that print out the client's rank, the current time, the action about to be
taken with all its parameter values, and the action's result. After
the first cycle (receive message from server -- process it -- send
response -- wait for next message) works out as
expected, the next cycle gets stuck in MPI_Recv. What I get in my debug
prints is more or less the following:

MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= <...>)
MPI_Probe done, source= 0, tag= 2, error= 0
MPI_Get_elements(status= <...>, dtype= MPI_BYTE, count= <...>)
MPI_Get_elements done, count= 2731776
MPI_Recv(buf= <...>, count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
[no "MPI_Recv done" ever follows; later, "readv failed" errors appear in the server's stderr]

My question then is this - what would cause MPI_Recv to not return,
after the immediately preceding MPI_Probe and MPI_Get_elements return
properly?

Thanks,
Daniel







Re: [OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck

2007-10-18 Thread Daniel Rozenbaum
Unfortunately, so far I haven't even been able to reproduce it on a 
different cluster. Since I had no success getting to the bottom of this 
problem, I've been concentrating my efforts on changing the app so that 
there's no need to send very large messages; I might be able to find 
time later to create a short example that shows the problem.
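
For what it's worth, the change is roughly along the lines of the sketch below;
this is purely illustrative, not the actual code, and CHUNK_BYTES, TAG_LEN,
TAG_DATA and the helper names are made up:

#include <mpi.h>
#include <stdlib.h>

#define CHUNK_BYTES (256 * 1024)
#define TAG_LEN  10
#define TAG_DATA 11

/* Sender: first a small message with the total size, then the chunks. */
static void send_chunked(char *buf, long total, int dest)
{
    long off;
    MPI_Send(&total, 1, MPI_LONG, dest, TAG_LEN, MPI_COMM_WORLD);
    for (off = 0; off < total; off += CHUNK_BYTES) {
        int n = (total - off < CHUNK_BYTES) ? (int)(total - off) : CHUNK_BYTES;
        MPI_Send(buf + off, n, MPI_BYTE, dest, TAG_DATA, MPI_COMM_WORLD);
    }
}

/* Receiver: read the size, then the chunks into one buffer. */
static char *recv_chunked(int src, long *total_out)
{
    long total, off;
    char *buf;
    MPI_Recv(&total, 1, MPI_LONG, src, TAG_LEN, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    buf = malloc(total);
    for (off = 0; off < total; off += CHUNK_BYTES) {
        int n = (total - off < CHUNK_BYTES) ? (int)(total - off) : CHUNK_BYTES;
        MPI_Recv(buf + off, n, MPI_BYTE, src, TAG_DATA, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    *total_out = total;
    return buf;
}

Whether smaller messages merely work around the underlying TCP BTL issue or
avoid it entirely is a separate question, of course.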


FWIW, when I was debugging it, I peeked a little into Open MPI code, and 
found that the client's MPI_Recv gets stuck in mca_pml_ob1_recv(), after 
it determines that "recvreq->req_recv.req_base.req_ompi.req_complete == 
false" and calls opal_condition_wait().


Jeff Squyres wrote:

Can you send a short test program that shows this problem, perchance?


On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:

  

Hi again,

I'm trying to debug the problem I posted on several times recently;  
I thought I'd try asking a more focused question:


I have the following sequence in the client code:
MPI_Status stat;
ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
assert(ret == MPI_SUCCESS);
ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
assert(ret == MPI_SUCCESS);
char *buffer = malloc(count);
assert(buffer != NULL);
ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG,  
MPI_COMM_WORLD, MPI_STATUS_IGNORE);

assert(ret == MPI_SUCCESS);
fprintf(stderr, "MPI_Recv done\n");
Each MPI_ call in the lines above is surrounded by debug prints  
that print out the client's rank, current time, the action about to  
be taken with all its parameters' values, and the action's result.  
After the first cycle (receive message from server -- process it --  
send response -- wait for next message) works out as expected, the  
next cycle get stuck in MPI_Recv. What I get in my debug prints is  
more or less the following:
MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= <...>)
MPI_Probe done, source= 0, tag= 2, error= 0
MPI_Get_elements(status= <...>, dtype= MPI_BYTE, count= <...>)
MPI_Get_elements done, count= 2731776
MPI_Recv(buf= <...>, count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
[no "MPI_Recv done" ever follows; later, "readv failed" errors appear in the server's stderr]
My question then is this - what would cause MPI_Recv to not return,  
after the immediately preceding MPI_Probe and MPI_Get_elements  
return properly?


Thanks,
Daniel





Re: [OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck

2007-10-18 Thread Daniel Rozenbaum
Yes, a memory bug has been my primary focus due to the not entirely 
consistent nature of this problem; I valgrind'ed the app a number of 
times, to no avail though. Will post again if anything new comes up... 
Thanks!


Jeff Squyres wrote:
Yes, that's the normal progression.  For some reason, OMPI appears to  
have decided that it had not yet received the message.  Perhaps a  
memory bug in your application...?  Have you run it through valgrind,  
or some other memory-checking debugger, perchance?


On Oct 18, 2007, at 12:35 PM, Daniel Rozenbaum wrote:

  

Unfortunately, so far I haven't even been able to reproduce it on a
different cluster. Since I had no success getting to the bottom of this
problem, I've been concentrating my efforts on changing the app so that
there's no need to send very large messages; I might be able to find
time later to create a short example that shows the problem.

FWIW, when I was debugging it, I peeked a little into Open MPI code, and
found that the client's MPI_Recv gets stuck in mca_pml_ob1_recv(), after
it determines that
"recvreq->req_recv.req_base.req_ompi.req_complete == false" and calls
opal_condition_wait().

Jeff Squyres wrote:


Can you send a short test program that shows this problem, perchance?


On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:


  

Hi again,

I'm trying to debug the problem I posted on several times recently;
I thought I'd try asking a more focused question:

I have the following sequence in the client code:
MPI_Status stat;
ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
assert(ret == MPI_SUCCESS);
ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
assert(ret == MPI_SUCCESS);
char *buffer = malloc(count);
assert(buffer != NULL);
ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG,
MPI_COMM_WORLD, MPI_STATUS_IGNORE);
assert(ret == MPI_SUCCESS);
fprintf(stderr, "MPI_Recv done\n");

Each MPI_ call in the lines above is surrounded by debug prints
that print out the client's rank, current time, the action about to
be taken with all its parameters' values, and the action's result.
After the first cycle (receive message from server -- process it --
send response -- wait for next message) works out as expected, the
next cycle get stuck in MPI_Recv. What I get in my debug prints is
more or less the following:
MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= <...>)
MPI_Probe done, source= 0, tag= 2, error= 0
MPI_Get_elements(status= <...>, dtype= MPI_BYTE, count= <...>)
MPI_Get_elements done, count= 2731776
MPI_Recv(buf= <...>, count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
[no "MPI_Recv done" ever follows; later, "readv failed" errors appear in the server's stderr]

My question then is this - what would cause MPI_Recv to not return,
after the immediately preceding MPI_Probe and MPI_Get_elements
return properly?

Thanks,
Daniel