[OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Hello,

I'm working on an MPI application for which I recently started using Open MPI instead of LAM/MPI. With both Open MPI and LAM/MPI it mostly runs fine, but there are a number of cases in which the application terminates abnormally under LAM/MPI and hangs under Open MPI. I haven't been able to shrink the example that reproduces the problem, so each attempt takes about an hour of running time before the application hangs. It hangs right before it is supposed to end properly; the master and all the slave processes show up in "top" consuming 100% CPU, and the application just sits there like that until I interrupt it.

Here's the command line:

    orterun --prefix /path/to/openmpi -mca btl tcp,self -x PATH -x LD_LIBRARY_PATH --hostfile hostfile1 /path/to/app_executable

hostfile1:

    host1 slots=3
    host2 slots=4
    host3 slots=4
    host4 slots=4
    host5 slots=4
    host6 slots=4
    host7 slots=4
    host8 slots=4
    host9 slots=4
    host10 slots=4
    host11 slots=4
    host12 slots=4
    host13 slots=4
    host14 slots=4

Each host is a dual-CPU dual-core Intel box running Red Hat Enterprise Server 4.

I caught the following error messages on the app's stderr during the run:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
    [host8][0,1,29][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=113
    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

Excerpts from strace output and ompi_info are attached below. Any advice would be greatly appreciated!

Thanks in advance,
Daniel

strace on the orterun process:

    poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=0, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=34, events=POLLIN}, {fd=33, events=POLLIN}, {fd=32, events=POLLIN}, {fd=35, events=POLLIN}, ...], 71, 1000) = 0
    rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
    rt_sigaction(SIGCHLD, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGTERM, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGINT, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGUSR1, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGUSR2, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    sched_yield() = 0
    rt_sigprocmask(SIG_BLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
    rt_sigaction(SIGCHLD, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGTERM, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGINT, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGUSR1, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigaction(SIGUSR2, {0x2a956c7e70, [INT USR1 USR2 TERM CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigprocmask(SIG_UNBLOCK, [INT USR1 USR2 TERM CHLD], NULL, 8) = 0
    poll([{fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=5, events=POLL

strace on the master process:

    rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
    rt_sigaction(SIGCHLD, {0x2a972cae70, [CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
    rt_sigaction(SIGCHLD, {0x2a972cae70, [CHLD], SA_RESTORER|SA_RESTART, 0x3fdf80c4f0}, NULL, 8) = 0
    rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
    poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=14, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=16, events=POLLIN}, {fd=15, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}, {fd=67, events=POLLIN}, {fd=25, events=POLLIN}, {fd=66, events=POLLIN}, {fd=26, events=POLLIN}, {fd=27, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLIN}, {fd=30, events=POLLIN}, {fd=31, events=POLLIN}, {fd=32, events=POLLIN}, {fd=33, events=POLLIN}, {fd=34, events=POLLIN}, {fd=35, events=POLLIN}, {fd=36, events=POLLIN}, {
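For what it's worth, the errno values in those readv messages can be decoded with strerror() rather than the perl one-liner mentioned later in this thread; a minimal sketch (110 and 113 are the values reported above, which on Linux correspond to "Connection timed out" and "No route to host"):

    /* errno_decode.c: print the text for the errno values seen in the
     * "readv failed" messages above.  Build: gcc -o errno_decode errno_decode.c */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        int codes[] = { 110, 113 };   /* values reported by mca_btl_tcp_frag_recv */
        int i;
        for (i = 0; i < (int)(sizeof(codes) / sizeof(codes[0])); i++)
            printf("errno=%d: %s\n", codes[i], strerror(codes[i]));
        return 0;
    }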
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Jeff, thanks a lot for taking the time.

I looked into this some more, and this could very well be a side effect of a problem in my code, maybe a memory violation that messes things up; I'm going to valgrind this thing and see what comes up. Most of the time the app runs just fine, so I'm not sure whether it could also be a problem in the MPI message logic in my code; it could be, though.

What seems to be happening is this: the server code is written in such a manner that the server knows how many "responses" it's supposed to receive from all the clients, so when all the calculation tasks have been distributed, the server enters a loop in which it calls MPI_Waitany on an array of request handles until it receives all the results it expects. However, from my debug prints it looks like all the clients think they've sent all the results they could, and they're now all sitting in MPI_Probe, waiting for the server to send out the next instruction (which is supposed to be a message indicating the end of the run). So the server is stuck in MPI_Waitany() while all the clients are stuck in MPI_Probe().

I was wondering if you could comment on the "readv failed" messages I'm seeing in the server's stderr:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

I'm seeing a few of these during the server's run, with errno=110 ("Connection timed out", according to the perl -e 'die$!=errno' method I found in the Open MPI FAQ), and I've also seen errno=113 ("No route to host"). Could this mean there's an occasional infrastructure problem? It would be strange, as it would then seem that this particular run somehow triggers it. Could these messages also mean that some messages got lost due to these errors, and that's why the server thinks it still has some results to receive while the clients think they've sent everything out?

Many thanks,
Daniel

Jeff Squyres wrote:

It sounds like we have a missed corner case of the OMPI run-time not cleaning up properly. I know one case like this came up recently (if an app calls exit() without calling MPI_FINALIZE, OMPI v1.2.x hangs) and Ralph is working on it. This could well be what is happening here...? Do you know how your process is exiting? If a process dies via a signal, OMPI *should* be seeing that and cleaning up the whole job properly.

On Sep 12, 2007, at 10:50 PM, Daniel Rozenbaum wrote:

Hello, I'm working on an MPI application for which I recently started using Open MPI instead of LAM/MPI. Both with Open MPI and LAM/MPI it mostly runs ok, but there are a number of cases under which the application terminates abnormally when using LAM/MPI, and hangs when using Open MPI. I haven't been able to reduce the example reproducing the problem, so every time it takes about an hour of running time before the application hangs. It hangs right before it's supposed to end properly. The master and all the slave processes are showing in "top" consuming 100% CPU. The application just hangs there like that until I interrupt it.

Here's the command line:

    orterun --prefix /path/to/openmpi -mca btl tcp,self -x PATH -x LD_LIBRARY_PATH --hostfile hostfile1 /path/to/app_executable <app params>

hostfile1:

    host1 slots=3
    host2 slots=4
    host3 slots=4
    host4 slots=4
    host5 slots=4
    host6 slots=4
    host7 slots=4
    host8 slots=4
    host9 slots=4
    host10 slots=4
    host11 slots=4
    host12 slots=4
    host13 slots=4
    host14 slots=4

Each host is a dual-CPU dual-core Intel box running Red Hat Enterprise Server 4.

I caught the following error messages on the app's stderr during the run:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
    [host8][0,1,29][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=113
    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

Excerpts from strace output and ompi_info are attached below. Any advice would be greatly appreciated! Thanks in advance, Daniel

ompi_info --all:

    Open MPI: 1.2.3
    Open MPI SVN revision: r15136
    Open RTE: 1.2.3
    Open RTE SVN revision: r15136
    OPAL: 1.2.3
    OPAL SVN revision: r15136
    MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
    MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.3)
    MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
    MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
    MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.3)
    MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
    MCA installdirs: env (MCA v1.
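To help picture the communication pattern Daniel describes (the server posts an Isend/Irecv pair per client and blocks in MPI_Waitany until all expected results arrive, while the clients block in MPI_Probe waiting for the next task or an end-of-run message), here is a minimal sketch of that pattern. This is not Daniel's code: the tags, message sizes, and one-task-per-client structure are illustrative assumptions.

    /* master_worker_sketch.c -- illustrative only; tags and sizes are made up. */
    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_TASK     1
    #define TAG_RESULT   2
    #define TAG_SHUTDOWN 3
    #define TASK_BYTES   (1 << 20)

    static void server(int nclients)
    {
        int expected = nclients;                  /* one result per client in this sketch */
        MPI_Request *reqs = malloc(2 * nclients * sizeof(*reqs));
        char *sendbuf = malloc((size_t)nclients * TASK_BYTES);
        char *recvbuf = malloc((size_t)nclients * TASK_BYTES);

        for (int c = 0; c < nclients; c++) {
            MPI_Isend(sendbuf + (size_t)c * TASK_BYTES, TASK_BYTES, MPI_BYTE,
                      c + 1, TAG_TASK, MPI_COMM_WORLD, &reqs[2 * c]);
            MPI_Irecv(recvbuf + (size_t)c * TASK_BYTES, TASK_BYTES, MPI_BYTE,
                      c + 1, TAG_RESULT, MPI_COMM_WORLD, &reqs[2 * c + 1]);
        }

        while (expected > 0) {                    /* the loop Daniel describes */
            int idx;
            MPI_Waitany(2 * nclients, reqs, &idx, MPI_STATUS_IGNORE);
            if (idx % 2 == 1)                     /* an Irecv completed: one result is in */
                expected--;
        }
        MPI_Waitall(2 * nclients, reqs, MPI_STATUSES_IGNORE);  /* drain remaining Isends */

        for (int c = 0; c < nclients; c++)        /* tell every client the run is over */
            MPI_Send(NULL, 0, MPI_BYTE, c + 1, TAG_SHUTDOWN, MPI_COMM_WORLD);
        free(reqs); free(sendbuf); free(recvbuf);
    }

    static void client(void)
    {
        for (;;) {
            MPI_Status st;
            int count;
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);   /* where the clients sit */
            if (st.MPI_TAG == TAG_SHUTDOWN) {
                MPI_Recv(NULL, 0, MPI_BYTE, 0, TAG_SHUTDOWN, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                return;
            }
            MPI_Get_count(&st, MPI_BYTE, &count);
            char *task = malloc(count);
            MPI_Recv(task, count, MPI_BYTE, 0, st.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... compute ... */
            MPI_Send(task, count, MPI_BYTE, 0, TAG_RESULT, MPI_COMM_WORLD);
            free(task);
        }
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) server(size - 1); else client();
        MPI_Finalize();
        return 0;
    }

The key property of this layout, and of Daniel's description, is that the only way the run ends is for every outstanding Irecv on the server to complete; if even one send or receive stalls in the transport, the server stays in MPI_Waitany and the clients stay in MPI_Probe, which is exactly the hang being reported.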
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
I'm now running the same experiment under valgrind. It's probably going to run for a few days, but interestingly, while running under valgrind's memcheck the app has been reporting many more of these "recv failed" errors, and not only on the server node:

    [host1][0,1,0]
    [host4][0,1,13]
    [host5][0,1,18]
    [host8][0,1,30]
    [host10][0,1,36]
    [host12][0,1,46]

If in the original run I got 3 such messages, in the valgrind'ed run I've gotten about 45 so far, and the app still has about 75% of the work left. I'm checking while all this is happening, and all the client processes are still running; none exited early.

I've been analyzing the debug output from my original experiment, and it does look like the server never receives any new messages from two of the clients after the "recv failed" messages appear. If my analysis is correct, these two clients ran on the same host. It might be the case, then, that the messages with the next tasks to execute that the server attempted to send to these two clients never reached them, or were never sent. Interestingly, though, there were two additional clients on the same host, and those seem to have kept working all along, until the app got stuck.

Once this valgrind experiment is over, I'll proceed to your other suggestion about the debug loop on the server side checking which of the requests the app is waiting for are not MPI_REQUEST_NULL.

Many thanks,
Daniel

Jeff Squyres wrote:

On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:

What seems to be happening is this: the code of the server is written in such a manner that the server knows how many "responses" it's supposed to receive from all the clients, so when all the calculation tasks have been distributed, the server enters a loop inside which it calls MPI_Waitany on an array of handles until it receives all the results it expects. However, from my debug prints it looks like all the clients think they've sent all the results they could, and they're now all sitting in MPI_Probe, waiting for the server to send out the next instruction (which is supposed to contain a message indicating the end of the run). So, the server is stuck in MPI_Waitany() while all the clients are stuck in MPI_Probe().

On the server side, try putting in a debug loop and see if any of the requests that your app is waiting for are not MPI_REQUEST_NULL (it's not a value of 0 -- you'll need to compare against MPI_REQUEST_NULL). If there are any, see if you can trace backwards to see what request it is.

I was wondering if you could comment on the "readv failed" messages I'm seeing in the server's stderr:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

I'm seeing a few of these along the server's run, with errno=110 ("Connection timed out" according to the perl -e 'die$!=errno' method I found in the Open MPI FAQ), and I've also seen errno=113 ("No route to host"). Could this mean there's an occasional infrastructure problem? It would be strange, as it would then seem that this particular run somehow triggers it. Could these messages also mean that some messages got lost due to these errors, and that's why the server thinks it still has some results to receive while the clients think they've sent everything out?

That is all possible. Sorry I missed that message in your original message -- it's basically a message saying that MPI_COMM_WORLD rank 0 got a timeout from one of the peers that it shouldn't have.

You're sure that none of your processes are exiting early, right? You said they were all waiting in MPI_Probe, but I just wanted to double check that they're all still running. Unfortunately, our error message is not very clear about which host it lost the connection with -- after you see that message, do you see incoming communications from all the slaves, or only some of them?
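Jeff's suggested debug loop could look something like the sketch below; "requests" and "nreq" are placeholders for whatever array and length the server actually passes to MPI_Waitany.

    /* Hypothetical debug helper, per Jeff's suggestion: before blocking in
     * MPI_Waitany, report which entries of the request array are still
     * outstanding.  Completed or never-used entries are MPI_REQUEST_NULL,
     * so anything else is a send or receive the server is still waiting on. */
    #include <stdio.h>
    #include <mpi.h>

    static void dump_pending(const char *where, MPI_Request *requests, int nreq)
    {
        int pending = 0;
        for (int i = 0; i < nreq; i++) {
            if (requests[i] != MPI_REQUEST_NULL) {
                fprintf(stderr, "%s: requests[%d] still pending\n", where, i);
                pending++;
            }
        }
        fprintf(stderr, "%s: %d of %d requests pending\n", where, pending, nreq);
    }

    /* usage, just before the blocking wait:
     *     dump_pending("before MPI_Waitany", requests, nreq); */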
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Here's some more info on the problem I've been struggling with; my apologies for the lengthy posts, but I'm a little desperate here :-)

I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and the three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb, and to make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case.

I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing those requests back. What I found was the following:

At some point during the run, the server calls MPI_Waitany() on an array of 96 requests and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

According to my debug prints, just before that last call to MPI_Waitany() the requests[] array contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from the server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv()'s for the same client.

I traced back what's going on with, for instance, requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from the master in requests[4], returned its response in requests[5], and the server received this response properly. After receiving this response, the server Isend()'s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUEST_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further. Looking at all the other requests[] entries corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either.

One thing that might be important is that the messages the server sends to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages occur at the beginning of the run and are processed correctly, though. Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem.

Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated!

Many thanks,
Daniel

Daniel Rozenbaum wrote:

I'm now running the same experiment under valgrind. It's probably going to run for a few days, but interestingly what I'm seeing now is that while running under valgrind's memcheck, the app has been reporting much more of these "recv failed" errors, and not only on the server node:

    [host1][0,1,0]
    [host4][0,1,13]
    [host5][0,1,18]
    [host8][0,1,30]
    [host10][0,1,36]
    [host12][0,1,46]

If in the original run I got 3 such messages, in the valgrind'ed run I got about 45 so far, and the app still has about 75% of the work left. I'm checking while all this is happening, and all the client processes are still running, none exited early. I've been analyzing the debug output in my original experiment, and it does look like the server never receives any new messages from two of the clients after the "recv failed" messages appear. If my analysis is correct, these two clients ran on the same host. It might be the case then that the messages with the next tasks to execute that the server attempted to send to these two clients never reached them, or were never sent. Interesting though that there were two additional clients on the same host, and those seem to have kept working all along, until the app got stuck. Once this valgrind experiment is over, I'll proceed to your other suggestion about the debug loop on the server side checking for any of the requests the app is waiting for being MPI_REQUEST_NULL. Many thanks, Daniel

Jeff Squyres wrote: On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote: What seems to be happeni
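Since Open MPI busy-polls while blocked (which is why all the processes sit at 100% CPU in top), one way to get more visibility into which requests stay stuck is to swap the blocking wait for a polling loop around MPI_Testany that complains when nothing has completed for a while. Again, this is a sketch rather than Daniel's code; "requests" and "nreq" are placeholders, and the 60-second threshold is arbitrary.

    /* Sketch: poll with MPI_Testany instead of blocking in MPI_Waitany,
     * and log when no request has completed for a while. */
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>

    static int waitany_with_watchdog(MPI_Request *requests, int nreq, MPI_Status *st)
    {
        int idx, flag;
        double start = MPI_Wtime();
        for (;;) {
            /* flag becomes true when some request completes; idx is its index
             * (or MPI_UNDEFINED if every entry was already MPI_REQUEST_NULL). */
            MPI_Testany(nreq, requests, &idx, &flag, st);
            if (flag)
                return idx;
            if (MPI_Wtime() - start > 60.0) {
                fprintf(stderr, "watchdog: no request completed for 60 s\n");
                start = MPI_Wtime();
            }
            usleep(1000);   /* keep this loop from spinning the CPU itself */
        }
    }

Calling MPI_Testany still drives Open MPI's progress engine, so the communication behavior should be comparable to sitting in MPI_Waitany, but the server gets a chance to log (or dump the pending indices) while it is stuck.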
Re: [OMPI users] Application using OpenMPI 1.2.3 hangs, error messages in mca_btl_tcp_frag_recv
Good Open MPI gurus,

I've further reduced the size of the experiment that reproduces the problem. My array of requests now has just 10 entries, and by the time the server gets stuck in MPI_Waitany() and three of the clients are stuck in MPI_Recv(), the array has three unprocessed Isend()'s and three unprocessed Irecv()'s. I've also upgraded to Open MPI 1.2.4, but this made no difference.

Are there any internal logging or debugging facilities in Open MPI that would allow me to further track the calls that eventually result in the error in mca_btl_tcp_frag_recv()?

Thanks,
Daniel

Daniel Rozenbaum wrote:

Here's some more info on the problem I've been struggling with; my apologies for the lengthy posts, but I'm a little desperate here :-) I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb and make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case.

I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing back these requests. What I found was the following: At some point during the run, the server calls MPI_Waitany() on an array of requests consisting of 96 elements, and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors:

    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
    [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110

According to my debug prints, just before that last call to MPI_Waitany() the array requests[] contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the same client. I traced back what's going on, for instance, with requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from master in requests[4], returned its response in requests[5], and the server received this response properly. After receiving this response, the server Isend()'s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUEST_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further. Looking at all other requests[] corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either.

One thing that might be important is that the messages the server is sending to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages take place at the beginning of the run and are processed correctly though. Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem.

Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated!

Many thanks,
Daniel
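On the question of internal logging: if memory serves, the 1.2 series lets you raise the verbosity of the BTL framework through MCA parameters, and ompi_info can list what the TCP BTL actually accepts on a given install; treat the parameter names below as something to verify with ompi_info rather than as gospel. For example:

    ompi_info --param btl tcp
    orterun --prefix /path/to/openmpi -mca btl tcp,self -mca btl_base_verbose 30 \
        -x PATH -x LD_LIBRARY_PATH --hostfile hostfile1 /path/to/app_executable

The extra output is mostly connection setup and teardown chatter, but it may help tie a failing readv to a particular peer connection, which the stock error message does not identify.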
[OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck
Hi again,

I'm trying to debug the problem I posted about several times recently; I thought I'd try asking a more focused question. I have the following sequence in the client code:

    MPI_Status stat;
    ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    assert(ret == MPI_SUCCESS);
    ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
    assert(ret == MPI_SUCCESS);
    char *buffer = malloc(count);
    assert(buffer != NULL);
    ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    assert(ret == MPI_SUCCESS);
    fprintf(stderr, "MPI_Recv done\n");

Each MPI_ call in the lines above is surrounded by debug prints that print out the client's rank, the current time, the action about to be taken with all of its parameter values, and the action's result. After the first cycle (receive message from server -- process it -- send response -- wait for next message) works out as expected, the next cycle gets stuck in MPI_Recv. What I get in my debug prints is more or less the following:

    MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= )
    MPI_Probe done, source= 0, tag= 2, error= 0
    MPI_Get_elements(status= , dtype= MPI_BYTE, count= )
    MPI_Get_elements done, count= 2731776
    MPI_Recv(buf= , count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)

My question, then, is this: what would cause MPI_Recv not to return, after the immediately preceding MPI_Probe and MPI_Get_elements return properly?

Thanks,
Daniel
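A short, self-contained test of just this probe-then-receive pattern might look like the sketch below. The 2731776-byte message size is copied from the debug print above; the (unverified) assumption is that a single large MPI_BYTE message from rank 0 to rank 1 over the TCP BTL exercises the same path.

    /* probe_recv_test.c -- minimal probe-then-recv test for two ranks. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_BYTES 2731776   /* size taken from the debug print above */
    #define TAG 2

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            char *buf = calloc(MSG_BYTES, 1);
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, TAG, MPI_COMM_WORLD);
            free(buf);
        } else if (rank == 1) {
            MPI_Status stat;
            int count;
            MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
            MPI_Get_elements(&stat, MPI_BYTE, &count);
            char *buf = malloc(count);
            MPI_Recv(buf, count, MPI_BYTE, 0, stat.MPI_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            fprintf(stderr, "rank 1: received %d bytes\n", count);
            free(buf);
        }

        MPI_Finalize();
        return 0;
    }

Run with something like "orterun -np 2 -mca btl tcp,self --hostfile hostfile1 ./probe_recv_test" so the two ranks land on different hosts, as in the failing runs.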
Re: [OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck
Unfortunately, so far I haven't even been able to reproduce it on a different cluster. Since I've had no success getting to the bottom of this problem, I've been concentrating my efforts on changing the app so that there's no need to send very large messages; I might be able to find time later to create a short example that shows the problem.

FWIW, when I was debugging it, I peeked a little into the Open MPI code and found that the client's MPI_Recv gets stuck in mca_pml_ob1_recv(), after it determines that "recvreq->req_recv.req_base.req_ompi.req_complete == false" and calls opal_condition_wait().

Jeff Squyres wrote:

Can you send a short test program that shows this problem, perchance?

On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:

Hi again, I'm trying to debug the problem I posted on several times recently; I thought I'd try asking a more focused question: I have the following sequence in the client code:

    MPI_Status stat;
    ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    assert(ret == MPI_SUCCESS);
    ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
    assert(ret == MPI_SUCCESS);
    char *buffer = malloc(count);
    assert(buffer != NULL);
    ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    assert(ret == MPI_SUCCESS);
    fprintf(stderr, "MPI_Recv done\n");

Each MPI_ call in the lines above is surrounded by debug prints that print out the client's rank, current time, the action about to be taken with all its parameters' values, and the action's result. After the first cycle (receive message from server -- process it -- send response -- wait for next message) works out as expected, the next cycle gets stuck in MPI_Recv. What I get in my debug prints is more or less the following:

    MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= )
    MPI_Probe done, source= 0, tag= 2, error= 0
    MPI_Get_elements(status= , dtype= MPI_BYTE, count= )
    MPI_Get_elements done, count= 2731776
    MPI_Recv(buf= , count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)
    <at around this time, "readv failed" errors appear in the server's stderr>

My question then is this - what would cause MPI_Recv to not return, after the immediately preceding MPI_Probe and MPI_Get_elements return properly?

Thanks,
Daniel
Re: [OMPI users] MPI_Probe succeeds, but subsequent MPI_Recv gets stuck
Yes, a memory bug has been my primary suspect, given the not entirely consistent nature of this problem; I've valgrind'ed the app a number of times, though to no avail. Will post again if anything new comes up... Thanks!

Jeff Squyres wrote:

Yes, that's the normal progression. For some reason, OMPI appears to have decided that it had not yet received the message. Perhaps a memory bug in your application...? Have you run it through valgrind, or some other memory-checking debugger, perchance?

On Oct 18, 2007, at 12:35 PM, Daniel Rozenbaum wrote:

Unfortunately, so far I haven't even been able to reproduce it on a different cluster. Since I had no success getting to the bottom of this problem, I've been concentrating my efforts on changing the app so that there's no need to send very large messages; I might be able to find time later to create a short example that shows the problem. FWIW, when I was debugging it, I peeked a little into Open MPI code, and found that the client's MPI_Recv gets stuck in mca_pml_ob1_recv(), after it determines that "recvreq->req_recv.req_base.req_ompi.req_complete == false" and calls opal_condition_wait().

Jeff Squyres wrote:

Can you send a short test program that shows this problem, perchance?

On Oct 3, 2007, at 1:41 PM, Daniel Rozenbaum wrote:

Hi again, I'm trying to debug the problem I posted on several times recently; I thought I'd try asking a more focused question: I have the following sequence in the client code:

    MPI_Status stat;
    ret = MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    assert(ret == MPI_SUCCESS);
    ret = MPI_Get_elements(&stat, MPI_BYTE, &count);
    assert(ret == MPI_SUCCESS);
    char *buffer = malloc(count);
    assert(buffer != NULL);
    ret = MPI_Recv((void *)buffer, count, MPI_BYTE, 0, stat.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    assert(ret == MPI_SUCCESS);
    fprintf(stderr, "MPI_Recv done\n");

Each MPI_ call in the lines above is surrounded by debug prints that print out the client's rank, current time, the action about to be taken with all its parameters' values, and the action's result. After the first cycle (receive message from server -- process it -- send response -- wait for next message) works out as expected, the next cycle gets stuck in MPI_Recv. What I get in my debug prints is more or less the following:

    MPI_Probe(source= 0, tag= MPI_ANY_TAG, comm= MPI_COMM_WORKD, status= )
    MPI_Probe done, source= 0, tag= 2, error= 0
    MPI_Get_elements(status= , dtype= MPI_BYTE, count= )
    MPI_Get_elements done, count= 2731776
    MPI_Recv(buf= , count= 2731776, dtype= MPI_BYTE, src= 0, tag= 2, comm= MPI_COMM_WORLD, stat= MPI_STATUS_IGNORE)

My question then is this - what would cause MPI_Recv to not return, after the immediately preceding MPI_Probe and MPI_Get_elements return properly?

Thanks,
Daniel
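For anyone wanting to repeat the valgrind experiment described earlier in the thread, the usual approach is to make valgrind the program that orterun launches, so each MPI process runs under its own memcheck instance; something along these lines (the exact flags are illustrative, and "<app params>" stands for the application's own arguments):

    orterun --prefix /path/to/openmpi -mca btl tcp,self --hostfile hostfile1 \
        valgrind --tool=memcheck --num-callers=20 /path/to/app_executable <app params>

valgrind's --log-file option (behavior varies by valgrind version) helps keep each rank's output separate. Expect the run to be dramatically slower, as Daniel observed, and expect a fair amount of noise from Open MPI's own internals in addition to any problems in the application.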