I'm now running the same experiment under valgrind. It will probably
run for a few days, but interestingly, under valgrind's memcheck the
app has been reporting many more of these "readv failed" errors, and
not only on the server node:

[host1][0,1,0] [host4][0,1,13] [host5][0,1,18] [host8][0,1,30] [host10][0,1,36] [host12][0,1,46]

Whereas in the original run I got 3 such messages, in the valgrind'ed
run I've gotten about 45 so far, and the app still has about 75% of
the work left. I'm checking while all this is happening, and all the
client processes are still running; none exited early.

I've been analyzing the debug output from my original experiment, and
it does look like the server never receives any new messages from two
of the clients after the "readv failed" messages appear. If my
analysis is correct, these two clients ran on the same host. It may
be, then, that the messages with the next tasks to execute, which the
server attempted to send to these two clients, never reached them, or
were never sent. Interestingly, though, there were two additional
clients on the same host, and those seem to have kept working all
along, until the app got stuck.

Once this valgrind experiment is over, I'll proceed to your other
suggestion: a debug loop on the server side checking whether any of
the requests the app is waiting for is not MPI_REQUEST_NULL (I've
sketched below what I have in mind).

Many thanks,
Daniel

Jeff Squyres wrote:
> On Sep 17, 2007, at 11:26 AM, Daniel Rozenbaum wrote:
>
>> What seems to be happening is this: the code of the server is
>> written in such a manner that the server knows how many "responses"
>> it's supposed to receive from all the clients, so when all the
>> calculation tasks have been distributed, the server enters a loop
>> inside which it calls MPI_Waitany on an array of handles until it
>> receives all the results it expects. However, from my debug prints
>> it looks like all the clients think they've sent all the results
>> they could, and they're now all sitting in MPI_Probe, waiting for
>> the server to send out the next instruction (which is supposed to be
>> a message indicating the end of the run). So, the server is stuck in
>> MPI_Waitany() while all the clients are stuck in MPI_Probe().
>
> On the server side, try putting in a debug loop to see if any of the
> requests that your app is waiting for are not MPI_REQUEST_NULL (it's
> not a value of 0 -- you'll need to compare against MPI_REQUEST_NULL).
> If there are any, see if you can trace backwards to figure out which
> request it is.
>
>> I was wondering if you could comment on the "readv failed" messages
>> I'm seeing in the server's stderr:
>>
>> [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110
>>
>> I'm seeing a few of these over the course of the server's run, with
>> errno=110 ("Connection timed out", according to the
>> "perl -e 'die$!=errno'" method I found in the Open MPI FAQ), and
>> I've also seen errno=113 ("No route to host"). Could this mean
>> there's an occasional infrastructure problem? That would be strange,
>> as it would then seem that this particular run somehow triggers it.
>> Could these messages also mean that some messages got lost due to
>> these errors, and that's why the server thinks it still has some
>> results to receive while the clients think they've sent everything
>> out?
>
> That is all possible. Sorry I missed that in your original message --
> it's basically saying that MPI_COMM_WORLD rank 0 got a timeout from
> one of the peers that it shouldn't have. You're sure that none of
> your processes are exiting early, right? You said they were all
> waiting in MPI_Probe, but I just wanted to double check that they're
> all still running.
>
> Unfortunately, our error message is not very clear about which host
> it lost the connection with -- after you see that message, do you see
> incoming communications from all the slaves, or only some of them?
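
P.S. For anyone following the thread, the server/client pattern in
question looks schematically like this (a simplified sketch, not the
actual app code; nclients, reqs, and the bookkeeping are
placeholders):

    #include <mpi.h>

    /* Server side: one outstanding MPI_Irecv per client; loop in
     * MPI_Waitany until every expected result has arrived. */
    void server_collect(MPI_Request *reqs, int nclients, int expected)
    {
        while (expected > 0) {
            int idx;
            MPI_Status st;
            MPI_Waitany(nclients, reqs, &idx, &st);
            if (idx == MPI_UNDEFINED)
                break; /* every request is already MPI_REQUEST_NULL */
            /* ... process the result from client idx; re-post an
             * MPI_Irecv into reqs[idx] if more results are due ... */
            expected--;
        }
    }

    /* Client side: block until the server (rank 0) sends the next
     * instruction -- another task, or the end-of-run message. */
    void client_wait_for_instruction(void)
    {
        MPI_Status st;
        MPI_Probe(0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        /* ... MPI_Recv the message described by st ... */
    }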
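
And here is roughly what I have in mind for the debug loop Jeff
suggested (again just a sketch; reqs/nreqs stand in for the server's
actual request array and count):

    #include <mpi.h>
    #include <stdio.h>

    /* Print the index of every request that is still outstanding.
     * Note the comparison is against MPI_REQUEST_NULL itself -- it is
     * not guaranteed to be 0. */
    void dump_pending_requests(MPI_Request *reqs, int nreqs)
    {
        for (int i = 0; i < nreqs; ++i) {
            if (reqs[i] != MPI_REQUEST_NULL)
                printf("request %d still pending\n", i);
        }
        fflush(stdout);
    }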
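
Incidentally, the errno values can also be decoded from C with
strerror(), which prints the same strings as the perl trick from the
FAQ (assuming Linux, where 110 is ETIMEDOUT and 113 is EHOSTUNREACH):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        printf("errno=110: %s\n", strerror(110)); /* Connection timed out */
        printf("errno=113: %s\n", strerror(113)); /* No route to host */
        return 0;
    }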