Shaun,
These all look like fine suggestions. Another tool you should consider for this problem, or others like it in the future, is TotalView. It seems like there are two related questions in your current troubleshooting scenario: 1. is the memory being used where you think it is? 2. is there really an imbalance between sends and receives that is clogging the unexpected message queue? (Two quick, debugger-free sanity checks for these are sketched after the TotalView walkthrough below.)

I'd fire up the application under TotalView with memory debugging enabled (one of the checkboxes right there when you start the debugger). Run to the point where you are seeing the memory imbalance; you don't have to wait for it to get "bad", "noticeable" is enough. Then stop all the processes by clicking Stop, open the memory debugging window from the Debug menu, and check the "memory statistics" view to confirm which MPI process is using more memory than the others. Is the difference in the heap memory? I'm guessing it will be, but there is always the possibility I'm wrong, so it is good to check; the memory statistics view breaks the usage down by kind of memory.

Then select the process that is using more memory (call it the process of interest) and run a "heap status" report. This should tell you where the memory usage is coming from in your program: you get stack backtraces for all the allocations. Depending on the magnitude of the usage it may pop right out in the numbers, or you might have to dig a bit. I'm not sure exactly what the backtrace of the kind of allocation you are talking about would look like.

One great way to pick up on more subtle allocations is to compare the memory usage of a process that is behaving correctly with the process that is behaving incorrectly. You can do that by selecting two processes and doing a "memory comparison": that filters out of view all the allocations that are the same (in terms of backtrace) between the two processes. If you have several hundred extra allocations from the Open MPI runtime on the one process, they should be easier to find in the difference view. If the two processes have other differences you'll get a longer list, but if you know your code you'll hopefully be able to quickly eliminate the ones that are expected differences.

It sounds like you have a strong working hypothesis. Still, it might be useful to run a memory leak check on the process of interest, since a leak is another common way for a process to start taking up a lot of extra memory. If your working hypothesis is correct, the process of interest should come back clean in terms of leaks.

Another technique TotalView lets you bring to bear is inspection of the MPI message queues. Again, this can be done while the processes are stopped, once the memory imbalance is noticeable. Click on the Tools menu and select "message queue graph"; that brings up a graphical display of the state of the MPI message queues in all of your MPI processes. If your hypothesis is correct, there should be an extremely large number of unexpected messages shown for your process of interest. One of the nice things about this view, compared to the MPI tracing tools mentioned previously, is that it only shows the messages that are in the queues at the point in time where you paused the MPI tasks, which may still be a lot of messages, but likely many orders of magnitude fewer than the number of messages a full trace would display.
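For question 1, before (or alongside) attaching the debugger, it can help to confirm from inside the application which ranks are actually the heavy ones. Here is a minimal sketch using plain getrusage() rather than anything TotalView-specific; the function name report_rank_memory is just my placeholder:

    /* Illustrative only: print each rank's peak resident set size so you
     * can see which MPI processes are the heavy ones.  On Linux,
     * ru_maxrss is reported in kilobytes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <sys/resource.h>

    static void report_rank_memory(MPI_Comm comm, const char *label)
    {
        struct rusage ru;
        int rank;

        MPI_Comm_rank(comm, &rank);
        getrusage(RUSAGE_SELF, &ru);
        fprintf(stderr, "[%s] rank %d: peak RSS %ld kB\n",
                label, rank, (long) ru.ru_maxrss);
    }

Calling that at a couple of known points in the run gives you a cheap per-rank memory trace to compare against what the memory statistics view reports.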
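For question 2, here is a rough, debugger-free cross-check of the unexpected-queue hypothesis you could drop into the process of interest at a point where the imbalance is already noticeable. Everything here (the tag, the message size, the assumption that the payload is sent as MPI_BYTE) is a placeholder to be replaced with your real values, and draining the queue obviously perturbs the run, so treat this as a one-off sanity check rather than a measurement:

    #include <mpi.h>
    #include <stdio.h>

    #define WORK_TAG   42     /* placeholder: your real message tag       */
    #define MSG_BYTES  2500   /* placeholder: upper bound on message size */

    /* Receive and count whatever has already arrived and is sitting
     * unmatched for WORK_TAG.  Messages that arrive while the loop runs
     * are counted too; there is no way to tell them apart here. */
    static void drain_and_count(MPI_Comm comm)
    {
        char buf[MSG_BYTES];
        MPI_Status status;
        int flag, rank;
        long backlog = 0;

        for (;;) {
            MPI_Iprobe(MPI_ANY_SOURCE, WORK_TAG, comm, &flag, &status);
            if (!flag)
                break;
            /* Receive (and discard) the queued message so the next probe
             * can see the one behind it. */
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, status.MPI_SOURCE, WORK_TAG,
                     comm, MPI_STATUS_IGNORE);
            backlog++;
        }

        MPI_Comm_rank(comm, &rank);
        fprintf(stderr, "rank %d: %ld unexpected messages were queued\n",
                rank, backlog);
    }

If that number is enormous on the 3/16 misbehaving ranks and near zero on the others, it lines up with what the message queue graph should be showing you.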
TV is commercial, but a 15-day evaluation license can be obtained here: http://www.totalviewtech.com/download/index.html

Five-minute videos on memory debugging and MPI debugging (which go over some, but probably not all, of the things discussed above) are available here: http://www.totalviewtech.com/support/videos.html#0

Don't hesitate to contact me if you want help; the guys at supp...@totalviewtech.com can also help and are available during a product evaluation. Oh, and I should mention that there is a free version of TotalView available for students. :)

Cheers,
Chris

Chris Gottbrath, 508-652-7735 or 774-270-3155
Director of Product Management, TotalView Technologies
chris.gottbr...@totalviewtech.com
--
Learn how to radically simplify your debugging:
http://www.totalviewtech.com/support/white_papers.html?id=163

On Apr 14, 2009, at 4:54 PM, Eugene Loh wrote:

> Shaun Jackman wrote:
>
>> Eugene Loh wrote:
>>
>>>>> On the other hand, I assume the memory imbalance we're talking
>>>>> about is rather severe. Much more than 2500 bytes to be
>>>>> noticeable, I would think. Is that really the situation you're
>>>>> imagining?
>>>>
>>>> The memory imbalance is drastic. I'm expecting 2 GB of memory use
>>>> per process. The behaving processes (13/16) use the expected
>>>> amount of memory; the remaining (3/16) misbehaving processes use
>>>> more than twice as much memory. The specifics vary from run to
>>>> run, of course. So, yes, there are gigs of unexpected memory use
>>>> to track down.
>>>
>>> Umm, how big of a message imbalance do you think you might have?
>>> (The inflection in my voice doesn't come out well in e-mail.)
>>> Anyhow, that sounds like, um, "lots" of 2500-byte messages.
>>
>> The message imbalance could be very large. Each process is running
>> pretty close to its memory capacity. If a backlog of messages
>> causes a buffer to grow to the point where the process starts
>> swapping, it will very quickly fall very far behind. There are some
>> billion 25-byte operations being sent in total, or tens of millions
>> of MPI_Send messages (at 100 operations per MPI_Send).
>
> Okay. Attached is a "little" note I wrote up illustrating memory
> profiling with Sun tools. (It's "big" because I ended up including
> a few screenshots.) The program has a bunch of one-way message
> traffic and some user-code memory allocation. I then rerun with the
> receiver sleeping before jumping into action. The messages back up
> and OMPI ends up allocating a bunch of memory. The tools show you
> who (user or OMPI) is allocating how much memory, how big of a
> message backlog develops, and how the sender starts stalling out
> (which is a good thing!). Anyhow, a useful exercise for me and
> hopefully helpful for you.
> <memory-profiling.tar.gz>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
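For anyone following along without Eugene's attachment, a bare-bones sketch of the one-way traffic pattern he describes might look something like the following (two ranks; sizes and counts are invented for illustration, and his note is the authoritative version). With default eager limits, 2500-byte sends typically go out eagerly, so while rank 1 sleeps the data piles up in Open MPI's internal buffers on the receiving side, at least until flow control pushes back on the sender, which is the stalling Eugene mentions. At that message size, a backlog of a million messages is already on the order of a couple of gigabytes.

    /* backlog_demo.c -- illustrative sketch of one-way traffic with a
     * slow receiver.  Build and run with something like:
     *   mpicc backlog_demo.c -o backlog_demo && mpirun -np 2 ./backlog_demo
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NMSG      200000  /* number of messages (made up)           */
    #define MSG_BYTES 2500    /* roughly the message size under discussion */

    int main(int argc, char **argv)
    {
        int rank, i;
        char buf[MSG_BYTES];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, sizeof buf);

        if (rank == 0) {
            /* Sender: stream small, eager-sized messages as fast as possible. */
            for (i = 0; i < NMSG; i++)
                MPI_Send(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receiver: fall behind on purpose, then drain.  While it sleeps,
             * the incoming messages accumulate in the MPI library's buffers,
             * which is the memory growth being discussed. */
            sleep(30);
            for (i = 0; i < NMSG; i++)
                MPI_Recv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 1)
            printf("receiver drained %d messages\n", NMSG);
        MPI_Finalize();
        return 0;
    }

Running a toy like this under the memory tools (Sun's or TotalView's) and watching who owns the growth is a quick way to recognize the same signature in the real application.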