Jeff, others:

I fixed the problem we were experiencing by adding a barrier. The bug occurred between a piece of code that loops over many MPI_Send calls (on the leader) and matching MPI_Recv calls (in the worker processes) to ship data from the head/leader to the processing nodes, and the allreduce that follows. My guess is that, without a barrier, this point-to-point traffic was getting mixed up with the subsequent allreduce.
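Roughly, the pattern looks like the sketch below. This is a simplified stand-in, not our actual code: the buffer sizes, counts, and tags are made up, and the MPI_Barrier marks where the workaround went.

    /* Simplified sketch only -- not the real application. Rank 0 ships
     * many messages to each worker, then everyone does an allreduce
     * (minimum over a single int). The MPI_Barrier is the workaround
     * that made the read-from-freed-memory report go away. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        enum { CHUNKS = 1000, CHUNK_LEN = 4096 };   /* made-up sizes */
        int *buf = calloc(CHUNK_LEN, sizeof(int));

        for (int i = 0; i < CHUNKS; i++) {
            if (rank == 0) {
                /* leader ships a chunk to every worker */
                for (int dst = 1; dst < size; dst++) {
                    MPI_Send(buf, CHUNK_LEN, MPI_INT, dst, i, MPI_COMM_WORLD);
                }
            } else {
                /* workers receive their chunk */
                MPI_Recv(buf, CHUNK_LEN, MPI_INT, 0, i, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* the workaround */

        /* the allreduce that the error report points at */
        int local = rank, global = 0;
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Without the barrier, this is roughly where the problem shows up for us; with it in place, the report disappears.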
The bug shows up in Valgrind and dmalloc as a read from freed memory. I might spend some time trying to make a small piece of code that reproduces it, but maybe this gives you some idea of what the issue might be, if it's something that should be fixed. Some more info: it happens even as far back as Open MPI 1.3.4, and also in the newest 1.6.3.

Steve

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres
Sent: Saturday, December 15, 2012 7:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Possible memory error

On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I'm trying to track down an instance of Open MPI writing to a freed block of memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 64-bit Intel architecture, Fedora 14.
> It occurs with a very simple reduction (allreduce minimum) over a single int value.

Can you send a reproducer program? The simpler, the better.

> I'm wondering if the Open MPI developers use tools such as valgrind / dmalloc / etc. on the releases to try to catch these things via exhaustive testing - but I understand that memory problems in C are of the nature that a mistake anywhere can propagate, so I haven't ruled out problems in our own code.
> Also, I'm wondering if anyone has suggestions on how to track this down further.

Yes, we do use such tools. Can you cite the specific file/line where the problem is occurring? The allreduce algorithms are fairly self-contained; it should be (relatively) straightforward to examine that code and see whether there's a problem with the memory allocation there.

> I'm using Allinea DDT and its built-in dmalloc, which catches the error; it appears in the second memcpy in opal_convertor_pack(), but I don't have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven't seen anything in earlier parts of the code that might have triggered memory corruption, although both Open MPI and Intel IPP do things with uninitialized values before this (according to Valgrind).

There are a number of issues that can lead to false positives for reads of uninitialized values. Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we write the whole struct down a TCP socket file descriptor anyway. Hence, it will generate a "read from uninit" warning.

2. When using OpenFabrics-based networks, tools like valgrind don't see the OS-bypass initialization of the memory (which frequently comes directly from the hardware), so they generate a lot of false "read from uninit" positives.

One thing you can try is to compile Open MPI --with-valgrind. This adds a small performance penalty, but we take extra steps to eliminate most false positives. It could help separate the wheat from the chaff in your case.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users