Jeff, others:

I fixed the problem we were experiencing by adding a barrier.
The bug occurred between a piece of code that issues many SENDs in a loop 
(from the leader) and matching RECVs (in the worker processes) to ship data 
from the head/leader to the processing nodes, and the allreduce that follows.
What I think may have been happening is that, without a barrier, this 
point-to-point communication was getting mixed up with the subsequent 
allreduce.
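
For reference, here is a minimal sketch of the pattern I mean (hypothetical 
buffer sizes, tags, and values -- not our actual code):

    /* sketch of the communication pattern described above */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int count = 1 << 16;        /* hypothetical chunk size */
        char *buf = calloc(count, 1);

        if (rank == 0) {
            /* leader ships a chunk to each worker, many sends in a loop */
            for (int w = 1; w < nprocs; w++)
                MPI_Send(buf, count, MPI_BYTE, w, 0, MPI_COMM_WORLD);
        } else {
            /* each worker receives its chunk from the leader */
            MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Barrier(MPI_COMM_WORLD);      /* adding this avoided the bug */

        int local = rank, global;
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }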

The bug shows up in Valgrind and dmalloc as a read from freed memory.

I may spend some time trying to write a small program that reproduces this, 
but perhaps this already gives you some idea of what the issue might be, if 
it's something that should be fixed.
Some more info: it happens at least as far back as Open MPI 1.3.4, and still 
in the newest 1.6.3.

Steve



-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Saturday, December 15, 2012 7:34 AM
To: Open MPI Users
Subject: Re: [OMPI users] Possible memory error

On Dec 14, 2012, at 4:31 PM, Handerson, Steven wrote:

> I'm trying to track down an instance of Open MPI writing to a freed block 
> of memory.
> This occurs with the most recent release (1.6.3) as well as 1.6, on a 
> 64-bit Intel architecture, Fedora 14.
> It occurs with a very simple reduction (allreduce minimum) over a single 
> int value.

Can you send a reproducer program?  The simpler, the better.
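
A minimal skeleton for such a reproducer, based on the description above (a 
single-int MPI_MIN allreduce), might look like the following; whether the 
allreduce alone is enough to trigger the problem is not yet known:

    /* skeleton reproducer: single-int MPI_MIN allreduce */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, local, global;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        local = rank;   /* arbitrary per-rank value */
        MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
        printf("rank %d: min = %d\n", rank, global);

        MPI_Finalize();
        return 0;
    }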

> I'm wondering if the Open MPI developers use power tools such as 
> valgrind / dmalloc / etc. on the releases to try to catch these things 
> via exhaustive testing - but I understand that memory problems in C are 
> such that anyone's mistake can propagate, so I haven't ruled out 
> problems in our own code.
> Also, I'm wondering if anyone has suggestions on how to track this down 
> further.

Yes, we do use such tools.

Can you cite the specific file/line where the problem is occurring?  The 
allreduce algorithms are fairly self-contained; it should be (relatively) 
straightforward to examine that code and see if there's a problem with the 
memory allocation there.

> I'm using Allinea DDT and its built-in dmalloc, which catches the 
> error; it appears in the second memcpy in opal_convertor_pack(), but I 
> don't have more details than that at the moment.
> All I know so far is that one of those values has been freed.
> Obviously, I haven't seen anything in earlier parts of the code that 
> might have triggered memory corruption, although both Open MPI and 
> Intel IPP do things with uninitialized values before this (according 
> to Valgrind).

There are a number of issues that can lead to false positives for reads of 
uninitialized values.  Here are two of the most common cases:

1. When using TCP, one of our data headers has a padding hole in it, but we 
write the whole struct down a TCP socket file descriptor anyway.  Hence, it 
will generate a "read from uninit" warning (see the sketch after this list).

2. When using OpenFabrics-based networks, tools like Valgrind don't see the 
OS-bypass initialization of the memory (which frequently comes directly from 
the hardware), so they generate a lot of false "read from uninit" positives.
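
As an illustration of case 1, here is a generic sketch (not the actual Open 
MPI header layout or field names):

    /* Generic illustration only -- NOT the real Open MPI header.  The
       4 padding bytes after 'type' are never initialized, so writing
       the whole struct to a socket makes Valgrind report that the
       write() syscall touches uninitialised bytes, even though every
       real field has been set. */
    #include <stdint.h>
    #include <unistd.h>

    struct frag_hdr {
        uint32_t type;     /* 4 bytes                 */
                           /* 4 bytes of padding here */
        uint64_t length;   /* 8 bytes                 */
    };

    void send_hdr(int sockfd, uint32_t type, uint64_t length)
    {
        struct frag_hdr hdr;
        hdr.type   = type;
        hdr.length = length;
        /* padding inside 'hdr' is still uninitialized at this point */
        if (write(sockfd, &hdr, sizeof(hdr)) < 0) {
            /* error handling omitted in this sketch */
        }
    }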

One thing you can try is to configure and build Open MPI --with-valgrind.  
This adds a small performance penalty, but we take extra steps to eliminate 
most false positives.  It could help separate the wheat from the chaff in 
your case.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


