On Mar 14, 2012, at 9:38 AM, Patrik Jonsson wrote:

> I'm trying to track down a spurious segmentation fault that I'm
> getting with my MPI application. I tried using valgrind, and after
> suppressing the 25,000 errors in PMPI_Init_thread and associated
> Init/Finalize functions,

I haven't looked at these in a while, but the last time I looked, many/most of them came from one of several sources:

- OS-bypass network mechanisms (i.e., the memory is ok, but valgrind isn't aware of it)
- weird optimizations from the compiler (particularly from non-gcc compilers)
- weird optimizations in glibc or other support libraries
- Open MPI sometimes specifically has "holes" of uninitialized data that we memcpy (long story short: it can be faster to copy a large region that contains a hole rather than doing 2 memcopies of the fully-initialized regions)

Other than what you cited below, are you seeing others?

What version of Open MPI is this? Did you --enable-valgrind when you configured Open MPI? This can reduce a bunch of these kinds of warnings.

> I'm left with an uninitialized write in
> PMPI_Isend (which I saw is not unexpected), plus this:
>
> ==11541== Thread 1:
> ==11541== Invalid write of size 1
> ==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)

That doesn't seem right. It's an *invalid* write, not an *uninitialized* access. Could be serious.

> ==11541==    by 0x5093447: opal_generic_simple_unpack (opal_datatype_unpack.c:420)
> ==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
> ==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:217)
> ==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler (btl_tcp_endpoint.c:718)
> ==11541==    by 0x509644F: opal_event_loop (event.c:766)
> ==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
> ==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
> ==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
> ==11541==    by 0x78C4339: boost::mpi::request::test() (in /n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)

It looks like this is happening in the TCP receive handler; it received some data from a TCP socket and is trying to copy it to the final, MPI-specified receive buffer.

If you can attach the debugger here, per chance, it might be useful to verify that OMPI is copying to the target buffer that was presumably specified in a prior call to MPI_IRECV (and also to double check that this buffer is still valid).

> ==11541==    by 0x4B5DA3: mcrx::mpi_master<test_xfer>::process_handshakes() (mpi_master_impl.h:216)
> ==11541==    by 0x4B5557: mcrx::mpi_master<test_xfer>::run() (mpi_master_impl.h:541)
> ==11541==  Address 0x7feffb327 is just below the stack ptr. To suppress, use: --workaround-gcc296-bugs=yes
>
> The test in question tests for a single int being sent between the
> tasks. This is done using the Boost::MPI skeleton/content mechanism,
> and the receive is done to an element of a std::vector, so there's no
> reason it should unpack anywhere near the stack ptr. However, an int
> should be size 4.

Is there any chance that you can provide a small reproducer in C without all the Boost stuff?

-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
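As a rough illustration of the kind of C-only reproducer asked for above, the sketch below sends a single int from rank 1 to rank 0 with nonblocking calls, receives it into heap memory (standing in for the std::vector element), and completes it by polling MPI_Test, which is roughly the path in the stack trace (PMPI_Test -> opal_progress -> the TCP receive handler). The tag, buffer size, and ranks are arbitrary placeholders, not code from the original application.

/* Hypothetical reproducer sketch: one int sent rank 1 -> rank 0 via
 * nonblocking send/receive, received into a heap buffer and completed
 * by polling MPI_Test. Placeholder values throughout. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, flag = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Heap buffer, so any write near the stack pointer would be wrong. */
        int *recvbuf = malloc(16 * sizeof(int));
        MPI_Irecv(&recvbuf[3], 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);
        while (!flag) {
            /* Polling MPI_Test drives the progress engine, as in the trace. */
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }
        printf("received %d\n", recvbuf[3]);
        free(recvbuf);
    } else if (rank == 1) {
        int value = 17;
        MPI_Isend(&value, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Running two ranks of this over the TCP BTL under valgrind (something like "mpirun -np 2 valgrind ./repro", binary name arbitrary) would show whether the invalid write reproduces with Boost out of the picture.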