On Mar 14, 2012, at 9:38 AM, Patrik Jonsson wrote:

> I'm trying to track down a spurious segmentation fault that I'm
> getting with my MPI application. I tried using valgrind, and after
> suppressing the 25,000 errors in PMPI_Init_thread and associated
> Init/Finalize functions,

I haven't looked at these in a while, but the last time I looked, many/most of them came from one of several sources:

- OS-bypass network mechanisms (i.e., the memory is ok, but valgrind isn't aware of it)
- weird optimizations from the compiler (particularly from non-gcc compilers)
- weird optimizations in glibc or other support libraries
- Open MPI sometimes specifically has "holes" of uninitialized data that we memcpy (long story short: it can be faster to copy a large region that contains a hole rather than doing 2 memcopies of the fully-initialized regions)

Other than what you cited below, are you seeing others?

What version of Open MPI is this? Did you --enable-valgrind when you configured Open MPI? This can reduce a bunch of these kinds of warnings.

> I'm left with an uninitialized write in
> PMPI_Isend (which I saw is not unexpected), plus this:
>
> ==11541== Thread 1:
> ==11541== Invalid write of size 1
> ==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)

That doesn't seem right. It's an *invalid* write, not an *uninitialized* access. Could be serious.

> ==11541==    by 0x5093447: opal_generic_simple_unpack (opal_datatype_unpack.c:420)
> ==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
> ==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:217)
> ==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler (btl_tcp_endpoint.c:718)
> ==11541==    by 0x509644F: opal_event_loop (event.c:766)
> ==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
> ==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
> ==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
> ==11541==    by 0x78C4339: boost::mpi::request::test() (in /n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)

It looks like this is happening in the TCP receive handler; it received some data from a TCP socket and is trying to copy it to the final, MPI-specified receive buffer.

If you can attach the debugger here, per chance, it might be useful to verify that OMPI is copying to the target buffer that was presumably specified in a prior call to MPI_IRECV (and also to double check that this buffer is still valid).

> ==11541==    by 0x4B5DA3: mcrx::mpi_master<test_xfer>::process_handshakes() (mpi_master_impl.h:216)
> ==11541==    by 0x4B5557: mcrx::mpi_master<test_xfer>::run() (mpi_master_impl.h:541)
> ==11541==  Address 0x7feffb327 is just below the stack ptr. To suppress, use: --workaround-gcc296-bugs=yes
>
> The test in question tests for a single int being sent between the
> tasks. This is done using the Boost::MPI skeleton/content mechanism,
> and the receive is done to an element of a std::vector, so there's no
> reason it should unpack anywhere near the stack ptr. However, an int
> should be size 4.

Is there any chance that you can provide a small reproducer in C without all the Boost stuff?

-- 
Jeff Squyres
jsquy...@cisco.com

For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
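As a rough illustration of the kind of C-only reproducer asked for above, the sketch below sends a single int from rank 1 to rank 0 with nonblocking calls, receives it into heap memory (standing in for the std::vector element), and completes it by polling MPI_Test, which is roughly the path in the stack trace (PMPI_Test -> opal_progress -> the TCP receive handler). The tag, buffer size, and ranks are arbitrary placeholders, not code from the original application.

/* Hypothetical reproducer sketch: one int sent rank 1 -> rank 0 via
 * nonblocking send/receive, received into a heap buffer and completed
 * by polling MPI_Test. Placeholder values throughout. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, flag = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Heap buffer, so any write near the stack pointer would be wrong. */
        int *recvbuf = malloc(16 * sizeof(int));
        MPI_Irecv(&recvbuf[3], 1, MPI_INT, 1, 42, MPI_COMM_WORLD, &req);
        while (!flag) {
            /* Polling MPI_Test drives the progress engine, as in the trace. */
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }
        printf("received %d\n", recvbuf[3]);
        free(recvbuf);
    } else if (rank == 1) {
        int value = 17;
        MPI_Isend(&value, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Running two ranks of this over the TCP BTL under valgrind (something like "mpirun -np 2 valgrind ./repro", binary name arbitrary) would show whether the invalid write reproduces with Boost out of the picture.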