[OMPI users] Problem including C MPI code from C++ using C linkage
Hi all,

I have a C MPI code that I need to link into my C++ code. As usual, from my C++ code, I do

    extern "C" {
    #include "c-code.h"
    }

where c-code.h includes, among other things, mpi.h. This doesn't work, because it appears mpi.h tries to detect whether it's being compiled as C or C++ and includes mpicxx.h if the language is C++. The problem is that this doesn't work with C linkage, so the compilation dies with errors like:

    mpic++ -I. -I$HOME/include/libPJutil -I$HOME/code/arepo -m32 arepotest.cc -I$HOME/include -I/sw/include -L/sw/lib -L$HOME/code/arepo -larepo -lhdf5 -lgsl -lgmp -lmpi
    In file included from /usr/include/c++/4.2.1/map:65,
                     from /sw/include/openmpi/ompi/mpi/cxx/mpicxx.h:36,
                     from /sw/include/mpi.h:1886,
                     from /Users/patrik/code/arepo/allvars.h:23,
                     from /Users/patrik/code/arepo/proto.h:2,
                     from arepo_grid.h:36,
                     from arepotest.cc:3:
    /usr/include/c++/4.2.1/bits/stl_tree.h:134: error: template with C linkage
    /usr/include/c++/4.2.1/bits/stl_tree.h:145: error: declaration of C function 'const std::_Rb_tree_node_base* std::_Rb_tree_increment(const std::_Rb_tree_node_base*)' conflicts with
    /usr/include/c++/4.2.1/bits/stl_tree.h:142: error: previous declaration 'std::_Rb_tree_node_base* std::_Rb_tree_increment(std::_Rb_tree_node_base*)' here
    /usr/include/c++/4.2.1/bits/stl_tree.h:151: error: declaration of C function 'const std::_Rb_tree_node_base* std::_Rb_tree_decrement(const std::_Rb_tree_node_base*)' conflicts with
    /usr/include/c++/4.2.1/bits/stl_tree.h:148: error: previous declaration 'std::_Rb_tree_node_base* std::_Rb_tree_decrement(std::_Rb_tree_node_base*)' here
    /usr/include/c++/4.2.1/bits/stl_tree.h:153: error: template with C linkage
    /usr/include/c++/4.2.1/bits/stl_tree.h:223: error: template with C linkage
    /usr/include/c++/4.2.1/bits/stl_tree.h:298: error: template with C linkage
    /usr/include/c++/4.2.1/bits/stl_tree.h:304: error: template with C linkage
    /usr/include/c++/4.2.1/bits/stl_tree.h:329: error: template with C linkage

etc. etc.

It seems a bit presumptuous of mpi.h to include mpicxx.h just because __cplusplus is defined, since that makes it impossible to link C MPI code from C++. I've had to resort to something like

    #ifdef __cplusplus
    #undef __cplusplus
    #include <mpi.h>
    #define __cplusplus
    #else
    #include <mpi.h>
    #endif

in c-code.h, which seems to work but isn't exactly smooth. Is there another way around this, or has linking C MPI code with C++ never come up before?

Thanks,

/Patrik Jonsson
Re: [OMPI users] Problem including C MPI code from C++ using C linkage
Hi everyone,

Thanks for the suggestions.

On Thu, Sep 2, 2010 at 6:41 AM, Jeff Squyres wrote:
> On Aug 31, 2010, at 5:39 PM, Patrik Jonsson wrote:
>
>> It seems a bit presumptuous of mpi.h to just include mpicxx.h just
>> because __cplusplus is defined, since that makes it impossible to link
>> C MPI code from C++.
>
> The MPI standard requires that mpi.h work in both C and C++ applications.
> It also requires that mpi.h include all the C++ binding prototypes when
> relevant. Hence, there's not much we can do here.

Ah, I see. That seems unfortunate, but I guess it's out of your hands.

> As Lisandro noted, it's probably best to separate the mpi.h include
> outside of your c-code.h file.

I tried the suggestion of simply including mpi.h in C++ mode before including c-code.h, and that works. I should have thought of that. (c-code.h still needs to include mpi.h because it's also a standalone code that uses MPI.)

> Or, you can make your c-code.h file be safe for C++ by doing something like
> this in c-code.h:
>
>     #include <mpi.h>
>
>     #ifdef __cplusplus
>     extern "C" {
>     #endif
>     ...all your C declarations...
>     #ifdef __cplusplus
>     }
>     #endif
>
> This is probably preferable because then your c-code.h is safe for both C
> and C++, and you keep mpi.h contained inside it (assumedly preserving some
> abstraction barriers in your code by keeping the MPI prototypes bundled with
> c-code.h).

This is also a good suggestion, but I have only scant control over what's in c-code.h, so it's a bit invasive. In any case I can live with including mpi.h myself first, so I'll go with that solution.

Regards,

/Patrik
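For reference, here is a minimal sketch of the solution settled on in this exchange. It is my reconstruction, not code from the thread, and it assumes only that mpi.h has the usual include guard, so that c-code.h's own #include of mpi.h is a no-op the second time around:

    /* Sketch of the chosen workaround: pull in mpi.h with C++ linkage first,
     * so the C++ bindings (mpicxx.h) are processed outside any extern "C" block. */
    #include <mpi.h>      /* compiled as C++, so mpicxx.h is legally included here */

    extern "C" {
    /* c-code.h also includes mpi.h, but mpi.h's include guard means it is not
     * processed a second time inside the extern "C" block. */
    #include "c-code.h"
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... call the C library's routines here ... */
        MPI_Finalize();
        return 0;
    }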
[OMPI users] Asymmetric performance with nonblocking, multithreaded communications
Hi all,

I'm seeing performance issues I don't understand in my multithreaded MPI code, and I was hoping someone could shed some light on this.

The code structure is as follows: A computational domain is decomposed into MPI tasks. Each MPI task has a "master thread" that receives messages from the other tasks and puts those into a local, concurrent queue. The tasks then have a few "worker threads" that process the incoming messages and when necessary send them to other tasks. So for each task, there is one thread doing receives and N (typically number of cores - 1) threads doing sends. All messages are nonblocking, so the workers just post the sends and continue with computation, and the master repeatedly does a number of test calls to check for incoming messages (there are different flavors of these messages, so it does several tests).

Currently I'm just testing, so I'm running 2 tasks using the sm btl on one node, with 5 worker threads each. (The node has 12 cores.) What happens is that task 0 receives everything that is sent by task 1 (the number of sends and receives roughly match). However, task 1 only receives about 25% of the messages sent by task 0. Task 0 apparently has no problem keeping up with receiving the messages from task 1, even though the throughput in that direction is actually a bit higher. In less than a minute, there are hundreds of thousands of pending messages (but only in one direction). At this point, throughput drops by orders of magnitude to <1000 msg/s. Using PAPI, I can see that the receiving threads are at that point basically stalled on MPI tests and receives, and stopping them in the debugger seems to indicate that they are trying to acquire a lock. However, the test/receive it is stalling on is NOT the test for the huge number of pending messages, but one for another class of much rarer ones.

I realize it's hard to know without looking at the code (it's difficult to whittle it down to a workable example), but does anyone have any ideas what is happening and how it can be fixed? I don't know if there are any problems with the basic structure of the code. For example, are the simultaneous sends/receives in different threads bound to cause lock contention on the MPI side? How does the MPI library decide which thread is used for actual message processing? Does every nonblocking MPI call just "steal" a time slice to work on communications, or does MPI have its own thread dedicated to message processing? What I would like is for the master thread to devote all its time to communication, while the sends by the worker threads should return as fast as possible. Would it be better for the thread doing receives to do one large wait instead of repeatedly testing different sets of requests, or would that acquire some lock and then block the threads trying to post a send?

I've looked around for info on how to best structure multithreaded MPI code, but haven't had much luck in finding anything.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, using Intel 12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.

ompi_info.gz
Description: GNU Zip compressed data
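To make the structure described above concrete, here is a minimal sketch, not the actual code: message sizes, tags, the shutdown logic, and the two-rank setup are all illustrative. It shows one receiving master thread plus worker threads posting nonblocking sends under MPI_THREAD_MULTIPLE.

// Minimal sketch of the described pattern: one receiving "master" thread
// plus worker threads posting nonblocking sends. Names and sizes are
// illustrative, not taken from the actual application.
#include <mpi.h>
#include <atomic>
#include <thread>
#include <vector>

static std::atomic<bool> done{false};

void worker(int peer)
{
    // Workers post nonblocking sends and go back to computing.
    double payload[16] = {0};
    MPI_Request req;
    MPI_Isend(payload, 16, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &req);
    // ... do computation, eventually complete the request ...
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

void master(int peer)
{
    // Master pre-posts one receive per source and repeatedly tests it.
    double buf[16];
    MPI_Request recv_req;
    MPI_Irecv(buf, 16, MPI_DOUBLE, peer, /*tag=*/0, MPI_COMM_WORLD, &recv_req);
    while (!done.load()) {
        int flag = 0;
        MPI_Test(&recv_req, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            // ... enqueue the message for the workers, then re-post ...
            MPI_Irecv(buf, 16, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &recv_req);
        }
    }
    // Shutdown is simplified for the sketch: cancel the outstanding receive.
    MPI_Cancel(&recv_req);
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;          // two tasks, as in the test described above

    std::thread recv_thread(master, peer);
    std::vector<std::thread> workers;
    for (int i = 0; i < 5; ++i)   // 5 worker threads, as in the post
        workers.emplace_back(worker, peer);

    for (auto &w : workers) w.join();
    done = true;
    recv_thread.join();
    MPI_Finalize();
    return 0;
}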
Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications
Replying to my own post, I'd like to add some info: After making the master thread put more of a premium on receiving the missing messages, the problem went away. Both tasks now appear to keep up with the messages sent from the other.

However, after about a minute and ~1.5e6 messages exchanged, both tasks segfault after printing the following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the default case of a switch on fragment type. There's a comment there that says:

    * This code path should presumably never be called.
    * It's unclear if it should exist or, if so, how it should be written.
    * If we want to return it to the sending process,
    * we have to figure out who the sender is.
    * It seems we need to subtract the mask bits.
    * Then, hopefully this is an sm header that has an smp_rank field.
    * Presumably that means the received header was relative.
    * Or, maybe this code should just be removed.

That seems worrisome, like whoever wrote the code didn't know what was going on...

I've gotten that error previously, but only when millions of outstanding messages had built up. Now, that's not the case. Does anyone have any idea what could be going on here?

Thanks,

/Patrik J.
Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications
Hi Yiannis,

On Fri, Dec 9, 2011 at 10:21 AM, Yiannis Papadopoulos wrote:
> Patrik Jonsson wrote:
>>
>> Hi all,
>>
>> I'm seeing performance issues I don't understand in my multithreaded
>> MPI code, and I was hoping someone could shed some light on this.
>>
>> The code structure is as follows: A computational domain is decomposed
>> into MPI tasks. Each MPI task has a "master thread" that receives
>> messages from the other tasks and puts those into a local, concurrent
>> queue. The tasks then have a few "worker threads" that process the
>> incoming messages and when necessary send them to other tasks. So for
>> each task, there is one thread doing receives and N (typically number
>> of cores - 1) threads doing sends. All messages are nonblocking, so the
>> workers just post the sends and continue with computation, and the
>> master repeatedly does a number of test calls to check for incoming
>> messages (there are different flavors of these messages, so it does
>> several tests).
>
> When do you do the MPI_Test on the Isends? I have had performance issues on
> a number of systems when I used a single queue of MPI_Requests that kept
> Isends to different ranks and tested them one by one. It appears that
> some messages are sent out more efficiently if you test them.

There are 3 classes of messages that may arrive. The requests for each are in a vector, and I use boost::mpi::test_some (which I assume just calls MPI_Testsome) to test them in a round-robin fashion.

> I found that either using MPI_Testsome, or having a map (key = rank,
> value = queue of MPI_Requests) and testing the first MPI_Request for each
> key, resolved this issue.

In my case, I know that the overwhelming traffic volume is one kind of message. What I ended up doing was to simply repeat the test for that message immediately if the preceding test succeeded, up to 1000 times, before again checking the other requests. This appears to enable the task to keep up with the incoming traffic.

I guess another possibility would be to have several slots for the incoming messages. Right now I only post one irecv per source task. By posting a couple, more incoming messages would find a matching recv already posted, and one test could match more of them. Since that makes the logic more complicated, I didn't try that.
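For concreteness, here is a sketch of the prioritization described above, written as a compilable fragment with plain MPI_Test calls rather than the Boost.MPI wrappers used in the real code. The slot layout, tags, and helper names are illustrative; only the 1000-iteration drain limit comes from the post.

// Sketch: give the dominant message class priority by draining it (up to
// 1000 rounds of successful tests) before testing the rarer request classes.
#include <mpi.h>
#include <vector>

struct RecvSlot {
    MPI_Request req;
    std::vector<char> buf;
};

// Re-post the receive for a slot after its message has been consumed.
// This sketch assumes slot i holds the receive posted for source rank i.
void repost(RecvSlot &slot, int source, int tag)
{
    MPI_Irecv(slot.buf.data(), (int)slot.buf.size(), MPI_BYTE,
              source, tag, MPI_COMM_WORLD, &slot.req);
}

void poll(std::vector<RecvSlot> &bulk,   // the high-volume message class
          std::vector<RecvSlot> &rare)   // the rarer classes
{
    const int kBulkTag = 1, kRareTag = 2;   // illustrative tags

    // Drain the bulk class first: keep testing as long as tests succeed,
    // up to 1000 rounds, before looking at anything else.
    for (int n = 0; n < 1000; ++n) {
        bool progressed = false;
        for (size_t i = 0; i < bulk.size(); ++i) {
            int flag = 0;
            MPI_Test(&bulk[i].req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                // ... hand the message to a worker queue ...
                repost(bulk[i], (int)i, kBulkTag);
                progressed = true;
            }
        }
        if (!progressed) break;   // nothing pending, stop draining
    }

    // Then give the rare classes one round of tests.
    for (size_t i = 0; i < rare.size(); ++i) {
        int flag = 0;
        MPI_Test(&rare[i].req, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            // ... handle the rare message ...
            repost(rare[i], (int)i, kRareTag);
        }
    }
}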
[OMPI users] mca_btl_sm_component_progress read an unknown type of header
Hi all,

This question was buried in an earlier thread, and I got no replies, so I'll try reposting it with a more enticing subject.

I have a multithreaded Open MPI code where each task has N+1 threads; the N threads send nonblocking messages that are received by the 1 thread on the other tasks. When I run this code with 2 tasks, 5+1 threads each, on a single node with 12 cores, after about a million messages have been exchanged, the tasks segfault after printing the following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the default case of a switch on fragment type. There's a comment there that says:

    * This code path should presumably never be called.
    * It's unclear if it should exist or, if so, how it should be written.
    * If we want to return it to the sending process,
    * we have to figure out who the sender is.
    * It seems we need to subtract the mask bits.
    * Then, hopefully this is an sm header that has an smp_rank field.
    * Presumably that means the received header was relative.
    * Or, maybe this code should just be removed.

It seems like whoever wrote that code didn't know quite what was going on, and I guess the assumption was wrong, because dereferencing that result seems to be what's causing the segfault.

Does anyone here know what could cause this error? If I run the code with the tcp btl instead of sm, it runs fine, albeit with somewhat lower performance.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, using Intel 12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.

ompi_info.gz
Description: GNU Zip compressed data
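For anyone who hits the same error: the workaround mentioned above, falling back to the tcp btl instead of sm, can be selected at launch time along these lines (the executable name and process count are placeholders):

    mpirun --mca btl tcp,self -np 2 ./my_mpi_app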
[OMPI users] invalid write in opal_generic_simple_unpack
Hi,

I'm trying to track down a spurious segmentation fault that I'm getting with my MPI application. I tried using valgrind, and after suppressing the 25,000 errors in PMPI_Init_thread and associated Init/Finalize functions, I'm left with an uninitialized write in PMPI_Isend (which I saw is not unexpected), plus this:

==11541== Thread 1:
==11541== Invalid write of size 1
==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)
==11541==    by 0x5093447: opal_generic_simple_unpack (opal_datatype_unpack.c:420)
==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:217)
==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler (btl_tcp_endpoint.c:718)
==11541==    by 0x509644F: opal_event_loop (event.c:766)
==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
==11541==    by 0x78C4339: boost::mpi::request::test() (in /n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)
==11541==    by 0x4B5DA3: mcrx::mpi_master::process_handshakes() (mpi_master_impl.h:216)
==11541==    by 0x4B5557: mcrx::mpi_master::run() (mpi_master_impl.h:541)
==11541==  Address 0x7feffb327 is just below the stack ptr.  To suppress, use: --workaround-gcc296-bugs=yes

The test in question tests for a single int being sent between the tasks. This is done using the Boost::MPI skeleton/content mechanism, and the receive is done into an element of a std::vector, so there's no reason it should unpack anywhere near the stack ptr. Also, an int should be size 4, not 1, so the write looks doubly suspicious.

This looks relevant because the segfault usually happens in one of the calls to PMPI_Test. If somehow the data is unpacked to somewhere around the stack pointer, that certainly seems like a possible cause.

If anyone can give me some ideas for what could cause this and how to track it down, I'd appreciate it. I'm running out of ideas here.

Regards,

/Patrik J.
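For readers unfamiliar with the mechanism being discussed, here is a rough sketch, my reconstruction rather than the actual application, of the kind of exchange described: a single int shipped with Boost.MPI's skeleton/content machinery, received into an element of a std::vector and completed by polling request::test(), which is what ends up in PMPI_Test in the valgrind trace. Buffer sizes, tags, and variable names are invented.

// Sketch of a skeleton/content exchange of a single int (illustrative only).
#include <boost/mpi.hpp>
#include <vector>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);
    mpi::communicator world;          // run this sketch with exactly 2 ranks

    std::vector<int> slots(4, 0);     // the receive target is an element of a std::vector

    if (world.rank() == 0) {
        // Sender: ship the skeleton once, then send only the content.
        world.send(1, 0, mpi::skeleton(slots[0]));
        slots[0] = 42;
        world.send(1, 1, mpi::get_content(slots[0]));
    } else if (world.rank() == 1) {
        // Receiver: skeleton first, then a nonblocking receive of the content,
        // completed by polling request::test() (which calls PMPI_Test underneath).
        world.recv(0, 0, mpi::skeleton(slots[0]));
        mpi::request req = world.irecv(0, 1, mpi::get_content(slots[0]));
        while (!req.test()) { /* poll; real code does other work here */ }
    }
    return 0;
}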
Re: [OMPI users] invalid write in opal_generic_simple_unpack
On Wed, Mar 14, 2012 at 3:43 PM, Jeffrey Squyres wrote:
> On Mar 14, 2012, at 9:38 AM, Patrik Jonsson wrote:
>
>> I'm trying to track down a spurious segmentation fault that I'm
>> getting with my MPI application. I tried using valgrind, and after
>> suppressing the 25,000 errors in PMPI_Init_thread and associated
>> Init/Finalize functions,
>
> I haven't looked at these in a while, but the last time I looked, many/most
> of them came from one of several sources:
>
> - OS-bypass network mechanisms (i.e., the memory is ok, but valgrind isn't
>   aware of it)
> - weird optimizations from the compiler (particularly from non-gcc compilers)
> - weird optimizations in glib or other support libraries
> - Open MPI sometimes specifically has "holes" of uninitialized data that we
>   memcpy (long story short: it can be faster to copy a large region that
>   contains a hole rather than doing 2 memcopies of the fully-initialized
>   regions)
>
> Other than what you cited below, are you seeing others? What version of Open
> MPI is this? Did you --enable-valgrind when you configured Open MPI? This
> can reduce a bunch of these kinds of warnings.

I didn't install Open MPI myself, but I doubt it was configured with this.

>> I'm left with an uninitialized write in
>> PMPI_Isend (which I saw is not unexpected), plus this:
>>
>> ==11541== Thread 1:
>> ==11541== Invalid write of size 1
>> ==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)
>
> That doesn't seem right. It's an *invalid* write, not an *uninitialized*
> access. Could be serious.
>
>> ==11541==    by 0x5093447: opal_generic_simple_unpack (opal_datatype_unpack.c:420)
>> ==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
>> ==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:217)
>> ==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler (btl_tcp_endpoint.c:718)
>> ==11541==    by 0x509644F: opal_event_loop (event.c:766)
>> ==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
>> ==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
>> ==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
>> ==11541==    by 0x78C4339: boost::mpi::request::test() (in /n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)
>
> It looks like this is happening in the TCP receive handler; it received some
> data from a TCP socket and is trying to copy it to the final, MPI-specified
> receive buffer.
>
> If you can attach the debugger here, per chance, it might be useful to verify
> that OMPI is copying to the target buffer that was assumedly specified in a
> prior call to MPI_IRECV (and also double check that this buffer is still
> valid).

The problem was that there were many sends and this error was sporadic, so it was hard to know whether I stopped in the right unpack.

I think I tracked it down, though. The problem was in the Boost.MPI "skeleton/content" feature (which has bitten me in the past). Essentially, any serialization operator that uses a temporary will silently give incorrect results when used with skeleton/content, because the get_content operator captures the location of the temporary when building the custom MPI datatype, which then causes the data to get deposited in some invalid location. There is scant documentation of this feature, and the above conclusion is my own, but I'm pretty sure it's correct. Even the built-in Boost serializations aren't safe: serializing an enum, for example, uses a temporary and will thus not work correctly with these operators.
> Is there any chance that you can provide a small reproducer in C without all
> the Boost stuff?

As is clear from the above, no. The problem was in my code and Boost.

I do have a more general question, though: Is there a good way to back out the location of the request object if I stop deep in the bowels of MPI? As I understand it, just because the user-level call is a certain MPI_Test doesn't mean that under the hood it's working on other requests, and this nonlocality makes it difficult to track down errors.

Thanks,

/Patrik
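To make the diagnosed failure mode concrete, here is a hedged sketch, my own reconstruction rather than code from the thread, of a serialize function that goes through a temporary: the exact pattern described above as breaking under skeleton/content. The type, member, and enum names are invented.

// Sketch of the pitfall: serializing through a stack temporary.
#include <boost/mpi.hpp>

enum color { red, green, blue };

struct message {
    color c;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        // BROKEN under skeleton/content: 'tmp' is a stack temporary. When
        // boost::mpi::get_content() walks this function to build the custom
        // MPI datatype, it records the address of 'tmp', so later content
        // transfers read or write a dead stack location (consistent with the
        // "just below the stack ptr" write valgrind reported earlier).
        int tmp = static_cast<int>(c);
        ar & tmp;
        c = static_cast<color>(tmp);

        // A safer pattern for types used with skeleton/content is to
        // serialize only members whose storage lives inside the object
        // itself, e.g. store the value as an int member and convert at
        // the point of use.
    }
};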