[OMPI users] Problem including C MPI code from C++ using C linkage

2010-08-31 Thread Patrik Jonsson
Hi all,

I have a C MPI code that I need to link into my C++ code. As usual,
from my C++ code, I do

extern "C" {
#include "c-code.h"
}

where c-code.h includes, among other things, mpi.h.

This doesn't work, because mpi.h detects whether it's being compiled
as C or C++ and includes mpicxx.h if the language is C++. The problem
is that the C++ bindings can't be compiled under C linkage, so the
compilation dies with errors like:

mpic++  -I. -I$HOME/include/libPJutil -I$HOME/code/arepo -m32
arepotest.cc -I$HOME/include -I/sw/include -L/sw/lib
-L$HOME/code/arepo -larepo -lhdf5  -lgsl -lgmp -lmpi
In file included from /usr/include/c++/4.2.1/map:65,
                from /sw/include/openmpi/ompi/mpi/cxx/mpicxx.h:36,
                from /sw/include/mpi.h:1886,
                from /Users/patrik/code/arepo/allvars.h:23,
                from /Users/patrik/code/arepo/proto.h:2,
                from arepo_grid.h:36,
                from arepotest.cc:3:
/usr/include/c++/4.2.1/bits/stl_tree.h:134: error: template with C linkage
/usr/include/c++/4.2.1/bits/stl_tree.h:145: error: declaration of C
function 'const std::_Rb_tree_node_base* std::_Rb_tree_increment(const
std::_Rb_tree_node_base*)' conflicts with
/usr/include/c++/4.2.1/bits/stl_tree.h:142: error: previous
declaration 'std::_Rb_tree_node_base*
std::_Rb_tree_increment(std::_Rb_tree_node_base*)' here
/usr/include/c++/4.2.1/bits/stl_tree.h:151: error: declaration of C
function 'const std::_Rb_tree_node_base* std::_Rb_tree_decrement(const
std::_Rb_tree_node_base*)' conflicts with
/usr/include/c++/4.2.1/bits/stl_tree.h:148: error: previous
declaration 'std::_Rb_tree_node_base*
std::_Rb_tree_decrement(std::_Rb_tree_node_base*)' here
/usr/include/c++/4.2.1/bits/stl_tree.h:153: error: template with C linkage
/usr/include/c++/4.2.1/bits/stl_tree.h:223: error: template with C linkage
/usr/include/c++/4.2.1/bits/stl_tree.h:298: error: template with C linkage
/usr/include/c++/4.2.1/bits/stl_tree.h:304: error: template with C linkage
/usr/include/c++/4.2.1/bits/stl_tree.h:329: error: template with C linkage
etc. etc.

It seems a bit presumptuous of mpi.h to include mpicxx.h just
because __cplusplus is defined, since that makes it impossible to
include C MPI code from C++ inside an extern "C" block.
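Judging from the include trace above, mpi.h evidently does something
morally equivalent to the following around its line 1886 (paraphrased,
not the literal source), and it's the templates pulled in by mpicxx.h
that blow up under C linkage:

#if defined(__cplusplus)
#include "openmpi/ompi/mpi/cxx/mpicxx.h"  /* C++ bindings: templates etc. */
#endif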

I've had to resort to something like

#ifdef __cplusplus
#undef __cplusplus
#include <mpi.h>
#define __cplusplus
#else
#include <mpi.h>
#endif

in c-code.h, which seems to work but isn't exactly smooth. Is there
another way around this, or has linking C MPI code with C++ never come
up before?

Thanks,

/Patrik Jonsson



Re: [OMPI users] Problem including C MPI code from C++ using C linkage

2010-09-03 Thread Patrik Jonsson
Hi everyone,

Thanks for the suggestions.

On Thu, Sep 2, 2010 at 6:41 AM, Jeff Squyres wrote:
> On Aug 31, 2010, at 5:39 PM, Patrik Jonsson wrote:
>
>> It seems a bit presumptuous of mpi.h to just include mpicxx.h just
>> because __cplusplus is defined, since that makes it impossible to link
>> C MPI code from C++.
>
> The MPI standard requires that <mpi.h> work in both C and C++ applications.  
> It also requires that <mpi.h> include all the C++ binding prototypes when 
> relevant.  Hence, there's not much we can do here.

Ah, I see. That seems unfortunate, but I guess it's out of your hands.

> As Lisandro noted, it's probably best to separate <mpi.h> outside of your 
> c-code.h file.

I tried the suggestion of simply including mpi.h in C++ mode before
including c-code.h, and that works. I should have thought of that.
(c-code.h still needs to include mpi.h itself because it's also used
as a standalone code that uses MPI.)
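For the record, the ordering that works is roughly this (a sketch; it
relies on mpi.h's include guard making the second include inside
c-code.h a no-op):

// mpi.h is seen with normal C++ linkage first and pulls in mpicxx.h safely.
#include <mpi.h>

// c-code.h's own #include <mpi.h> is now skipped by the include guard, so
// nothing from mpicxx.h ends up inside the extern "C" block.
extern "C" {
#include "c-code.h"
}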

>
> Or, you can make your header file safe for C++ by doing something like 
> this in c-code.h:
>
> #include <mpi.h>
>
> #ifdef __cplusplus
> extern "C" {
> #endif
> ...all your C declarations...
> #ifdef __cplusplus
> }
> #endif
>
> This is probably preferable because then your c-code.h is safe for both C 
> and C++, and you keep <mpi.h> contained inside it (assumedly preserving some 
> abstraction barriers in your code by keeping the MPI prototypes bundled with 
> c-code.h).

This is also a good suggestion, but I have only scant control over
what's in c-code.h, so it's a bit invasive.
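(For reference, a fully guarded c-code.h along those lines might look
like the sketch below; the declaration is just a placeholder, not the
real contents of the header.)

#ifndef C_CODE_H
#define C_CODE_H

#include <mpi.h>   /* outside the linkage block; mpi.h handles C++ itself */

#ifdef __cplusplus
extern "C" {
#endif

void run_simulation(MPI_Comm comm);  /* placeholder for the real declarations */

#ifdef __cplusplus
}
#endif

#endif /* C_CODE_H */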

In any case I can live with including mpi.h myself first, so I'll go
with that solution.

Regards,

/Patrik



[OMPI users] Asymmetric performance with nonblocking, multithreaded communications

2011-11-30 Thread Patrik Jonsson
Hi all,

I'm seeing performance issues I don't understand in my multithreaded
MPI code, and I was hoping someone could shed some light on this.

The code structure is as follows: A computational domain is decomposed
into MPI tasks. Each MPI task has a "master thread" that receives
messages from the other tasks and puts them into a local, concurrent
queue. Each task also has a few "worker threads" that process the
incoming messages and, when necessary, send messages on to other tasks.
So for each task there is one thread doing receives and N (typically
the number of cores minus 1) threads doing sends. All communication is
nonblocking, so the workers just post their sends and continue with
computation, while the master repeatedly makes a number of test calls
to check for incoming messages (there are different flavors of these
messages, so it does several tests).
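(Schematically, the master thread's loop looks something like the
sketch below; the helper functions and names are placeholders, not the
actual code.)

#include <mpi.h>
#include <vector>

bool shutdown_requested();                      // placeholders assumed to exist
void enqueue_message(int slot);                 // push onto the concurrent queue
void repost_irecv(std::vector<MPI_Request>& reqs, int slot);

// One pending MPI_Irecv per source task and message flavor lives in recv_reqs.
void master_loop(std::vector<MPI_Request>& recv_reqs)
{
    std::vector<int> indices(recv_reqs.size());
    std::vector<MPI_Status> statuses(recv_reqs.size());

    while (!shutdown_requested()) {
        int completed = 0;
        MPI_Testsome((int)recv_reqs.size(), &recv_reqs[0],
                     &completed, &indices[0], &statuses[0]);
        for (int i = 0; i < completed; ++i) {
            enqueue_message(indices[i]);
            repost_irecv(recv_reqs, indices[i]);
        }
    }
}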

Currently I'm just testing, so I'm running 2 tasks using the sm btl on
one node, and 5 worker threads. (Node has 12 cores.) What happens is
that task 0 receives everything that is sent by task 1 (number of
sends and receives roughly match). However, task 1 only receives about
25% of the messages sent by task 0. Task 0 apparently has no problem
keeping up with receiving the messages from task 1, even though the
throughput in that direction is actually a bit higher. In less than a
minute, there are hundreds of thousands of pending messages (but only
in one direction). At this point, throughput drops by orders of
magnitude to <1000 msg/s. Using PAPI, I can see that the receiving
threads are at that point basically stalled on MPI tests and receives,
and stopping them in the debugger seems to indicate that they are
trying to acquire a lock. However, the test/receive they are stalling
on is NOT the test for the huge number of pending messages, but a test
for another, much rarer class of messages.

I realize it's hard to know without looking at the code (it's
difficult to whittle it down to a workable example), but does anyone
have any ideas about what is happening and how it can be fixed? I don't
know if there are problems with the basic structure of the code.
For example, are the simultaneous send/receives in different threads
bound to cause lock contention on the MPI side? How does the MPI
library decide which thread is used for actual message processing?
Does every nonblocking MPI call just "steal" a time slice to work on
communications, or does MPI have its own thread dedicated to message
processing? What I would like is for the master thread to devote all
its time to communication, while the sends by the worker threads return
as fast as possible. Would it be better for the receiving thread to do
one large wait instead of repeatedly testing different sets of
requests, or would that acquire some lock and then block the threads
trying to post a send?

I've looked around for info on how to best structure multithreaded MPI
code, but haven't had much luck in finding anything.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell
PowerEdge C6100 running linux kernel 2.6.18-194.32.1.el5, using Intel
12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.


ompi_info.gz
Description: GNU Zip compressed data


Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications

2011-11-30 Thread Patrik Jonsson
Replying to my own post, I'd like to add some info:

After making the master thread put more of a premium on receiving the
missing messages, the problem went away. Both tasks now appear to keep
up with the messages sent by the other. However, after about a minute
and ~1.5e6 messages exchanged, both tasks segfault after printing the
following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress
read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the
default of a switch on fragment type. There's a comment there that
says:

* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.

That seems worrisome, like whoever wrote the code didn't know what was
going on... I've gotten that error previously, but only when millions
of outstanding messages had built up. Now, that's not the case.

Does anyone have any idea what could be going on here?

Thanks,

/Patrik J.


Re: [OMPI users] Asymmetric performance with nonblocking, multithreaded communications

2011-12-09 Thread Patrik Jonsson
Hi Yiannis,

On Fri, Dec 9, 2011 at 10:21 AM, Yiannis Papadopoulos wrote:
> Patrik Jonsson wrote:
>>
>> Hi all,
>>
>> I'm seeing performance issues I don't understand in my multithreaded
>> MPI code, and I was hoping someone could shed some light on this.
>>
>> The code structure is as follows: A computational domain is decomposed
>> into MPI tasks. Each MPI task has a "master thread" that receives
>> messages from the other tasks and puts those into a local, concurrent
>> queue. The tasks then have a few "worker threads" that processes the
>> incoming messages and when necessary sends them to other tasks. So for
>> each task, there is one thread doing receives and N (typically number
>> of cores-1) threads doing sends. All messages are nonblocking, so the
>> workers just post the sends and continue with computation, and the
>> master repeatedly does a number of test calls to check for incoming
>> messages (there are different flavors of these messages so it does
>> several tests).
>
> When do you do the MPI_Test on the Isends? I have had performance issues on
> a number of systems when I used a single queue of MPI_Requests that held
> Isends to different ranks and tested them one by one. It appears that
> some messages are sent out more efficiently if you test them.

There are 3 classes of messages that may arrive. The requests for each
are in a vector, and I use boost::mpi::test_some (which I assume just
calls MPI_Testsome) to test them in a round-robin fashion.

>
> I found that either using MPI_Testsome, or having a map (key=rank,
> value=queue of MPI_Requests) and testing the first MPI_Request for each
> key, resolved this issue.

In my case, I know that the overwhelming majority of the traffic is one
kind of message. What I ended up doing was to simply repeat the test for
that message immediately if the preceding test succeeded, up to 1000
times, before again checking the other requests. This appears to enable
the task to keep up with the incoming traffic.
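(Roughly like the sketch below, using a plain int payload for
illustration; handle() and the source/tag arguments stand in for the
real code.)

#include <boost/mpi.hpp>
#include <boost/optional.hpp>

void handle(int msg);   // placeholder for the real message processing

// Drain the high-volume request class before going back to the other tests.
void drain_hot_request(boost::mpi::communicator& world,
                       boost::mpi::request& hot_request,
                       int& buf, int source, int tag)
{
    for (int n = 0; n < 1000; ++n) {
        boost::optional<boost::mpi::status> s = hot_request.test();
        if (!s)
            break;                                    // nothing pending right now
        handle(buf);                                  // process the message
        hot_request = world.irecv(source, tag, buf);  // re-post the receive
    }
    // ...then fall through to the round-robin tests of the rarer requests.
}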

I guess another possibility would be to have several slots for the
incoming messages. Right now I only post one irecv per source task. By
posting a couple, more messages could arrive and find a matching recv
already posted, and one test could match more of them. Since that makes
the logic more complicated, I didn't try it.
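(Something like the following sketch, again with an int payload and
made-up names:)

#include <boost/mpi.hpp>
#include <vector>

// Keep several pre-posted irecvs per source so bursts of messages find a
// matching receive instead of piling up as unexpected messages.
const int SLOTS = 4;

void post_receive_slots(boost::mpi::communicator& world, int source, int tag,
                        std::vector<int>& bufs,
                        std::vector<boost::mpi::request>& reqs)
{
    bufs.resize(SLOTS);
    for (int s = 0; s < SLOTS; ++s)
        reqs.push_back(world.irecv(source, tag, bufs[s]));
    // A single boost::mpi::test_some() over 'reqs' can then complete
    // several messages per call.
}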


[OMPI users] mca_btl_sm_component_progress read an unknown type of header

2011-12-09 Thread Patrik Jonsson
Hi all,

This question was buried in an earlier question, and I got no replies,
so I'll try reposting it with a more enticing subject.

I have a multithreaded Open MPI code where each task has N+1 threads:
the N threads send nonblocking messages that are received by the one
remaining thread on the other tasks. When I run this code with 2 tasks
and 5+1 threads on a single node with 12 cores, after about a million
messages have been exchanged the tasks segfault after printing the
following error:

[sunrise01.rc.fas.harvard.edu:10009] mca_btl_sm_component_progress
read an unknown type of header

The debugger spits me out on line 674 of btl_sm_component.c, in the
default of a switch on fragment type. There's a comment there that
says:

* This code path should presumably never be called.
* It's unclear if it should exist or, if so, how it should be written.
* If we want to return it to the sending process,
* we have to figure out who the sender is.
* It seems we need to subtract the mask bits.
* Then, hopefully this is an sm header that has an smp_rank field.
* Presumably that means the received header was relative.
* Or, maybe this code should just be removed.

It seems like whoever wrote that code wasn't quite sure what was going
on, and I guess one of those assumptions is wrong, because dereferencing
the result seems to be what's causing the segfault. Does anyone here know
what could cause this error? If I run the code with the tcp btl
instead of sm, it runs fine, albeit with a bit lower performance.

This is with OpenMPI 1.5.3 using MPI_THREAD_MULTIPLE on a Dell
PowerEdge C6100 running linux kernel 2.6.18-194.32.1.el5, using Intel
12.3.174. I've attached the ompi_info output.

Thanks,

/Patrik J.


ompi_info.gz
Description: GNU Zip compressed data


[OMPI users] invalid write in opal_generic_simple_unpack

2012-03-14 Thread Patrik Jonsson
Hi,

I'm trying to track down a spurious segmentation fault that I'm
getting with my MPI application. I tried using valgrind, and after
suppressing the 25,000 errors in PMPI_Init_thread and associated
Init/Finalize functions, I'm left with an uninitialized write in
PMPI_Isend (which I saw is not unexpected), plus this:

==11541== Thread 1:
==11541== Invalid write of size 1
==11541==at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)
==11541==by 0x5093447: opal_generic_simple_unpack
(opal_datatype_unpack.c:420)
==11541==by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
==11541==by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match
(pml_ob1_recvfrag.c:217)
==11541==by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler
(btl_tcp_endpoint.c:718)
==11541==by 0x509644F: opal_event_loop (event.c:766)
==11541==by 0x507FA50: opal_progress (opal_progress.c:189)
==11541==by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
==11541==by 0x4EB8077: PMPI_Test (ptest.c:61)
==11541==by 0x78C4339: boost::mpi::request::test() (in
/n/home00/pjonsson/lib/libboost_mpi.so.1.48.0)
==11541==by 0x4B5DA3:
mcrx::mpi_master::process_handshakes()
(mpi_master_impl.h:216)
==11541==by 0x4B5557: mcrx::mpi_master::run()
(mpi_master_impl.h:541)
==11541==  Address 0x7feffb327 is just below the stack ptr.  To
suppress, use: --workaround-gcc296-bugs=yes

The test in question checks for a single int being sent between the
tasks. This is done using the Boost.MPI skeleton/content mechanism,
and the receive is done into an element of a std::vector, so there's no
reason it should unpack anywhere near the stack pointer. Also, an int
should be of size 4, not a single byte.

This looks suspicious given that the segfault would usually happen in
one of the calls to PMPI_Test. If somehow the data is unpacked to
somewhere around the stack pointer, that certainly seems like a
possible cause.

If anyone can give me some ideas for what could cause this and how to
track it down, I'd appreciate it. I'm running out of ideas here.

Regards,

/Patrik J.


Re: [OMPI users] invalid write in opal_generic_simple_unpack

2012-03-14 Thread Patrik Jonsson
On Wed, Mar 14, 2012 at 3:43 PM, Jeffrey Squyres wrote:
> On Mar 14, 2012, at 9:38 AM, Patrik Jonsson wrote:
>
>> I'm trying to track down a spurious segmentation fault that I'm
>> getting with my MPI application. I tried using valgrind, and after
>> suppressing the 25,000 errors in PMPI_Init_thread and associated
>> Init/Finalize functions,
>
> I haven't looked at these in a while, but the last time I looked, many/most 
> of them came from one of several sources:
>
> - OS-bypass network mechanisms (i.e., the memory is ok, but valgrind isn't 
> aware of it)
> - weird optimizations from the compiler (particularly from non-gcc compilers)
> - weird optimizations in glib or other support libraries
> - Open MPI sometimes specifically has "holes" of uninitialized data that we 
> memcpy (long story short: it can be faster to copy a large region that 
> contains a hole rather than doing 2 memcopies of the fully-initialized 
> regions)
>
> Other than what you cited below, are you seeing others?  What version of Open 
> MPI is this?  Did you --enable-valgrind when you configured Open MPI?  This 
> can reduce a bunch of these kinds of warnings.

I didn't install OpenMPI myself, but I doubt it was configured with this.

>
>> I'm left with an uninitialized write in
>> PMPI_Isend (which I saw is not unexpected), plus this:
>>
>> ==11541== Thread 1:
>> ==11541== Invalid write of size 1
>> ==11541==    at 0x4A09C9F: _intel_fast_memcpy (mc_replace_strmem.c:650)
>
> That doesn't seem right.  It's an *invalid* write, not an *uninitialized* 
> access.  Could be serious.
>
>> ==11541==    by 0x5093447: opal_generic_simple_unpack
>> (opal_datatype_unpack.c:420)
>> ==11541==    by 0x508D642: opal_convertor_unpack (opal_convertor.c:302)
>> ==11541==    by 0x4F8FD1A: mca_pml_ob1_recv_frag_callback_match
>> (pml_ob1_recvfrag.c:217)
>> ==11541==    by 0x4ED51BD: mca_btl_tcp_endpoint_recv_handler
>> (btl_tcp_endpoint.c:718)
>> ==11541==    by 0x509644F: opal_event_loop (event.c:766)
>> ==11541==    by 0x507FA50: opal_progress (opal_progress.c:189)
>> ==11541==    by 0x4E95AFE: ompi_request_default_test (req_test.c:88)
>> ==11541==    by 0x4EB8077: PMPI_Test (ptest.c:61)
>> ==11541==    by 0x78C4339: boost::mpi::request::test() (in
>> /n/home00/pjonsson/lib/libboost_mpi.so.1
>> .48.0)
>
> It looks like this is happening in the TCP receive handler; it received some 
> data from a TCP socket and is trying to copy it to the final, MPI-specified 
> receive buffer.
>
> If you can attach the debugger here, per chance, it might be useful to verify 
> that OMPI is copying to the target buffer that was assumedly specified in a 
> prior call to MPI_IRECV (and also double check that this buffer is still 
> valid).

The problem was that there were many sends and this error was
spurious, so it was hard to know whether I stopped in the right
unpack.

I think I tracked it down, though. The problem was in the boost.mpi
"skeleton/content" feature (which has bitten me in the past).
Essentially, any serialization function that routes data through a
temporary will silently give incorrect results when using
skeleton/content, because get_content captures the address of the
temporary when building the custom MPI datatype, which then causes the
received data to be deposited in some invalid location.

There is scant documentation of this feature and the above conclusion
is my own, but I'm pretty sure it's correct. Even the built-in boost
serializations aren't safe. Serializing an enum, for example, uses a
temporary and will thus not work correctly with these operators.
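(A made-up illustration of the pattern, not my actual code; Job and
Mode are invented names:)

enum Mode { FAST, SLOW };

struct Job {
    Mode mode_;

    // Under get_content(), the derived MPI datatype records the *address*
    // of whatever is handed to the archive. Here that is the stack
    // temporary 'tmp', so a later transfer of the content reads/writes a
    // dead stack location instead of mode_.
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        int tmp = static_cast<int>(mode_);   // temporary on the stack
        ar & tmp;                            // its address ends up in the datatype
        mode_ = static_cast<Mode>(tmp);
    }
};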

> Is there any chance that you can provide a small reproducer in C without all 
> the Boost stuff?

As is clear from the above, no. The problem was in my code and boost.

I do have a more general question, though: is there a good way to back
out the location of the request object if I stop deep in the bowels of
MPI? As I understand it, just because the user-level call is a test of
a particular request doesn't mean that under the hood MPI isn't working
on other requests, and this nonlocality makes it difficult to track
down errors.

Thanks,

/Patrik