Mike -

In Open MPI 1.2, one-sided is implemented over point-to-point, so I would expect it to be slower. This may or may not be addressed in a future version of Open MPI (I would guess so, but don't want to commit to it). Were you using multiple threads? If so, how?
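
(If threads are in play, one quick thing worth checking is the thread
level the library actually gives you at init time. A minimal sketch of
what I mean, not code from your application:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multi-threaded support and see what we actually got;
     * with anything less, calling MPI from more than one thread at a
     * time is not safe. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        printf("MPI_THREAD_MULTIPLE not provided (got level %d)\n", provided);
    }

    MPI_Finalize();
    return 0;
}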

On the good news front, I think your call stack looked similar to what I was seeing, so hopefully I can make some progress on a real solution.

Brian

On Mar 20, 2007, at 8:54 PM, Mike Houston wrote:

Well, I've managed to get a working solution, but I'm not sure how I got
there.  I built a test case that looked like a nice simple version of
what I was trying to do and it worked, so I moved the test code into my
implementation and, lo and behold, it works.  I must have been doing
something a little funky in the original pass, likely causing a stack
smash somewhere or trying to do a get/put out of bounds.

If I have any more problems, I'll let y'all know.  I've tested pretty
heavy usage up to 128 MPI processes across 16 nodes and things seem to
be behaving.  I did notice that single-sided transfers seem to be a
little slower than explicit send/recv, at least on GigE. Once I do some
more testing, I'll bring things up on IB and see how things are going.

-Mike

Mike Houston wrote:
Brian Barrett wrote:

On Mar 20, 2007, at 3:15 PM, Mike Houston wrote:



If I only do gets/puts, things seem to be working correctly with version
1.2. However, if I have a posted Irecv on the target node and issue an
MPI_Get against that target, MPI_Test on the posted Irecv causes a
segfault.

Anyone have suggestions? Sadly, I need to have Irecvs posted. I'll
attempt to find a workaround, but it looks like the posted Irecv is
getting all the data of the MPI_Get from the other node.  It's like the
message tagging is getting ignored.  I've never tried posting two
different Irecvs with different message tags either...


Hi Mike -

I've spent some time this afternoon looking at the problem and have
some ideas on what could be happening.  I don't think it's a data
mismatch (the data intended for the IRecv getting delivered to the
Get), but more a problem with the call to MPI_Test perturbing the
progress flow of the one-sided engine.  I can see one or two places
where it's possible this could happen, although I'm having trouble
replicating the problem with any test case I can write.  Is it
possible for you to share the code causing the problem (or some small
test case)?  It would make me feel considerably better if I could
really understand the conditions required to end up in a seg fault
state.

Thanks,

Brian


Well, I can give you a Linux x86 binary if that would do it. The code is
huge as it's part of a much larger system, so there is no such thing as a
simple test case at the moment, and the code is in pieces and largely
unrunnable now with all the hacking...

I basically have one thread spinning on an MPI_Test on a posted Irecv
while it is being used as the target of the MPI_Get. I'll see if I can
hack together a simple version that breaks late tonight.  I've just
played with posting a send to that Irecv, issuing the MPI_Get,
handshaking, and then posting another Irecv, and the MPI_Test continues
to eat it, but now in a memcpy:

#0  0x001c068c in memcpy () from /lib/libc.so.6
#1  0x00e412d9 in ompi_convertor_pack (pConv=0x83c1198, iov=0xa0,
out_size=0xaffc1fd8, max_data=0xaffc1fdc) at convertor.c:254
#2  0x00ea265d in ompi_osc_pt2pt_replyreq_send (module=0x856e668,
replyreq=0x83c1180) at osc_pt2pt_data_move.c:411
#3  0x00ea0ebe in ompi_osc_pt2pt_component_fragment_cb
(pt2pt_buffer=0x8573380) at osc_pt2pt_component.c:582
#4 0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#5  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#6  0x00ea59e5 in ompi_osc_pt2pt_passive_unlock (module=0x856e668,
origin=1, count=1) at osc_pt2pt_sync.c:60
#7  0x00ea0cd2 in ompi_osc_pt2pt_component_fragment_cb
(pt2pt_buffer=0x856f300) at osc_pt2pt_component.c:688
#8 0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#9  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#10 0x00e33f05 in ompi_request_test (rptr=0xaffc2430,
completed=0xaffc2434, status=0xaffc23fc) at request/req_test.c:82
#11 0x00e61770 in PMPI_Test (request=0xaffc2430, completed=0xaffc2434,
status=0xaffc23fc) at ptest.c:52
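
Roughly, the pattern looks like the sketch below. This is not the real
code (the buffer sizes, tags, and the passive-target lock/unlock epoch
are placeholders I'm guessing at), but it is the shape of what I'm doing:

/* Hypothetical two-rank reproducer sketch (not the original application).
 * Rank 0 spins on MPI_Test over a posted MPI_Irecv while rank 1 does a
 * passive-target MPI_Get against rank 0's window, then sends a handshake
 * message so the Irecv can complete.  Sizes, tags, and the lock/unlock
 * epoch are illustrative assumptions. */
#include <mpi.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, flag = 0;
    double winbuf[N], getbuf[N], recvbuf[N];
    MPI_Win win;
    MPI_Request req;

    MPI_Init(&argc, &argv);                 /* run with at least 2 ranks */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(winbuf, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* Post the receive first, then poll it while the remote Get is in
         * flight; MPI_Test drives the progress engine, which is where the
         * osc_pt2pt crash shows up for me. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD, &req);
        while (!flag) {
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }
    } else if (rank == 1) {
        /* Passive-target epoch: lock rank 0's window, Get, unlock. */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Get(getbuf, N, MPI_DOUBLE, 0, 0, N, MPI_DOUBLE, win);
        MPI_Win_unlock(0, win);

        /* Handshake send so rank 0's Irecv eventually completes. */
        MPI_Send(getbuf, N, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}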

-Mike