Hi. I've now spent many hours tracking down a bug that was causing my
program to die as though either its memory were getting corrupted or
messages were getting clobbered while going through the network; I
couldn't tell which. I really wish the checksum flag on btl_mx_flags
were working. Anyway, I think I've managed to recreate the core of the
problem in a small-ish test case, which I've attached
(verifycontent.cc). For me it usually segfaults at MPI_Issend after
sending about 60-90 messages, using OpenMPI 1.3.2 with Myricom's
mx-1.2.9 drivers on Linux with gcc 4.3.2. Disabling the mx btl
(mpirun -mca btl ^mx) makes it work, and the same goes for my own
larger project (Murasaki). The version that uses MPI_Ssend
(verifycontent-ssend.cc) also works without problems over mx. So I
suspect the issue lies in OpenMPI 1.3.2's handling of MPI_Issend over
mx. It's also possible that I've horribly misunderstood something
fundamental about MPI and it's just my fault; if that's the case,
please let me know. However, both this test case and Murasaki work
over mpichmx, so OpenMPI is definitely doing something different.
Here's a brief description of verifycontent.cc to make reading it
easier:
* given -np N, half the nodes will be sending and half will be
  receiving some number of messages (reps)
* each message consists of buflen (5000) chars, set to some value based
  on the sending node's rank and the sequence number of the message
* the receiving node starts an irecv for each sending node and tests
  each request until a message arrives
* the receiver then checks the contents of the message to make sure it
  matches what was supposed to be in there (this is actually where my
  real project, Murasaki, fails; I can't seem to replicate that in the
  test case, however)
* the senders meanwhile keep sending messages and dequeuing them when
  their request tests as completed
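For anyone who would rather not open the attachment, the fill and
check steps described above can be sketched roughly as follows. This
is only an illustration with no MPI calls in it: the names
expectedChar, fillMessage and firstMismatch, and the exact byte
formula, are made up for this sketch and are not necessarily what
verifycontent.cc does. The one deliberate choice is that the pattern
never produces a zero byte, so a message that "turns into zeros" is
always reported as a mismatch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pattern: a byte derived from the sender's rank, the
// message sequence number and the offset. Values are always in 1..127,
// so a zeroed region can never pass as valid data.
inline char expectedChar(int rank, int seq, std::size_t i) {
    unsigned v = static_cast<unsigned>(rank) * 31u
               + static_cast<unsigned>(seq) * 7u
               + static_cast<unsigned>(i % 251u);
    return static_cast<char>(v % 127u + 1u);
}

// Sender side: fill the whole buffer before handing it to the send call.
inline void fillMessage(std::vector<char>& buf, int rank, int seq) {
    assert(rank >= 0 && seq >= 0);
    for (std::size_t i = 0; i < buf.size(); ++i) {
        buf[i] = expectedChar(rank, seq, i);
    }
}

// Receiver side: return the offset of the first wrong byte, or -1 if
// the buffer matches what the given sender should have sent.
inline long firstMismatch(const std::vector<char>& buf, int rank, int seq) {
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != expectedChar(rank, seq, i)) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

In a test like this, the sender would call fillMessage on a
5000-char buffer just before each send, and the receiver would call
firstMismatch once its request tests as completed. Printing the
returned offset shows where in the message the corruption starts.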
I also tried the current subversion trunk version, 1.4a1r21594. It
seems to pass my test case, but it also tends to show errors like
"mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on
start up, and Murasaki still fails (messages turn into zeros about
132KB in), so something still isn't right...
If anyone has any ideas about why this test case fails, or about my
larger issue of messages turning into zeros after 132KB while on MX
(sadly, sometimes it isn't at 132KB but straight from 0KB, which is
very confusing), I'd greatly appreciate it. Even a simple confirmation
of "Yes, MPI_Issend/Irecv with MX has issues in 1.3.2" would help my
sanity.
--
Kris Popendorf
Keio University
http://murasaki................... <- (Probably too cumbersome to
expect most people to test, but if you feel daring, try putting in
some Human/Mouse chromosomes over MX)