Kris,

Using MX_CSUM should _not_ make a difference by itself. But it requires the debug library, which may alter the timing enough to avoid a race (in MX, OMPI, or the application).

Correct, if you use the MTL then all messages are handled by MX (internode, shared memory and self).

Scott

On Jul 3, 2009, at 7:41 AM, 8mj6tc...@sneakemail.com wrote:

Scott,

Thanks for your advice! Good to know about the checksum debug
functionality! Strangely enough, running with either "MX_CSUM=1" or
"-mca pml cm" allows Murasaki to work normally, and makes the test case I
attached in my previous mail work. Very suspicious, but at least this
does give me a functional solution (however, if I understand OpenMPI
correctly, I shouldn't be able to use the CM PML over a network where
some nodes have MX and some don't, correct?).

Scott Atchley atchley-at-myri.com |openmpi-users/Allow| wrote:
Hi Kris,

I have not run your code yet, but I will try to this weekend.

You can have MX checksum its messages if you set MX_CSUM=1 and use the
MX debug library (e.g. set LD_LIBRARY_PATH to /opt/mx/lib/debug).

Do you have the problem if you use the MX MTL? To test it, modify your
mpirun command as follows:

$ mpirun -mca pml cm ...

and do not specify any BTL info.

Scott

On Jul 2, 2009, at 6:05 PM, 8mj6tc...@sneakemail.com wrote:

Hi. I've now spent many, many hours tracking down a bug that was causing
my program to die, as though either its memory were getting corrupted or
messages were getting clobbered while going through the network; I
couldn't tell which. I really wish the checksum flag on btl_mx_flags
were working. But anyway, I think I've managed to recreate the core of
the problem in a small-ish test case which I've attached
(verifycontent.cc). This usually segfaults at MPI_Issend after sending
about 60-90 messages for me while using OpenMPI 1.3.2 with Myricom's
mx-1.2.9 drivers on Linux using gcc 4.3.2. Disabling the mx btl (mpirun
-mca btl ^mx) makes it work (likewise for my own larger project,
Murasaki). The version using MPI_Ssend (verifycontent-ssend.cc) also
works with no problem over mx.

So I suspect the issue lies in OpenMPI 1.3.2's handling of MPI_Issend
over mx, but it's also possible I've horribly misunderstood something
fundamental about MPI and it's just my fault, so if that's the case,
please let me know (but both this test case and Murasaki work over
mpichmx, so OpenMPI is definitely doing something different).

Here's a brief description of verifycontent.cc to make reading it easier:
* given -np=N, half the nodes will be sending, half will be receiving
some number of messages (reps)
* each message consists of buflen (5000) chars, set to some value based
on the sending node's rank and the sequence number of the message
* the receiving node starts an irecv for each sending node, tests each
request until a message arrives
* the receiver then checks the contents of the message to make sure it
matches what was supposed to be in there (this is actually where my real
project, Murasaki, fails. I can't seem to replicate that, however).
* the senders meanwhile keep sending messages and dequeuing them when
their request tests as completed.
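
For readers without the attachment, the fill-and-check step in the list
above can be sketched roughly as follows. This is only an illustration,
not the code from verifycontent.cc: the byte pattern in expected_byte and
the helper names fill_message and first_mismatch are assumptions made up
for this sketch, and the MPI calls (Issend/Irecv/Test) are left out
entirely.

```cpp
#include <cstddef>
#include <vector>

// Message size taken from the description above (buflen = 5000 chars).
constexpr std::size_t buflen = 5000;

// Hypothetical pattern: the description only says the value depends on the
// sender's rank and the message's sequence number, so this formula is made up.
inline char expected_byte(int rank, int seq, std::size_t i) {
    return static_cast<char>((rank * 31 + seq * 7 + static_cast<int>(i)) & 0x7f);
}

// Sender side: build the payload for message number `seq` from `rank`.
std::vector<char> fill_message(int rank, int seq) {
    std::vector<char> buf(buflen);
    for (std::size_t i = 0; i < buf.size(); ++i)
        buf[i] = expected_byte(rank, seq, i);
    return buf;
}

// Receiver side: return the offset of the first wrong byte, or -1 if the
// payload matches what `rank` should have sent as message `seq`.
// A buffer of the wrong length is reported as a mismatch at offset 0.
long first_mismatch(const std::vector<char>& buf, int rank, int seq) {
    if (buf.size() != buflen) return 0;
    for (std::size_t i = 0; i < buf.size(); ++i)
        if (buf[i] != expected_byte(rank, seq, i)) return static_cast<long>(i);
    return -1;
}
```

A receiver would call something like first_mismatch on each completed
buffer and print the returned offset; reporting the offset rather than a
plain pass/fail is what makes it possible to see where in a message the
contents stop matching.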

The current Subversion trunk version, 1.4a1r21594, seems to pass my
test case, but it also tends to show errors like
"mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on
startup, and Murasaki still fails (messages turn into zeros about 132KB
in), so something still isn't right...

If anyone has any ideas about this test case failing, or my larger issue
of messages turning into zeros after 132KB (though sadly sometimes it
isn't at 132KB, but straight from 0KB, which is very confusing) while on
MX, I'd greatly appreciate it. Even a simple confirmation of "Yes,
MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.
--
Kris Popendorf

Keio University
http://murasaki................... <- (Probably too cumbersome to expect
most people to test, but if you feel daring, try putting in some
Human/Mouse chromosomes over MX)
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
--Kris

叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]