Hi,

I am currently working on a parallel app that shows some
issues using MX/BTL (not MTL) with the current trunk version
of OpenMPI.

Basically, for its communication the app needs to do a lot
of random MPI_Isend()s of up to 8 KB, which are polled away
on the receiving side using MPI_Iprobe() and MPI_Recv().  The
asynchronous send requests are put into a ring, currently 64
entries, from which they are eventually MPI_Wait()ed.
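
For reference, the pattern looks roughly like the sketch below.
This is a simplified illustration, not the actual app code;
buffer handling, tags and error checking are made up for the
example.

    #include <string.h>
    #include <mpi.h>

    #define RING_SIZE 64
    #define MAX_MSG   8192

    /* ring[] entries are set to MPI_REQUEST_NULL at startup */
    static MPI_Request ring[RING_SIZE];
    static char sendbuf[RING_SIZE][MAX_MSG];
    static int slot = 0;

    /* Sender side: issue an async send, recycling ring slots.
       When the ring wraps around, the oldest request is waited
       for first so that its buffer can be reused. */
    static void send_msg(const void *data, int len, int dest, int tag)
    {
        if (ring[slot] != MPI_REQUEST_NULL) {
            MPI_Wait(&ring[slot], MPI_STATUS_IGNORE);
        }
        memcpy(sendbuf[slot], data, len);
        MPI_Isend(sendbuf[slot], len, MPI_BYTE, dest, tag,
                  MPI_COMM_WORLD, &ring[slot]);
        slot = (slot + 1) % RING_SIZE;
    }

    /* Receiver side: poll away any pending messages. */
    static void poll_msgs(void)
    {
        int flag;
        MPI_Status status;
        char recvbuf[MAX_MSG];

        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                   &flag, &status);
        while (flag) {
            int len;
            MPI_Get_count(&status, MPI_BYTE, &len);
            MPI_Recv(recvbuf, len, MPI_BYTE, status.MPI_SOURCE,
                     status.MPI_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* ... process the message ... */
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &flag, &status);
        }
    }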

The thing is, this works perfectly fine using OpenMPI with
both TCP and MX/MTL, but given a sufficiently large number of
CPUs (currently close to 96), the app hangs quite reproducibly
at some point during the run when using the trunk's MX/BTL
implementation.

[As an aside, the reason for using the BTL here is that I am
actually interested in experimenting with the app across
multiple clusters, in mixed MX+TCP mode, which has recently
become possible using the BTL.  The issue also pops up in
that mixed version.]
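
[For the mixed runs, both BTLs are simply enabled at run time,
along the lines of

    mpirun --mca pml ob1 --mca btl mx,tcp,self ...

with the actual host setup omitted here.]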

As the same issue did not occur with the released 1.2 versions
of OpenMPI, I started to do some digging through the trunk
revisions.  Since I had no clue where to begin, I basically did
a binary search of the revisions between the low 12000s and the
most recent one.  It turned out that the issue starts to arise
at revision 12931, where (amongst other changes)
mca_btl_mx_module.super.btl_eager_limit and
mca_btl_mx_module.super.btl_min_send_size were lowered to 4K.
If I change these back to their original values (just below 16K
and 32K respectively), the problem goes away (both in r12931
and in the most recent revisions).  Given the maximum message
size of 8K used by the app, these limits indeed influence the
low-level communication behavior.
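
(For these tests I changed the values directly in the source;
I assume the same effect can be obtained at run time via the
corresponding MCA parameters, if they are still called
btl_mx_eager_limit and btl_mx_min_send_size, e.g.

    mpirun --mca btl_mx_eager_limit 16384 \
           --mca btl_mx_min_send_size 32768 ...

where the values above are only illustrative.)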

I am sure some OpenMPI developer knows what to do with the
above :-)  If you need more feedback from me, or want me to
try alternative options or configurations, just let me know.

Regards,
Kees Verstoep
