On Mar 1, 2012, at 1:17 AM, Jingcha Joba wrote:

> Aah...
> So when Open MPI is compiled with OFED and run on InfiniBand/RoCE devices, 
> MPI would simply direct point-to-point calls to OFED, in the OFED way?

I'm not quite sure how to parse that.  :-)

The openib BTL uses verbs functions to effect data transfers between MPI 
process peers.  The BTL is one of the lower layers in Open MPI for 
point-to-point communication; BTL plugins handle the device-specific 
transport work for MPI_SEND, MPI_RECV, MPI_PUT, etc.  Hence, when you run 
with the openib BTL and call MPI_SEND (presumably to a peer that is 
reachable via an OpenFabrics device), the openib BTL will eventually be 
called to actually send the message.  The openib BTL sends the message to 
the peer via some combination of verbs function calls.
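For a flavor of what "verbs function calls" means, here's a minimal, 
hypothetical sketch of posting one send work request on an 
already-connected queue pair.  The helper name and its simplifications 
are mine; the real openib BTL also handles fragmentation, flow-control 
credits, completion processing, and much more:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Hypothetical helper: post one send on an existing, connected QP.
     * "mr" must come from a prior ibv_reg_mr() covering "buf". */
    static int post_send(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uintptr_t) buf,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED, /* ask for a completion */
        };
        struct ibv_send_wr *bad_wr = NULL;

        return ibv_post_send(qp, &wr, &bad_wr); /* 0 on success */
    }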

Mellanox has also introduced a library called "MXM" that can be used for 
the underlying MPI message transport (as opposed to the openib BTL).  See 
the Open MPI README for some explanation of the different transports that 
Open MPI can use (specifically: "ob1" vs. "cm").
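To make that concrete, transport selection is done with MCA parameters on 
the mpirun command line; something along these lines (exact component 
names and availability vary by Open MPI release, so check your README):

    # ob1 PML + openib BTL (verbs) for off-node traffic:
    mpirun --mca pml ob1 --mca btl openib,sm,self -np 4 ./my_mpi_app

    # cm PML + MXM MTL instead:
    mpirun --mca pml cm --mca mtl mxm -np 4 ./my_mpi_app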

> > More specifically: all things being equal, you don't care which is used.  
> > You just want your message to get to the receiver/target as fast as 
> > possible.  One of the main ideas of MPI is to hide those kinds of details 
> > from the user.  I.e., you call MPI_SEND.  A miracle occurs.  The message is 
> > received on the other side.
> 
> True.  It's just that I am digging into the OFED source code and the 
> ompi source code, and trying to understand the way these two interact.

The openib BTL is probably one of the most complex sections of Open MPI, 
unfortunately.  :-\  The verbs API is *quite* complex, and has many 
options that do not work on all types of OpenFabrics hardware.  This 
leads to many different blocks of code, not all of which are executed on 
all platforms.  The verbs model of registering memory also causes a lot 
of complications: for performance reasons, MPI has to cache memory 
registrations and interpose itself in the memory subsystem to catch when 
registered memory is freed (see the README for some details here).
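To illustrate why registration caching matters, here's a deliberately 
naive, hypothetical sketch (the struct and function names are made up; 
Open MPI's real mpool/rcache code is far more involved, and also 
intercepts free()/munmap() so stale entries can be evicted):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Hypothetical cache entry -- illustrative only. */
    struct reg_entry {
        void             *base;
        size_t            len;
        struct ibv_mr    *mr;
        struct reg_entry *next;
    };

    static struct reg_entry *cache_head;

    /* Return a cached registration covering [buf, buf+len), or create
     * one.  ibv_reg_mr() pins pages and is expensive, which is exactly
     * why MPI implementations cache registrations. */
    static struct ibv_mr *get_mr(struct ibv_pd *pd, void *buf, size_t len)
    {
        for (struct reg_entry *e = cache_head; e != NULL; e = e->next)
            if ((char *) buf >= (char *) e->base &&
                (char *) buf + len <= (char *) e->base + e->len)
                return e->mr;                 /* cache hit: no pinning */

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE);
        if (mr == NULL)
            return NULL;

        struct reg_entry *e = malloc(sizeof(*e));
        if (e == NULL) {
            ibv_dereg_mr(mr);
            return NULL;
        }
        e->base = buf;  e->len = len;  e->mr = mr;
        e->next = cache_head;
        cache_head = e;
        return mr;
    }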

If you have any specific questions about the implementation, post over on the 
devel list.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

