There are two distinct layers of software being discussed here:

- the PML (basically the back-end to MPI_SEND and friends)
- the BTL (byte transfer layer -- the back-end bit movers for the ob1 and dr PMLs; this distinction is important because nothing in the PML design forces the use of BTLs. Indeed, there is at least one current PML, the cm PML, that does not use BTLs as its back-end bit mover)

The ob1 and dr PMLs know nothing about how the back-end bit movers (the BTL components) work -- the BTLs are given considerable freedom to operate within their specific interface contracts.
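To make that division of labor concrete, here is a minimal sketch under invented names (btl_module_t, send_bytes, and pml_send are not the real mca_pml / mca_btl symbols): the PML decides which module carries the bytes, and a module only knows how to push bytes over its own endpoint.

/* Invented types for illustration -- these are not Open MPI's real
   mca_pml / mca_btl interfaces. */
#include <stdio.h>
#include <stddef.h>

/* One BTL module: an "instance" of a byte mover bound to one endpoint. */
typedef struct btl_module {
    const char *endpoint;
    int (*send_bytes)(struct btl_module *self, const void *buf, size_t len);
} btl_module_t;

static int dummy_send(btl_module_t *self, const void *buf, size_t len)
{
    (void)buf;
    printf("sending %zu bytes via %s\n", len, self->endpoint);
    return 0;
}

/* The PML (the layer behind MPI_Send) only picks a module; it never
   looks at how that module actually moves the bytes. */
static int pml_send(btl_module_t **btls, int nbtls,
                    const void *buf, size_t len)
{
    (void)nbtls;                          /* trivial policy: first module */
    return btls[0]->send_bytes(btls[0], buf, len);
}

int main(void)
{
    btl_module_t tcp_eth0 = { "tcp:eth0", dummy_send };
    btl_module_t *btls[] = { &tcp_eth0 };
    char payload[256] = { 0 };
    return pml_send(btls, 1, payload, sizeof payload);
}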

Generally, ob1/dr queries each BTL component when Open MPI starts up. Each BTL responds with whether it wants to run or not. If it does, it gives back one or more modules (think of a module as an "instance" of a component). Typically, these modules correspond to individual NICs / HCAs / network endpoints. For example, if you have 2 ethernet cards, the tcp BTL will create and return 2 modules. ob1/dr will treat these as two paths over which to send data (reachability is computed as well, of course -- ob1/dr will only send data down BTLs through which the target peer is reachable). In general, ob1/dr will round-robin across all available BTL modules when sending large messages (as Gleb has described). See http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/dr protocols.
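As a rough sketch of that round-robin striping idea (this is not ob1's actual scheduling code; btl_module_t and stripe_message are invented names), a large message gets broken into fragments that are dealt out across the available modules:

/* Conceptual sketch only -- not ob1's real scheduling code.  The
   names btl_module_t and stripe_message are invented here. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;        /* e.g. "tcp:eth0" or "openib:port1-lid0" */
} btl_module_t;

/* Deal the fragments of a large message out across all reachable BTL
   modules in round-robin order. */
static void stripe_message(btl_module_t *btls, int nbtls,
                           size_t msg_len, size_t frag_len)
{
    size_t sent = 0;
    int next = 0;
    while (sent < msg_len) {
        size_t chunk = (msg_len - sent < frag_len) ? (msg_len - sent)
                                                   : frag_len;
        printf("fragment of %zu bytes -> %s\n", chunk, btls[next].name);
        sent += chunk;
        next = (next + 1) % nbtls;    /* round-robin to the next module */
    }
}

int main(void)
{
    /* Two modules -- two ethernet NICs, or two LIDs on one HCA port;
       ob1/dr treats both cases the same way. */
    btl_module_t btls[] = { { "module-0" }, { "module-1" } };
    stripe_message(btls, 2, 1u << 20, 64u * 1024);  /* 1 MB in 64 KB fragments */
    return 0;
}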

The openib BTL can return multiple modules if multiple LIDs are available. ob1/dr doesn't know that these are not separate physical devices -- it just treats each module as an equivalent mechanism for sending data.
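For context, the number of LIDs per port comes from the subnet manager's LMC setting: a port with base LID B and LMC value m answers to the 2^m LIDs B through B + 2^m - 1 (LMC is a 3-bit field, so at most 128 LIDs). A minimal sketch of that expansion follows; the struct here is hypothetical, not the openib BTL's real one.

/* Hypothetical sketch of the LID expansion; these structs are not the
   openib BTL's real data structures. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint16_t lid;            /* the LID this "module" sources traffic from */
} fake_btl_module_t;

int main(void)
{
    uint16_t base_lid = 0x10;
    unsigned lmc      = 2;            /* this port answers to 2^2 = 4 LIDs */
    unsigned nlids    = 1u << lmc;

    fake_btl_module_t modules[128];   /* LMC is a 3-bit field, so <= 128 LIDs */
    for (unsigned i = 0; i < nlids; i++) {
        modules[i].lid = (uint16_t)(base_lid + i);
        printf("module %u uses LID 0x%04x\n", i, modules[i].lid);
    }
    /* ob1/dr then sees nlids independent modules and round-robins across
       them exactly as it would across separate NICs. */
    return 0;
}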

This is actually somewhat lame as a scheme, and we talked internally about doing something more intelligent. But we decided to hold off and let people (like you!) with real-world apps and networks give this stuff a try and see what really works (and what doesn't work) before trying to implement anything else.

So -- all that explanation aside -- we'd love to hear your feedback with regards to the multi-LID stuff in Open MPI. :-)



On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:

 Thanks for that.

 Suppose there are multiple interconnects, say ethernet and
infiniband, and a million bytes of data are to be sent. In this
case the data will be sent through infiniband (since it's the fast path
-- please correct me here if I'm wrong).

  If there are multiple such sends, do you mean to say that each send
will go through different BTLs in a round-robin (RR) manner if they are
connected to the same port?

 -chev


On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic Chevchenkovic wrote:
Hi,
 It is not clear from the code in ompi/mca/pml/ob1/ (as mentioned by
you) where exactly the selection of the BTL bound to a particular LID
occurs. Could you please specify the file/function name for the same?
There is no such code there. OB1 knows nothing about LIDs. It does RR
over all available interconnects -- it can do RR between ethernet, IB
and Myrinet, for instance. The BTL presents each LID as a different
virtual HCA to OB1, and OB1 does round-robin between them without even
knowing that they are the same port of the same HCA.

Can you explain what you are trying to achieve?

 -chev


On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic Chevchenkovic wrote:
Also, could you please tell me which part of the Open MPI code needs to
be touched so that I can make some modifications to incorporate
changes regarding LID selection...

It depends on what you want to do. The part that does RR over all
available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but that code
isn't aware of the fact that it is doing RR over different LIDs rather
than different NICs (yet?).

The code that controls what LIDs will be used is in
ompi/mca/btl/openib/btl_openib_component.c.
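Purely as an illustrative sketch of the kind of policy one might experiment with there (the function name and the selection rule below are invented for this example; they are not taken from btl_openib_component.c), a static spread of peers over the 2^LMC LIDs of a port could look like:

/* Illustrative only -- this selection rule and function are invented;
   they are not taken from btl_openib_component.c. */
#include <stdio.h>
#include <stdint.h>

/* Spread peers statically over the 2^lmc LIDs of the local port. */
static unsigned pick_lid_offset(unsigned peer_rank, unsigned lmc)
{
    unsigned nlids = 1u << lmc;
    return peer_rank % nlids;
}

int main(void)
{
    uint16_t base_lid = 0x10;
    unsigned lmc      = 2;            /* 4 LIDs available on this port */

    for (unsigned peer = 0; peer < 6; peer++) {
        unsigned off = pick_lid_offset(peer, lmc);
        printf("peer %u -> source LID 0x%04x\n",
               peer, (unsigned)(base_lid + off));
    }
    return 0;
}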

On 12/4/06, Chevchenkovic Chevchenkovic <chevchenko...@gmail.com> wrote:
Is it possible to control the LID on which the sends and recvs are
posted, on either end?

On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic
wrote:
Hi,
 I had this query. I hope some expert replies to it.
I have 2 nodes connected point-to-point with an infiniband cable. There
are multiple LIDs for each of the end nodes' ports.
When I issue an MPI_Send, are the sends posted on different LIDs on each of the end nodes, or are they posted on the same LID?
 Awaiting your reply,
It depends on which version of Open MPI you are using. If you are using
the trunk or the v1.2 beta, then all available LIDs are used in an RR
fashion. Earlier versions don't support LMC.

--
                  Gleb.

--
                       Gleb.

--
                       Gleb.


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
