There are two distinct layers of software being discussed here:
- the PML (basically the back-end to MPI_SEND and friends)
- the BTL (byte transfer layer, the back-end bit movers for the ob1
and dr PMLs -- this distinction is important because there is nothing
in the PML design that forces the use of BTLs; indeed, there is at
least one current PML that does not use BTLs as the back-end bit
mover [the cm PML])
The ob1 and dr PMLs know nothing about how the back-end bitmovers
work (BTL components) -- the BTLs are given considerable freedom to
operate within their specific interface contracts.
Generally, ob1/dr queries each BTL component when Open MPI starts
up. Each BTL responds with whether it wants to run or not. If it
does, it gives back one or more modules (think of a module as an
"instance" of a component). Typically, each module corresponds to
one NIC / HCA / network endpoint. For example, if you have 2
ethernet cards, the tcp BTL will create and return 2 modules.
ob1/dr will treat these as two paths to send data (reachability is
computed as well, of course -- ob1/dr will only send data down BTLs
for which the target peer is reachable). In general, ob1/dr will
round-robin across all available BTL modules when sending large
messages (as Gleb has described). See
http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/ for a
general description of the ob1/dr protocols.
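To make that concrete, here's a rough sketch in plain C of what
round-robining a large message across whatever modules the BTLs
returned might look like. The names (btl_module_t, pml_stripe_send,
frag_size) are made up for illustration -- this is not the real Open
MPI interface, just the shape of the idea:

   #include <stddef.h>

   /* Hypothetical module handle: one per NIC / HCA / network endpoint. */
   typedef struct {
       int (*send)(const void *buf, size_t len);   /* back-end bit mover */
   } btl_module_t;

   /* Stripe a large message across all reachable modules, round-robin. */
   static int pml_stripe_send(btl_module_t *mods, int nmods,
                              const char *buf, size_t len, size_t frag_size)
   {
       int next = 0;                          /* round-robin cursor */
       size_t off = 0;
       while (off < len) {
           size_t chunk = (len - off < frag_size) ? (len - off) : frag_size;
           int rc = mods[next].send(buf + off, chunk);
           if (rc != 0)
               return rc;                     /* real code retries / fails over */
           off += chunk;
           next = (next + 1) % nmods;         /* advance to the next module */
       }
       return 0;
   }

With 2 tcp modules, even-numbered fragments go out one NIC and
odd-numbered fragments out the other; with openib and LMC > 0, the
"modules" can simply be different LIDs on the same physical port.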
The openib BTL can return multiple modules if multiple LIDs are
available. So ob1/dr doesn't know that these are not separate
physical devices -- it just treats each module as an equivalent
mechanism to send data.
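Where do the multiple LIDs come from? The port's LMC value: a port
with LMC set to n answers to 2^n consecutive LIDs starting at its
base LID. Here's a bare verbs-level sketch -- again, not Open MPI
code -- that just enumerates them:

   #include <stdio.h>
   #include <infiniband/verbs.h>

   /* Print every LID a port answers to: base_lid .. base_lid + 2^lmc - 1. */
   static void print_port_lids(struct ibv_context *ctx, uint8_t port_num)
   {
       struct ibv_port_attr attr;
       if (ibv_query_port(ctx, port_num, &attr) != 0)
           return;
       int nlids = 1 << attr.lmc;             /* LMC => 2^lmc LIDs */
       for (int i = 0; i < nlids; i++)
           printf("port %d answers to LID 0x%x\n",
                  port_num, (unsigned)(attr.lid + i));
   }

The openib BTL turns each of those LIDs into a separate module, so
from ob1/dr's point of view they are simply more paths.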
This is actually somewhat lame as a scheme, and we talked internally
about doing something more intelligent. But we decided to hold off
and let people (like you!) with real-world apps and networks give
this stuff a try and see what really works (and what doesn't work)
before trying to implement anything else.
So -- all that explanation aside -- we'd love to hear your feedback
with regards to the multi-LID stuff in Open MPI. :-)
On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:
Thanks for that.
Suppose there are multiple interconnects, say ethernet and
InfiniBand, and a million bytes of data are to be sent. In this case
the data will be sent through InfiniBand (since it's the faster path
... please correct me here if I'm wrong).
If there are multiple such sends, do you mean to say that each send
will go through different BTLs in a RR manner if they are connected
to the same port?
-chev
On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
Chevchenkovic wrote:
Hi,
It is not clear from the code you mentioned in ompi/mca/pml/ob1/
where exactly the selection of the BTL bound to a particular LID
occurs. Could you please specify the file/function name for this?
There is no such code there. OB1 knows nothing about LIDs. It does RR
over all available interconnects. It can do RR between ethernet, IB
and Myrinet, for instance. The BTL presents each LID as a different
virtual HCA to OB1, and OB1 does round-robin between them without
even knowing that they are the same port of the same HCA.
Can you explain what you are trying to achieve?
-chev
On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic
Chevchenkovic wrote:
Also, could you please tell me which part of the Open MPI code needs
to be touched so that I can make some modifications to it to
incorporate changes regarding LID selection...
It depends on what you want to do. The part that does RR over all
available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but the code
isn't aware of the fact that it is doing RR over different LIDs and
not different NICs (yet?).
The code that controls which LIDs will be used is in
ompi/mca/btl/openib/btl_openib_component.c.
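At the verbs level, "which LID" comes down to the dlid and
src_path_bits that go into the address vector when the QP is moved
to RTR -- something along these lines (a bare sketch, not the actual
BTL code):

   #include <infiniband/verbs.h>

   /* dlid selects the remote LID; src_path_bits selects which of the
      local port's 2^lmc LIDs the traffic is sourced from. */
   static int connect_qp_on_lid(struct ibv_qp *qp, uint8_t port_num,
                                uint16_t remote_lid, uint32_t remote_qpn,
                                uint8_t path_bits)
   {
       struct ibv_qp_attr attr = {
           .qp_state           = IBV_QPS_RTR,
           .path_mtu           = IBV_MTU_1024,
           .dest_qp_num        = remote_qpn,
           .rq_psn             = 0,
           .max_dest_rd_atomic = 1,
           .min_rnr_timer      = 12,
           .ah_attr = {
               .dlid          = remote_lid,  /* remote LID to target */
               .src_path_bits = path_bits,   /* local LID offset to use */
               .sl            = 0,
               .port_num      = port_num,
               .is_global     = 0,
           },
       };
       return ibv_modify_qp(qp, &attr,
                            IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                            IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                            IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
   }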
On 12/4/06, Chevchenkovic Chevchenkovic
<chevchenko...@gmail.com> wrote:
Is it possible to control the LID on which the sends and recvs are
posted, on either end?
On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic wrote:
Hi,
I had this query. I hope some expert replies to it.
I have 2 nodes connected point-to-point with an InfiniBand cable.
There are multiple LIDs for each of the end node ports.
When I give an MPI_Send, are the sends posted on different LIDs on
each of the end nodes, or are they posted on the same LID?
Awaiting your reply,
It depends on what version of Open MPI you are using. If you are
using the trunk or a v1.2 beta, then all available LIDs are used in
RR fashion. Earlier versions don't support LMC.
--
Gleb.
--
Gleb.
--
Gleb.
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems