On Wed, Dec 06, 2006 at 12:14:35PM +0530, Chevchenkovic Chevchenkovic wrote:
> Hi,
> Actually I was wondering why there is a facility for having multiple
> LIDs for the same port. This led me to the entire series of questions.
> It is still not very clear to me what the advantage of assigning
> multiple LIDs to the same port is. Does it give some performance
> advantage?
Each LID has its own path through the fabric (ideally); this is a way to
lower congestion.
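To make that concrete, here is a toy sketch of the idea (the forwarding-table
values and port numbers are invented; this is not Open MPI or subnet-manager
code): with LMC set to 2, a destination port answers to 2^2 = 4 consecutive
LIDs starting at its base LID, and the switches' forwarding tables can route
each of those LIDs out a different port, so transfers addressed to different
LIDs of the same HCA port need not share a link.

    #include <stdio.h>
    #include <stdint.h>

    /* Toy illustration only: an InfiniBand switch forwards a packet by
     * looking up its destination LID (DLID) in a forwarding table.  If a
     * destination HCA port answers to several LIDs (LMC > 0), the subnet
     * manager can point each LID at a different switch output port, so
     * traffic to different LIDs of the same HCA takes different paths. */
    #define TABLE_SIZE 64

    int main(void)
    {
        uint8_t out_port[TABLE_SIZE] = {0};

        /* Destination HCA port: base LID 0x10, LMC = 2 -> LIDs 0x10..0x13.
         * Hypothetical routing: spread the four LIDs over ports 1..4. */
        for (uint16_t lid = 0x10; lid <= 0x13; lid++) {
            out_port[lid] = (uint8_t)(1 + (lid - 0x10));
        }

        for (uint16_t lid = 0x10; lid <= 0x13; lid++) {
            printf("packets for DLID 0x%04x leave through switch port %u\n",
                   (unsigned)lid, (unsigned)out_port[lid]);
        }
        return 0;
    }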
> -Chev
>
> On 12/5/06, Jeff Squyres <jsquy...@cisco.com> wrote:
> > There are two distinct layers of software being discussed here:
> >
> > - the PML (basically the back-end to MPI_SEND and friends)
> > - the BTL (byte transfer layer, the back-end bit movers for the ob1
> > and dr PMLs -- this distinction is important because there is nothing
> > in the PML design that forces the use of BTLs; indeed, there is at
> > least one current PML that does not use BTLs as the back-end bit
> > mover [the cm PML])
> >
> > The ob1 and dr PMLs know nothing about how the back-end bit movers
> > (BTL components) work -- the BTLs are given considerable freedom to
> > operate within their specific interface contracts.
> >
> > Generally, ob1/dr queries each BTL component when Open MPI starts up.
> > Each BTL responds with whether it wants to run or not. If it does, it
> > gives back one or more modules (think of a module as an "instance" of
> > a component). Typically, these modules correspond to multiple NICs /
> > HCAs / network endpoints. For example, if you have 2 ethernet cards,
> > the tcp BTL will create and return 2 modules. ob1/dr will treat these
> > as two paths to send data (reachability is computed as well, of
> > course -- ob1/dr will only send data down BTLs for which the target
> > peer is reachable). In general, ob1/dr will round-robin across all
> > available BTL modules when sending large messages (as Gleb has
> > described). See
> > http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/ for a
> > general description of the ob1/dr protocols.
> >
> > The openib BTL can return multiple modules if multiple LIDs are
> > available. So ob1/dr doesn't know that these are not physical devices
> > -- it just treats each module as an equivalent mechanism to send data.
> >
> > This is actually somewhat lame as a scheme, and we talked internally
> > about doing something more intelligent. But we decided to hold off
> > and let people (like you!) with real-world apps and networks give
> > this stuff a try and see what really works (and what doesn't work)
> > before trying to implement anything else.
> >
> > So -- all that explanation aside -- we'd love to hear your feedback
> > with regards to the multi-LID stuff in Open MPI. :-)
> >
> > On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:
> >
> > > Thanks for that.
> > >
> > > Suppose there are multiple interconnects, say ethernet and
> > > infiniband, and a million bytes of data are to be sent. In this
> > > case the data will be sent through infiniband (since it is a fast
> > > path -- please correct me here if I am wrong).
> > >
> > > If there are multiple such sends, do you mean to say that each send
> > > will go through different BTLs in a RR manner if they are connected
> > > to the same port?
> > >
> > > -chev
> > >
> > > On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > >> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic
> > >> Chevchenkovic wrote:
> > >>> Hi,
> > >>> It is not clear from the code you mentioned, ompi/mca/pml/ob1/,
> > >>> where exactly the selection of the BTL bound to a particular LID
> > >>> occurs. Could you please specify the file/function name for the
> > >>> same?
> > >> There is no such code there. OB1 knows nothing about LIDs. It does
> > >> RR over all available interconnects. It can do RR between
> > >> ethernet, IB and Myrinet, for instance.
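Purely as an illustration of the round-robin scheduling described above (the
type and function names below are invented; this is not the actual ob1
source), a large message can be cut into fragments, with each fragment handed
to the next available BTL module in turn.  In the multi-LID case, several of
those modules are simply different LIDs of the same port:

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-in for a BTL module ("instance" of a BTL
     * component).  With multiple LIDs, several modules may map to the
     * same physical HCA port; the sender below cannot tell. */
    typedef struct {
        const char *name;
    } btl_module_t;

    /* Stripe a large message across the available modules round-robin,
     * as the quoted text describes ob1/dr doing for large messages. */
    static void send_large_message(btl_module_t *modules, size_t num_modules,
                                   size_t msg_len, size_t frag_len)
    {
        size_t next = 0;  /* module that gets the next fragment */
        for (size_t off = 0; off < msg_len; off += frag_len) {
            size_t len = (msg_len - off < frag_len) ? (msg_len - off)
                                                    : frag_len;
            printf("fragment at offset %zu (%zu bytes) -> %s\n",
                   off, len, modules[next].name);
            next = (next + 1) % num_modules;  /* round-robin */
        }
    }

    int main(void)
    {
        btl_module_t modules[] = {
            { "openib module for LID 0x10" },
            { "openib module for LID 0x11" },
            { "tcp module for eth0" },
        };
        send_large_message(modules, 3, 1000000, 65536);
        return 0;
    }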
> > >> The BTL presents each LID to OB1 as a different virtual HCA, and
> > >> OB1 does round-robin between them without even knowing that they
> > >> are the same port of the same HCA.
> > >>
> > >> Can you explain what you are trying to achieve?
> > >>
> > >>> -chev
> > >>>
> > >>> On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > >>>> On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic
> > >>>> Chevchenkovic wrote:
> > >>>>> Also, could you please tell me which part of the Open MPI code
> > >>>>> needs to be touched so that I can make some modifications to it
> > >>>>> to incorporate changes regarding LID selection...
> > >>>>>
> > >>>> It depends on what you want to do. The part that does RR over
> > >>>> all available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but
> > >>>> that code isn't aware that it is doing RR over different LIDs
> > >>>> rather than different NICs (yet?).
> > >>>>
> > >>>> The code that controls which LIDs will be used is in
> > >>>> ompi/mca/btl/openib/btl_openib_component.c.
> > >>>>
> > >>>>> On 12/4/06, Chevchenkovic Chevchenkovic
> > >>>>> <chevchenko...@gmail.com> wrote:
> > >>>>>> Is it possible to control the LID on which the sends and recvs
> > >>>>>> are posted, on either end?
> > >>>>>>
> > >>>>>> On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
> > >>>>>>> On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic
> > >>>>>>> Chevchenkovic wrote:
> > >>>>>>>> Hi,
> > >>>>>>>> I had this query. I hope some expert replies to it.
> > >>>>>>>> I have 2 nodes connected point-to-point with an infiniband
> > >>>>>>>> cable. There are multiple LIDs for each of the end node
> > >>>>>>>> ports. When I give an MPI_Send, are the sends posted on
> > >>>>>>>> different LIDs on each of the end nodes, or are they posted
> > >>>>>>>> on the same LID?
> > >>>>>>>> Awaiting your reply,
> > >>>>>>> It depends on what version of Open MPI you are using. If you
> > >>>>>>> are using the trunk or the v1.2 beta, then all available LIDs
> > >>>>>>> are used in RR fashion. The earlier versions don't support
> > >>>>>>> LMC.
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Gleb.
> > >>>>
> > >>>> --
> > >>>> Gleb.
> > >>
> > >> --
> > >> Gleb.
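To illustrate the "one module per LID" idea from the quoted messages, here is
a hypothetical sketch (the fake_btl_module_t type and the
create_modules_for_port() helper are invented; this is not the code in
btl_openib_component.c): a port configured with LMC > 0 yields 2^LMC modules,
one per LID, which is why OB1 ends up seeing several "virtual HCAs" for a
single physical port.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    /* Hypothetical, simplified stand-in for the idea described above. */
    typedef struct {
        uint16_t lid;       /* the LID this module sends/receives on */
        uint8_t  port_num;  /* physical HCA port backing this module */
    } fake_btl_module_t;

    static fake_btl_module_t *create_modules_for_port(uint8_t port_num,
                                                      uint16_t base_lid,
                                                      uint8_t lmc,
                                                      size_t *count_out)
    {
        size_t count = (size_t)1 << lmc;  /* 2^LMC LIDs per port */
        fake_btl_module_t *mods = malloc(count * sizeof(*mods));
        if (mods == NULL) {
            *count_out = 0;
            return NULL;
        }
        for (size_t i = 0; i < count; i++) {
            mods[i].lid = (uint16_t)(base_lid + i);  /* one module per LID */
            mods[i].port_num = port_num;
        }
        *count_out = count;
        return mods;
    }

    int main(void)
    {
        size_t n = 0;
        fake_btl_module_t *mods = create_modules_for_port(1, 0x10, 2, &n);
        for (size_t i = 0; i < n; i++) {
            printf("module %zu: port %u, LID 0x%04x\n",
                   i, (unsigned)mods[i].port_num, (unsigned)mods[i].lid);
        }
        free(mods);
        return 0;
    }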
> > --
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
>
--
Gleb.