Hi,

Actually, I was wondering why there is a facility for having multiple LIDs for the same port; this led me to the entire series of questions. It is still not very clear to me what the advantage of assigning multiple LIDs to the same port is. Does it give some performance advantage?

-Chev
On 12/5/06, Jeff Squyres <jsquy...@cisco.com> wrote:
There are two distinct layers of software being discussed here:

- the PML (basically the back-end to MPI_SEND and friends)
- the BTL (byte transfer layer, the back-end bit movers for the ob1 and dr PMLs -- this distinction is important because there is nothing in the PML design that forces the use of BTLs; indeed, there is at least one current PML that does not use BTLs as the back-end bit mover [the cm PML])

The ob1 and dr PMLs know nothing about how the back-end bit movers (the BTL components) work -- the BTLs are given considerable freedom to operate within their specific interface contracts.

Generally, ob1/dr queries each BTL component when Open MPI starts up. Each BTL responds with whether it wants to run or not. If it does, it gives back one or more modules (think of a module as an "instance" of a component). Typically, these modules correspond to multiple NICs / HCAs / network endpoints. For example, if you have 2 ethernet cards, the tcp BTL will create and return 2 modules. ob1/dr will treat these as two paths over which to send data (reachability is computed as well, of course -- ob1/dr will only send data down BTLs through which the target peer is reachable).

In general, ob1/dr will round-robin across all available BTL modules when sending large messages (as Gleb has described). See http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/ for a general description of the ob1/dr protocols.

The openib BTL can return multiple modules if multiple LIDs are available. So ob1/dr doesn't know that these are not separate physical devices -- it just treats each module as an equivalent mechanism for sending data.

This is actually somewhat lame as a scheme, and we have talked internally about doing something more intelligent. But we decided to hold off and let people (like you!) with real-world apps and networks give this stuff a try and see what really works (and what doesn't) before trying to implement anything else.

So -- all that explanation aside -- we'd love to hear your feedback with regard to the multi-LID stuff in Open MPI. :-)
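As a rough illustration of the round-robin scheme described above, here is a minimal, self-contained C sketch. The types and names (module_t, send_striped) are hypothetical stand-ins, not Open MPI's actual PML/BTL interface; each "module" plays the role of one path to the peer -- a NIC, an HCA port, or one LID of a port exposed by the openib BTL.

/*
 * Minimal sketch of round-robin striping across "modules"
 * (hypothetical types and names; not Open MPI's real PML/BTL interface).
 */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    const char *name;        /* e.g. "tcp:eth0" or "openib:lid 17" */
    size_t bytes_sent;       /* bookkeeping only */
} module_t;

/* Stripe one large message over all reachable modules, round-robin. */
static void send_striped(module_t *mods, int nmods,
                         size_t msg_len, size_t frag_len)
{
    int next = 0;
    for (size_t off = 0; off < msg_len; off += frag_len) {
        size_t len = (msg_len - off < frag_len) ? (msg_len - off) : frag_len;
        module_t *m = &mods[next];
        next = (next + 1) % nmods;               /* rotate to the next path */

        /* A real BTL would post the fragment here; we only count bytes. */
        m->bytes_sent += len;
        printf("fragment @%zu (%zu bytes) -> %s\n", off, len, m->name);
    }
}

int main(void)
{
    /* Two "paths": e.g. two ethernet cards, or two LIDs of one HCA port. */
    module_t mods[] = { { "path-0", 0 }, { "path-1", 0 } };
    send_striped(mods, 2, 1000000, 65536);
    return 0;
}

With two modules, even-numbered fragments go down one path and odd-numbered fragments down the other; reachability filtering and the actual fragment posting are what the real PML/BTL code adds on top of this.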
On Dec 4, 2006, at 1:27 PM, Chevchenkovic Chevchenkovic wrote:

> Thanks for that.
>
> Suppose there are multiple interconnects, say ethernet and infiniband, and a million bytes of data are to be sent; in this case the data will be sent through infiniband (since it is the fast path -- please correct me here if I am wrong).
>
> If there are multiple such sends, do you mean to say that each send will go through a different BTL in a RR manner if they are connected to the same port?
>
> -chev
>
> On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
>> On Mon, Dec 04, 2006 at 10:53:26PM +0530, Chevchenkovic Chevchenkovic wrote:
>>> Hi,
>>> It is not clear from the code, as mentioned by you, where exactly in ompi/mca/pml/ob1/ the selection of the BTL bound to a particular LID occurs. Could you please specify the file/function name for the same?
>> There is no such code there. OB1 knows nothing about LIDs. It does RR over all available interconnects. It can do RR between ethernet, IB and Myrinet, for instance. The BTL presents each LID to OB1 as a different virtual HCA, and OB1 does round-robin between them without even knowing that they are the same port of the same HCA.
>>
>> Can you explain what you are trying to achieve?
>>
>>> -chev
>>>
>>> On 12/4/06, Gleb Natapov <gl...@voltaire.com> wrote:
>>>> On Mon, Dec 04, 2006 at 01:07:08AM +0530, Chevchenkovic Chevchenkovic wrote:
>>>>> Also, could you please tell me which part of the Open MPI code needs to be touched so that I can make some modifications to incorporate changes regarding LID selection...
>>>> It depends on what you want to do. The part that does RR over all available LIDs is in the OB1 PML (ompi/mca/pml/ob1/), but that code is not aware of the fact that it is doing RR over different LIDs rather than different NICs (yet?).
>>>>
>>>> The code that controls which LIDs will be used is in ompi/mca/btl/openib/btl_openib_component.c.
>>>>
>>>>> On 12/4/06, Chevchenkovic Chevchenkovic <chevchenko...@gmail.com> wrote:
>>>>>> Is it possible to control the LID on which the sends and recvs are posted, on either end?
>>>>>>
>>>>>> On 12/3/06, Gleb Natapov <gl...@voltaire.com> wrote:
>>>>>>> On Sun, Dec 03, 2006 at 07:03:33PM +0530, Chevchenkovic Chevchenkovic wrote:
>>>>>>>> Hi,
>>>>>>>> I had this query; I hope some expert replies to it.
>>>>>>>> I have 2 nodes connected point-to-point with an infiniband cable. There are multiple LIDs for each of the end nodes' ports.
>>>>>>>> When I issue an MPI_Send, are the sends posted on different LIDs on each of the end nodes, or are they posted on the same LID?
>>>>>>>> Awaiting your reply,
>>>>>>> It depends on which version of Open MPI you are using. If you are using the trunk or the v1.2 beta, then all available LIDs are used in RR fashion. The earlier versions don't support LMC.
>>>>>>>
>>>>>>> --
>>>>>>> Gleb.
>>>> --
>>>> Gleb.
>> --
>> Gleb.

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
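For reference on the LMC behavior discussed in the quoted thread above: when the subnet manager assigns an LMC value of n to a port, that port answers to 2^n consecutive LIDs starting at its base LID. Below is a minimal C sketch of how each of those LIDs could be presented as a separate module; the helper and its names are hypothetical, not the actual btl_openib code.

/*
 * Minimal sketch of how an LMC value expands into usable LIDs
 * (hypothetical helper; not Open MPI's btl_openib implementation).
 * With LMC = n the subnet manager assigns 2^n consecutive LIDs,
 * base_lid .. base_lid + 2^n - 1, to the same physical port.
 */
#include <stdint.h>
#include <stdio.h>

static void list_port_lids(uint16_t base_lid, uint8_t lmc)
{
    unsigned nlids = 1u << lmc;           /* 2^LMC paths through one port */
    for (unsigned i = 0; i < nlids; i++) {
        /* A BTL built along these lines could create one module per LID,
         * so the PML round-robins over them as if they were separate NICs. */
        printf("module %u -> LID 0x%04x\n", i, (unsigned)(base_lid + i));
    }
}

int main(void)
{
    list_port_lids(0x0010, 2);            /* e.g. base LID 16, LMC = 2 -> 4 LIDs */
    return 0;
}

This is why, with LMC > 0, the upper layer sees several apparently independent paths even though they all terminate at the same physical port.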