Hopefully some of the other developers will correct me if I am wrong.
Brock Palen wrote:
> I had a user ask this; it's not a very practical question, but I am
> curious.
This is good information for the archives :)
> OMPI uses a 'fast' network if it's available (IB, GM, etc.). I also
> infer that for processes on the same SMP node the sm (shared memory)
> btl is used, even if the job has more than one node given to it? The
> real question is what happens if a job is given three nodes, two of
> which have IB adapters while all have ethernet. Will the entire job
> use TCP for processes on different nodes and shared memory
> intra-node? Or will the two that have IB connections use IB to
> communicate, and only use TCP when talking to the third host that
> does not have IB?
You infer correctly - sm is just considered to be another network we
support.
The two nodes with IB will use IB to communicate with each other, and
ethernet (TCP) to communicate with the third node that lacks IB. This
works the same for shared memory - MPI processes on the same node will
use SM to communicate, and use, say, IB or TCP to communicate off-node.
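To make that concrete, here is a sketch (the hostnames and file names
are made up). Given a hostfile like

node01 slots=2
node02 slots=2
node03 slots=2

where node01 and node02 have IB, a plain

$ mpirun -np 6 --hostfile myhosts ./my_app

lets Open MPI pick the best BTL per pair of peers: sm within each node,
openib between node01 and node02, and tcp for anything involving node03.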
> Second, would it be safe to say OMPI searches the BTLs in the
> following order when trying to reach a process?
> Self
> SM
> IB, GM, MX, MVAPI
> TCP
Actually, each BTL has an exclusivity value that we use to choose which
BTL is given preference when we have several BTLs available for
communication. A quick grep shows you're pretty much right on:
$ ompi_info --all|grep exclusivity
MCA btl: parameter "btl_openib_exclusivity" (current value: "1024")
MCA btl: parameter "btl_self_exclusivity" (current value: "65536")
MCA btl: parameter "btl_sm_exclusivity" (current value: "65535")
MCA btl: parameter "btl_tcp_exclusivity" (current value: "0")
These can of course be tuned, though expect trouble if you give
something a higher exclusivity than self. These numbers have no real
meaning other than their relation to one another. For example, changing
openib's exclusivity to 65000 won't change when/how it is used among the
BTLs I have above, though it would possibly change relative to
GM/MX/MVAPI if they're present.
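For example, you could try that exact change at runtime, no rebuild
needed (./my_app stands in for your own binary):

$ mpirun --mca btl_openib_exclusivity 65000 -np 4 ./my_app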
> Third, what about a hypothetical case where a node has both GM and IB
> on it? (evaluation machines)
(This is where I might be wrong.) The network with the highest
exclusivity is used for sending eager messages and the initial part of
large messages using the rendezvous protocol. Beyond that, large message
data is striped across all available BTLs for more bandwidth.
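If I remember right, the striping is weighted by the bandwidth each BTL
advertises, which you can inspect the same way - though double-check the
parameter names on your version:

$ ompi_info --all | grep bandwidth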
You probably know already that the 'btl' MCA parameter can be used to
select a set of BTLs at runtime, e.g. to use just IB (and self).
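For example (./my_app again stands in for your own binary):

$ mpirun --mca btl openib,sm,self -np 8 ./my_app

runs with only the IB, shared memory, and self BTLs; leave sm out of the
list if you want even on-node traffic to go through IB.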
> Last, does OMPI build something like a route list when MPI_Init() is
> called, so that it knows how to get to the other members of the job?
> Or is this done the first time a process needs to talk to another
> process, thus not storing any route information that is not needed?
Yes - at initialization time (and when processes are dynamically added),
each BTL is responsible for determining which other processes it can
communicate with. This information is pushed back up to the higher
levels (BML/PML) for use in scheduling decisions.
However, those BTLs that communicate over point-to-point connection
pairs do not establish connections until data needs to be sent (lazy
connection establishment). This way we do not immediately set up N^2
connections, but instead create them only as each pairwise communication
path is used.
The route information consumes relatively few resources compared to all
the buffers and handles that must be allocated for connections in most
of the BTLs.
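If you want to watch this happen - BTLs being selected and connections
coming up lazily on first use - you can turn up the BTL verbosity
(exact output varies by version):

$ mpirun --mca btl_base_verbose 100 -np 2 ./my_app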
> P.S. Not having to recompile code for different networks has made
> evaluating networks so much more enjoyable. Thank you for all the
> work on making the selection of networks 'just work'.
That was our goal - stuff should just work. Glad you appreciate it as
much as we do.
Andrew