Hopefully some of the other developers will correct me if I am wrong.
Brock Palen wrote:
> I had a user ask this; it's not a very practical question, but I am
> curious.
This is good information for the archives :)
> OMPI uses a 'fast' network if it's available (IB, GM, etc.). I also
> infer that for processes on the same SMP node the sm (shared memory)
> btl is used, even if the job has more than one node given to it? The
> real question is what happens if a job is given three nodes, two of
> which have IB adapters while all have ethernet. Will the entire job
> use TCP for processes on different nodes and shared memory
> intra-node? Or will the two that have IB connections use IB to
> communicate, and only use TCP when talking to the third host that
> does not have IB?
You infer correctly - sm is just considered to be another network we
support.
The two nodes with IB will use IB to communicate with each other, and
ethernet (TCP) to communicate with the third node that lacks IB. This
works the same for shared memory - MPI processes on the same node will
use SM to communicate, and use, say, IB or TCP to communicate off-node.
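To make that concrete, here is a sketch (the hostnames and file names
are made up). Given a hostfile like

node01 slots=2
node02 slots=2
node03 slots=2

where node01 and node02 have IB, a plain

$ mpirun -np 6 --hostfile myhosts ./my_app

lets Open MPI pick the best BTL per pair of peers: sm within each node,
openib between node01 and node02, and tcp for anything involving node03.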
> Second, would it be safe to say OMPI searches the BTLs in the
> following order when trying to reach a process?
> Self
> SM
> IB, GM, MX, MVAPI
> TCP
Actually, each BTL has an exclusivity value that we use to choose which
BTL is given preference when we have several BTLs available for
communication. A quick grep shows you're pretty much right on:
$ ompi_info --all|grep exclusivity
MCA btl: parameter "btl_openib_exclusivity" (current value: "1024")
MCA btl: parameter "btl_self_exclusivity" (current value: "65536")
MCA btl: parameter "btl_sm_exclusivity" (current value: "65535")
MCA btl: parameter "btl_tcp_exclusivity" (current value: "0")
These can of course be tuned, though expect trouble if you give
something a higher exclusivity than self. These numbers have no real
meaning other than their relation to one another. For example, changing
openib's exclusivity to 65000 won't change when/how it is used among the
BTLs I have above, though it would possibly change relative to
GM/MX/MVAPI if they're present.
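For example, you could try that exact change at runtime, no rebuild
needed (./my_app stands in for your own binary):

$ mpirun --mca btl_openib_exclusivity 65000 -np 4 ./my_app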
> Third, what about a hypothetical case where a node has both GM and IB
> on it? (evaluation machines)
(This is where I might be wrong.) The network with the highest
exclusivity is used for sending eager messages and the initial part of
large messages using the rendezvous protocol. Beyond that, large message
data is striped across all available BTLs for more bandwidth.
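If I remember right, the striping is weighted by the bandwidth each BTL
advertises, which you can inspect the same way - though double-check the
parameter names on your version:

$ ompi_info --all | grep bandwidth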
You probably know already that the 'btl' MCA parameter can be used to
select a set of BTLs at runtime, e.g. to use just IB (and self).
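For example (./my_app again stands in for your own binary):

$ mpirun --mca btl openib,sm,self -np 8 ./my_app

runs with only the IB, shared memory, and self BTLs; leave sm out of the
list if you want even on-node traffic to go through IB.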
> Last, does OMPI build something like a route list when MPI_Init() is
> called, so that it knows how to get to the other members of the job?
> Or is this done the first time a process needs to talk to another
> process, thus not storing any route information that is not needed?
Yes - at initialization time (and when processes are dynamically added),
each BTL is responsible for determining which other processes it can
communicate with. This information is pushed back up to the higher
levels (BML/PML) for use in scheduling decisions.
However, those BTLs that communicate over point-to-point connection
pairs do not establish connections until data needs to be sent (lazy
connection establishment). This way we do not immediately set up N^2
connections, but instead create them only as each pairwise communication
path is used.
The route information consumes relatively few resources compared to all
the buffers and handles that must be allocated for connections in most
of the BTLs.
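If you want to watch this happen - BTLs being selected and connections
coming up lazily on first use - you can turn up the BTL verbosity
(exact output varies by version):

$ mpirun --mca btl_base_verbose 100 -np 2 ./my_app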
> P.S. Not having to recompile code for different networks has made
> evaluating networks so much more enjoyable. Thank you for all the
> work on making the selection of networks 'just work'.
That was our goal - stuff should just work. Glad you appreciate it as
much as we do.
Andrew