I just realized that I replied directly to Matthew and not to the list.

Let me add most of my reply to the thread here on the list, in case it's 
helpful to others.  See my reply to Matthew, below.


> On Mar 5, 2016, at 10:53 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> 
> Let me try to explain a little better; I apparently didn't do a good job the 
> first time around...
> 
>> On Tuesday, March 1, 2016 9:54 PM, Jeff Squyres (jsquyres) 
>> <jsquy...@cisco.com> wrote:
>> 
>>> 1. Dolphin's implementation of IPoPCIe 
>> 
>>> If it provides a TCP stack and an IP interface, you should be able to use 
>>> Open MPI's TCP BTL interface over it.
>> 
>> Dolphin provides an optimized TCP/IP driver for IPoPCIe. Where can I learn 
>> about Open MPI's TCP BTL interface? I have looked at the Open MPI website, 
>> but there is such a vast amount of knowledge that I cannot fully absorb it 
>> in such a short amount of time.
> 
> There isn't actually much to know about Open MPI's TCP interface.  Let me 
> summarize:
> 
> - Open MPI has 2 main forms of communication:
> 1. Run-time / control traffic
> 2. MPI traffic
> 
> - In many cases, Open MPI will use TCP for #1 and whatever flavor of 
> high-performance network you have for #2 (this is a generalization, and not 
> always true, but "true enough" for the purposes of this conversation).
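> 
> (If you ever need to steer that run-time/control traffic, it goes over Open 
> MPI's "out of band" TCP channel, which has its own interface-selection MCA 
> parameter.  A sketch, assuming your management network is on eth0 -- eth0 
> here is just a placeholder name:
> 
>   mpirun --mca oob_tcp_if_include eth0 ...
> 
> You probably won't need this; I mention it only for completeness.)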
> 
> - For MPI traffic, Open MPI will basically probe all the machines that it is 
> running on when your MPI processes start up.  It will discover all high speed 
> / HPC-capable network interfaces, and use the "best" ones.  
> - For example, if OMPI finds both IP interfaces and InfiniBand devices, it 
> will prefer the InfiniBand devices because these are typically higher 
> performance than the OS IP interfaces.
> - Many non-Ethernet devices present themselves as two different interfaces to 
> the OS: one is the "native" device, the other is an emulation layer that 
> provides a normal IP (and usually also TCP) interface to the OS.
> - The "native" interfaces use custom network APIs -- i.e., not the POSIX 
> sockets API. These custom network APIs tend to be highly efficient and 
> customized to the underlying network device.
>   --> Open MPI understands and can utilize many different vendor "native" 
> network APIs.  They tend to give the best performance on these high speed HPC 
> networks.
> - The "emulated" IP/TCP interfaces, by definition, add overhead to the native 
> interface.  They typically allow applications to use the POSIX sockets API, 
> and therefore can be used by many different applications (not just MPI 
> implementations that have been coded to use the custom network APIs of the 
> native interfaces of these HPC-capable networks).  While these IP emulation 
> interfaces are less efficient than the "native" interfaces, that's usually ok 
> because their goal is to enable more applications to use them (via the POSIX 
> sockets API), not to provide the same level of performance as the native APIs.
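> 
> (A quick way to see which network components your particular Open MPI build 
> actually contains is the ompi_info command that ships with Open MPI:
> 
>   ompi_info | grep btl
> 
> Each "btl" line is one byte-transfer-layer plugin; "tcp" and "self" should 
> always be there, and something like "openib" indicates native verbs support.)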
> 
> - As a general rule of thumb, Open MPI will prefer "native" interfaces over 
> any network interface that provides the POSIX sockets API.  This generally 
> allows Open MPI to pick the highest-performing method available for 
> accessing the network.  That being said, if Open MPI doesn't find any "native" 
> interfaces, it will generally use any/all IP interfaces that it finds.
> - Open MPI does not have direct support for the Dolphin PCIe APIs, so it will 
> likely only be able to access your network via the POSIX sockets API.  Hence, 
> it will likely use TCP (i.e., as you mentioned, the Dolphin IPoPCIe emulation 
> layer).
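> 
> If you want to be explicit about it (e.g., to verify that the TCP path is 
> what's actually being used), you can restrict Open MPI to specific BTLs.  A 
> sketch; "self" is required for a process to send to itself, and "sm" adds 
> shared memory between ranks on the same machine:
> 
>   mpirun --mca btl tcp,sm,self ...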
> 
> - Using the Open MPI TCP interface *should* be largely automatic.  Open MPI 
> should just find your IP interfaces and decide to use them.
> - Open MPI has some configuration knobs for its TCP support, but you probably 
> won't want or need to configure any of them.
> - That being said, there's one exception: since your machines likely have 
> more than one IP interface (e.g., the real Ethernet device and your Dolphin 
> IPoPCIe interface), you might want to tell Open MPI to restrict itself to the 
> Dolphin IPoPCIe interface.  Otherwise, Open MPI will use *all* your Ethernet 
> devices (e.g., stripe large messages across all your IP interfaces).
> - The easiest way to limit Open MPI's selection of IP interfaces is:
> 
>   mpirun --mca btl_tcp_if_include NAME_OF_DOLPHIN_INTERFACE ...
> 
> That is, supply the "btl_tcp_if_include" ("if" = "interface") MCA parameter 
> on the mpirun command line. This parameter tells Open MPI to *only* use these 
> IP interfaces for MPI traffic.  For example, this would limit Open MPI to 
> only use eth4:
> 
>   mpirun --mca btl_tcp_if_include eth4 ...
> 
> So instead of "eth4", put whatever the name of your Dolphin interface is.  
> Then Open MPI will use *only* that IP interface (on every machine) for MPI 
> traffic.
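> 
> Two related knobs, in case they're useful: there is a corresponding 
> "btl_tcp_if_exclude" parameter to exclude interfaces instead, and both 
> parameters also accept CIDR notation, e.g.:
> 
>   mpirun --mca btl_tcp_if_include 192.168.1.0/24 ...
> 
> (The 192.168.1.0/24 subnet is just an example; use your Dolphin interface's 
> subnet.)  Note that you should set if_include or if_exclude, but not both.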
> 
> Does that help?
> 
>> What does the term "native verbs" mean, and what do you mean by a 
>> "native" interface? I am assuming you mean something that directly 
>> interfaces with the RDMA network?
> 
> "Verbs" is the name of the native Mellanox network APIs to access their IB 
> cards (specifically, the Linux library name is "libibverbs"; most people just 
> say "verbs").
> 
> Verbs has lots of different communication modes; some of them do simple sends 
> and receives of network messages across IB, others do RDMA operations across 
> IB.
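> 
> If you want to see whether a machine has any verbs-capable devices, the 
> libibverbs package ships a small diagnostic utility (assuming it is 
> installed on your system):
> 
>   ibv_devinfo
> 
> It prints each verbs device it finds, along with attributes like port state.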
> 
>> I am fairly new to networks other than the TCP/IP stack used in most OSes, 
>> so my first instinct is to use TCP, as it is all I am familiar with.
> 
> Don't worry: this is a common issue with many who are starting out in the HPC 
> world.  The learning curve for non-Ethernet networks can be a little steep if 
> you've been working with Ethernet for a long time.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
