On Mar 1, 2016, at 10:25 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> 
> <quote>
> I don't think the Open MPI TCP BTL will pass the SDP socket type when 
> creating sockets -- SDP is much lower performance than native verbs/RDMA.  
> You should use a "native" interface to your RDMA network instead (which one 
> you use depends on which kind of network you have).
> </quote>
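
(For the curious: "passing the SDP socket type" just means using a different
address family in the socket(2) call.  Here's a minimal sketch -- the
AF_INET_SDP value of 27 follows the usual OFED convention, so treat that as
an assumption and check your system headers; this is not Open MPI code:

/* Sketch only: creating an SDP socket vs. a TCP socket on Linux. */
#include <stdio.h>
#include <sys/socket.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* OFED's address family for SDP (assumption) */
#endif

int main(void)
{
    /* What the TCP BTL creates: an ordinary TCP stream socket. */
    int tcp_fd = socket(AF_INET, SOCK_STREAM, 0);

    /* An SDP socket: same stream semantics, but carried over RDMA.
       Fails with EAFNOSUPPORT unless the SDP kernel module is loaded. */
    int sdp_fd = socket(AF_INET_SDP, SOCK_STREAM, 0);

    printf("tcp fd = %d, sdp fd = %d\n", tcp_fd, sdp_fd);
    return 0;
}
)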
> 
> I have a rather naive follow-up question along this line: why is there not a 
> native mode for (garden variety) Ethernet?

There are at least three things that Ethernet-based networks do for 
acceleration / low latency:

1. Bypass the OS for injecting and receiving network packets
2. Use a wire protocol other than TCP (see the sketch after this list)
3. Include other offload functionality (e.g., RDMA, or RDMA-like capabilities)
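
To make #2 concrete, here is a rough sketch -- not Open MPI or usNIC code;
it needs root/CAP_NET_RAW, and the EtherType and interface name are
placeholders -- of sending an Ethernet frame carrying a custom, non-TCP wire
protocol via a Linux AF_PACKET socket.  Note that this still traverses the
kernel, so it does not give you #1:

/* Sketch of item #2: a raw L2 Ethernet frame with a custom wire
 * protocol.  The EtherType is from the IEEE local experimental range. */
#include <arpa/inet.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <netpacket/packet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MY_ETHERTYPE 0x88b5       /* IEEE 802 local experimental */
#define JUMBO_MTU    9000

int send_raw_frame(const char *ifname, const unsigned char dst_mac[6],
                   const void *payload, size_t len)
{
    if (len > JUMBO_MTU)
        return -1;

    int fd = socket(AF_PACKET, SOCK_RAW, htons(MY_ETHERTYPE));
    if (fd < 0)
        return -1;

    /* Build the frame: 14-byte Ethernet header + payload.  The source
       MAC is left zeroed for brevity; real code queries the NIC. */
    unsigned char frame[sizeof(struct ether_header) + JUMBO_MTU];
    struct ether_header *eh = (struct ether_header *) frame;
    memcpy(eh->ether_dhost, dst_mac, 6);
    memset(eh->ether_shost, 0, 6);
    eh->ether_type = htons(MY_ETHERTYPE);
    memcpy(frame + sizeof(*eh), payload, len);

    /* Address the frame to a specific NIC and destination MAC. */
    struct sockaddr_ll addr;
    memset(&addr, 0, sizeof(addr));
    addr.sll_family   = AF_PACKET;
    addr.sll_protocol = htons(MY_ETHERTYPE);
    addr.sll_ifindex  = if_nametoindex(ifname);
    addr.sll_halen    = 6;
    memcpy(addr.sll_addr, dst_mac, 6);

    ssize_t n = sendto(fd, frame, sizeof(*eh) + len, 0,
                       (struct sockaddr *) &addr, sizeof(addr));
    close(fd);
    return n < 0 ? -1 : 0;
}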

Enabling these things typically requires additional support from the NIC's 
drivers and/or firmware.  Hence, you typically can't just take any old Ethernet 
NIC and expect that the above three things work.

Several Ethernet NIC vendors have enabled these kinds of things in their NICs 
(e.g., I am on the usNIC team at Cisco, where we enable these things on the 
Cisco NIC in our UCS server line).

There was a project a few years ago called Open-MX that used the generic 
Ethernet driver in Linux to accomplish #2 for just about any Ethernet NIC, but 
it never really caught on, and has since bit-rotted.

> Is it because it lacks the end-to-end guarantees of TCP, InfiniBand, and the 
> like? These days, switched Ethernet is very reliable, isn't it? (I mean in 
> terms of the rate of packet drops due to congestion.) So if the application 
> only needs data chunks of around 8KB max, which would not need to be 
> fragmented (using jumbo frames), won't native Ethernet be much more efficient?

The Cisco usNIC stack was initially OS-bypass injection of simple L2 Ethernet 
frames.  It did all of its own retransmission and whatnot in Open MPI itself 
(*all* network types have drops and/or frame corruption, due to congestion and 
lots of other everyday kinds of traffic management -- *some* layer in the 
network has to handle such drops/retransmits if you want the network to look 
reliable to higher levels in the stack).
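
As a rough illustration of what handling that yourself involves -- purely a
sketch with made-up names, not usNIC's actual data structures -- the sender
side boils down to sequence numbers, a window of unacked frames, and a
timeout-driven retransmit pass in the progress loop:

/* Sketch of sender-side reliability when the wire gives no guarantees. */
#include <stdint.h>
#include <string.h>
#include <time.h>

#define WINDOW        64          /* max outstanding frames */
#define RETRANS_NSEC  200000000L  /* 200 ms retransmit timeout */
#define MAX_PAYLOAD   9000        /* jumbo-frame sized buffer */

struct pending {
    uint32_t        seq;          /* sequence number on the wire */
    struct timespec sent_at;      /* time of last (re)transmission */
    size_t          len;
    char            payload[MAX_PAYLOAD];
    int             in_use;
};

static struct pending window[WINDOW];
static uint32_t next_seq;

/* Stand-in for the real frame-injection path (e.g., OS-bypass send). */
extern void wire_send(uint32_t seq, const void *buf, size_t len);

int reliable_send(const void *buf, size_t len)
{
    struct pending *p = &window[next_seq % WINDOW];
    if (len > MAX_PAYLOAD || p->in_use)
        return -1;                /* too big, or window full */
    p->seq = next_seq++;
    p->len = len;
    memcpy(p->payload, buf, len);
    clock_gettime(CLOCK_MONOTONIC, &p->sent_at);
    p->in_use = 1;
    wire_send(p->seq, p->payload, p->len);
    return 0;
}

/* Peer sent a cumulative ACK: retire everything up to that sequence
   number (sequence wraparound handling omitted for brevity). */
void handle_ack(uint32_t acked_seq)
{
    for (int i = 0; i < WINDOW; i++)
        if (window[i].in_use && window[i].seq <= acked_seq)
            window[i].in_use = 0;
}

/* Called from the progress loop: resend anything that has been
   outstanding longer than the retransmit timeout. */
void retransmit_stale(void)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    for (int i = 0; i < WINDOW; i++) {
        if (!window[i].in_use)
            continue;
        long d = (now.tv_sec - window[i].sent_at.tv_sec) * 1000000000L
               + (now.tv_nsec - window[i].sent_at.tv_nsec);
        if (d >= RETRANS_NSEC) {
            wire_send(window[i].seq, window[i].payload, window[i].len);
            window[i].sent_at = now;
        }
    }
}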

We eventually "upgraded" usNIC to the UDP wire protocol because our customers 
told us that they want to switch usNIC traffic around L3 networks in their 
datacenter.  We typically use jumbo frames to get good bandwidth.  The addition 
of a few bytes per packet (i.e., the size comparison of a raw L2 ethernet frame 
vs. a UDP packet) is typically not enough to affect the bandwidth curve for 
large packets -- especially when using jumbo frames.  Additionally, Cisco gear 
switches L2 and L3 packets at exactly the same speed, so we don't lose any 
native fabric performance by upgrading from L2 frames to UDP packets.
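
To put rough numbers on that header cost (assuming a minimal 20-byte IPv4 
header and the 8-byte UDP header):

  9000-byte jumbo MTU, raw L2:     9000 bytes of payload per frame
  9000-byte jumbo MTU, UDP/IPv4:   9000 - 20 (IP) - 8 (UDP) = 8972 bytes
  Per-frame header cost:           28 / 9000 = ~0.3%
  (vs. 28 / 1500 = ~1.9% with standard 1500-byte frames)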

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
