On Mar 1, 2016, at 10:25 PM, dpchoudh . <dpcho...@gmail.com> wrote:
> 
> <quote>
> I don't think the Open MPI TCP BTL will pass the SDP socket type when
> creating sockets -- SDP is much lower performance than native
> verbs/RDMA.  You should use a "native" interface to your RDMA network
> instead (which one you use depends on which kind of network you have).
> </quote>
> 
> I have a rather naive follow-up question along this line: why is there
> not a native mode for (garden variety) Ethernet?
There are at least three things that Ethernet-based networks do for
acceleration / low latency:

1. Bypass the OS for injecting and receiving network packets
2. Use a wire protocol other than TCP (there's a toy sketch of this at
   the end of this mail)
3. Include other offload functionality (e.g., RDMA, or RDMA-like
   capabilities)

Enabling these things typically requires additional support from the
NIC's drivers and/or firmware.  Hence, you typically can't just take any
old Ethernet NIC and expect the above three things to work.  Several
Ethernet NIC vendors have enabled these kinds of things in their NICs
(e.g., I am on the usNIC team at Cisco, where we enable these things on
the Cisco NIC in our UCS server line).  There was a project a few years
ago called OpenMX that used the generic Ethernet driver in Linux to
accomplish #2 for just about any Ethernet NIC, but it never really
caught on, and it has since bit-rotted.

> Is it because it lacks the end-to-end guarantees of TCP, Infiniband,
> and the like?  These days, switched Ethernet is very reliable, isn't
> it?  (I mean in terms of the rate of packet drop because of
> congestion.)  So if the application only needs data chunks of around
> 8KB max, which would not need to be fragmented (using jumbo frames),
> won't native Ethernet be much more efficient?

The Cisco usNIC stack was initially OS-bypass injection of simple L2
Ethernet frames.  It did all of its own retransmission and whatnot in
Open MPI itself: *all* network types have drops and/or frame corruption,
due to congestion and lots of other everyday kinds of traffic
management, so *some* layer in the network has to handle those
drops/retransmits if you want the network to look reliable to a higher
level in the stack (there's a toy retransmission sketch at the end of
this mail).

We eventually "upgraded" usNIC to the UDP wire protocol because our
customers told us that they want to switch usNIC traffic across the L3
networks in their datacenters.

We typically use jumbo frames to get good bandwidth.  The addition of a
few bytes per packet (i.e., the size of a raw L2 Ethernet frame vs. a
UDP packet) is typically not enough to affect the bandwidth curve for
large packets -- especially when using jumbo frames (the arithmetic is
spelled out at the end of this mail).  Additionally, Cisco gear switches
L2 and L3 packets at exactly the same speed, so we don't lose any native
fabric performance by upgrading from L2 frames to UDP packets.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
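
P.S.  Here's the promised toy sketch of item #2 -- carrying a custom
(non-TCP) protocol directly in an L2 Ethernet frame via Linux AF_PACKET
sockets.  To be clear: the kernel still processes every send here, so
this is NOT OS bypass (that requires vendor driver/firmware support, as
described above); it only illustrates the "wire protocol other than TCP"
idea.  The interface name "eth0", the broadcast destination MAC, and
EtherType 0x88B5 (an IEEE "local experimental" value) are all
placeholder choices:

/* Send one frame of a made-up protocol in a raw L2 Ethernet frame.
 * Requires CAP_NET_RAW (e.g., run as root). */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>          /* htons() */
#include <sys/socket.h>
#include <linux/if_packet.h>    /* struct sockaddr_ll */
#include <net/ethernet.h>       /* struct ether_header, ETH_ALEN, ETH_ZLEN */
#include <net/if.h>             /* if_nametoindex() */

int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_ll addr = { 0 };
    addr.sll_family  = AF_PACKET;
    addr.sll_ifindex = if_nametoindex("eth0");  /* placeholder NIC name */
    addr.sll_halen   = ETH_ALEN;
    memset(addr.sll_addr, 0xff, ETH_ALEN);      /* broadcast destination */
    if (addr.sll_ifindex == 0) { perror("if_nametoindex"); return 1; }

    /* Build the frame: 14-byte Ethernet header, then our own payload. */
    unsigned char frame[ETH_ZLEN] = { 0 };      /* min frame size (no FCS) */
    struct ether_header *eh = (struct ether_header *) frame;
    memset(eh->ether_dhost, 0xff, ETH_ALEN);
    /* eh->ether_shost left zeroed for brevity; real code queries the NIC */
    eh->ether_type = htons(0x88B5);             /* experimental EtherType */
    strcpy((char *) (frame + sizeof(*eh)), "my custom wire protocol");

    if (sendto(fd, frame, sizeof(frame), 0,
               (struct sockaddr *) &addr, sizeof(addr)) < 0)
        perror("sendto");
    close(fd);
    return 0;
}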
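
And here's the toy illustration of the "*some* layer has to handle
drops/retransmits" point: a stop-and-wait sender over UDP that
retransmits on timeout.  This is deliberately far simpler than what the
real usNIC BTL does (no sliding window, no congestion handling, etc.);
the 4-byte sequence-number header, 100 ms timeout, and 5-try limit are
all invented for the example:

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Send one payload reliably over the (unreliable) UDP socket fd,
 * retransmitting until the peer echoes back our sequence number. */
int send_reliably(int fd, const struct sockaddr_in *peer,
                  const void *payload, size_t len, uint32_t seq)
{
    unsigned char pkt[2048];
    if (len > sizeof(pkt) - sizeof(seq))
        return -1;                          /* caller must fragment */
    memcpy(pkt, &seq, sizeof(seq));         /* tiny header: seq number */
    memcpy(pkt + sizeof(seq), payload, len);

    /* Bound how long we wait for each ACK. */
    struct timeval tmo = { .tv_sec = 0, .tv_usec = 100 * 1000 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));

    for (int attempt = 0; attempt < 5; ++attempt) {
        sendto(fd, pkt, sizeof(seq) + len, 0,
               (const struct sockaddr *) peer, sizeof(*peer));

        uint32_t ack;
        ssize_t n = recvfrom(fd, &ack, sizeof(ack), 0, NULL, NULL);
        if (n == (ssize_t) sizeof(ack) && ack == seq)
            return 0;                       /* ACKed; we're done */
        /* Timed out or got a stale ACK: loop around and retransmit. */
    }
    return -1;                              /* give up; peer is gone? */
}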
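
Finally, the jumbo-frame arithmetic, assuming a 9000-byte MTU and plain
IPv4 with no options (both cases pay the same 14-byte Ethernet header,
so I'm ignoring it):

  raw L2 frame:   9000 bytes of payload per frame
  UDP over IPv4:  9000 - 20 (IP) - 8 (UDP) = 8972 bytes of payload
  cost of the upgrade:  28 / 9000 = ~0.3% of bandwidth

At a standard 1500-byte MTU, the same 28 bytes would cost ~1.9%, which
is part of why jumbo frames matter for bandwidth.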