Some of these points are a bit dated, so allow me to make some updates. I'm sure 
you are aware that most 10Gbit switches these days are cut-through and not 
store-and-forward: that's Arista, HP, Dell Force10, Mellanox, and IBM/Blade. 
Cisco has a mix of things, but they aren't really in the low-latency space. The 
10G and 40G port-to-port forwarding time is down in the nanoseconds. Deep 
buffering is mostly reserved for carrier operations these days, and even there 
it is becoming less common because of the toll it takes on things like IP video 
and VoIP. Buffers are still good for web farms, and to a certain extent for 
storage servers or WAN links where there is a high degree of contention from 
disparate traffic.
  
At the physical level, the signalling of IB and Ethernet (10G+) is very 
similar, which is why Mellanox can make a single chip that does 10Gbit and 
40Gbit Ethernet as well as QDR and FDR InfiniBand on any port.
There are also a fair number of vendors that support RDMA on Ethernet NICs now, 
such as SolarFlare with its OpenOnload technology.

The main reason higher speed gives the lowest achievable latency is that 
serialization delay is roughly the inverse of bandwidth. But the higher-level 
protocols you stack on top contribute much more than the hardware's theoretical 
minimums or maximums. TCP/IP is a killer in terms of added overhead; that's why 
there are protocols like iSER, SRP, and friends. RDMA is much faster than going 
through the kernel overhead of TCP session setup, the host-side user/kernel 
boundary crossings, and buffering. PCI latency alone is higher than the 
port-to-port latency of a good 10G switch, never mind 40G or FDR InfiniBand.
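To put rough numbers on the bandwidth part, here is a quick back-of-the-envelope 
sketch (plain C, nothing platform specific) of the wire serialization time for a 
full-size frame at a few line rates. It deliberately ignores switch forwarding, 
PCI, and the protocol stack, which as I said usually dominate.

    /* Rough sketch: wire serialization delay for one frame at different
     * line rates.  This models only the bandwidth-dependent part of
     * latency; switch forwarding, PCI, and stack overhead are ignored. */
    #include <stdio.h>

    int main(void)
    {
        const double frame_bits = 1500 * 8;           /* full-size Ethernet payload */
        const double rates_gbps[] = { 10.0, 40.0, 56.0 };
        const size_t n = sizeof(rates_gbps) / sizeof(rates_gbps[0]);

        for (size_t i = 0; i < n; i++) {
            /* bits divided by (Gbit/s) comes out directly in nanoseconds */
            double ns = frame_bits / rates_gbps[i];
            printf("%5.0f Gbit/s: %7.1f ns to serialize one 1500-byte frame\n",
                   rates_gbps[i], ns);
        }
        return 0;
    }

At 10 Gbit/s that is about 1200 ns per frame, at 40 Gbit/s about 300 ns, which 
is why the per-hop numbers shrink with speed even before any protocol is involved.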

There is even a special layer on InfiniBand, called verbs, that you can write 
custom protocols against to push latency down further.
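To give a feel for what that looks like, here is a minimal, illustrative-only 
libibverbs sketch of the setup an application does before posting RDMA work: 
open the HCA, register a buffer so the adapter can DMA to it, and create a 
queue pair. The connection handshake with the peer and error handling are 
left out.

    /* Minimal libibverbs sketch: open an HCA, register a buffer, and create
     * a queue pair -- the setup done before posting RDMA reads/writes that
     * bypass the kernel on the data path.  Error handling and the
     * out-of-band exchange of QP/rkey info with the peer are omitted.
     * Build with: cc rdma_sketch.c -libverbs */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no RDMA devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd      *pd  = ibv_alloc_pd(ctx);

        /* Pin and register memory so the HCA can DMA into/out of it directly. */
        size_t len = 4096;
        void  *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        /* Completion queue and queue pair: work requests posted here go
         * straight to the adapter, with no per-message kernel involvement. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
        struct ibv_qp_init_attr qpa;
        memset(&qpa, 0, sizeof(qpa));
        qpa.send_cq = cq;
        qpa.recv_cq = cq;
        qpa.qp_type = IBV_QPT_RC;            /* reliable connected */
        qpa.cap.max_send_wr  = 16;
        qpa.cap.max_recv_wr  = 16;
        qpa.cap.max_send_sge = 1;
        qpa.cap.max_recv_sge = 1;
        struct ibv_qp *qp = ibv_create_qp(pd, &qpa);

        printf("QP %u ready; rkey 0x%x would be sent to the peer out of band\n",
               qp->qp_num, mr->rkey);

        ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }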

InfiniBand is inherently a layer 1 and layer 2 protocol, and the subnet manager 
(software) is responsible for setting up all virtual circuits (routes between 
hosts on the fabric) and rerouting when a path goes bad. Also, the link 
aggregation, as you mention, is rock solid and amazingly good. Auto-rerouting 
is fabulous and super fast. But you don't get layer 3. TCP/IP over IB works out 
of the box, but adds a lot of overhead. Still, it makes it possible to run 
native IB and IP over IB, with gateways to a TCP network, over a single cable. 
That's pretty cool.
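To illustrate the "out of the box" part: once an IPoIB interface has an IP 
address, completely ordinary sockets code runs over it unchanged, it just pays 
the usual kernel TCP/IP cost. The address and port below are made-up 
placeholders, not anything from a real setup.

    /* Sketch of the "works out of the box" point: nothing in plain sockets
     * code knows or cares that the route goes over InfiniBand via IPoIB,
     * but every byte still pays the full kernel TCP/IP cost.
     * 192.168.7.10:5001 is a made-up placeholder on the IPoIB subnet. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);   /* plain TCP socket */

        struct sockaddr_in peer;
        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(5001);
        inet_pton(AF_INET, "192.168.7.10", &peer.sin_addr);

        if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0) {
            const char msg[] = "hello over IPoIB\n";
            write(fd, msg, sizeof(msg) - 1);
        } else {
            perror("connect");
        }
        close(fd);
        return 0;
    }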


Sent from my Android device.

-----Original Message-----
From: "Edward Ned Harvey (openindiana)" <openindi...@nedharvey.com>
To: Discussion list for OpenIndiana <openindiana-discuss@openindiana.org>
Sent: Tue, 16 Apr 2013 10:49 AM
Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage 
(OpenIndiana-discuss Digest, Vol 33, Issue 20)

> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> It would be difficult to believe that 10Gbit Ethernet offers better
> bandwidth than 56Gbit Infiniband (the current offering).  The switching
> model is quite similar.  The main reason why IB offers better latency
> is a better HBA hardware interface and a specialized stack.  5X is 5X.

Put another way, the reason InfiniBand has so much higher throughput and lower 
latency than Ethernet is that the switching (at the physical layer) is 
completely different, and messages are passed directly from user level to RAM 
on the remote system via RDMA, bypassing the OSI layer model and other kernel 
overhead.  I read a paper from VMware where they implemented RDMA over Ethernet 
and doubled the speed of vMotion (but it was still slower than InfiniBand, by 
something like 4x).

Besides bypassing the OSI layers and kernel latency, IB latency is lower 
because Ethernet switches use store-and-forward buffering managed by the 
switch's backplane: a sender sends a packet to a buffer on the switch, which 
pushes it through the backplane and finally into another buffer at the 
destination port.  IB uses crossbar, or cut-through, switching, in which the 
sending host channel adapter signals the destination address to the switch and 
waits for the channel to be opened.  Once the channel is opened it stays open, 
and the switch in between does little more than signal amplification (plus 
additional virtual lanes for congestion management and other functions).  The 
sender writes directly to RAM on the destination via RDMA, with no buffering in 
between and no OSI layer model to traverse.  Hence much lower latency.

IB also has native link aggregation into data-striped lanes, hence the 1x, 4x, 
and 12x designations and the 40Gbit specifications.  Something similar is 
quasi-possible in Ethernet via LACP, but it is not as good and not the same.  
IB guarantees packets are delivered in order, with native congestion control, 
whereas Ethernet may drop packets and leave TCP to detect and retransmit 
them...  

Ethernet includes a lot of support for IP addressing and variable link speeds 
(10Gbit, 10/100, 1G, etc.), all of it asynchronous.  For these reasons, IB is 
not a suitable replacement for the IP communications done on Ethernet, with 
their variable peer-to-peer and broadcast traffic.  IB is designed for networks 
where systems establish connections to other systems and those connections 
remain mostly static.  Primarily clustering & storage networks.  Not primarily 
TCP/IP.


_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
