WOW!! Thank you for your time Rick! Awesome answer!! =D

I'll do these tests (with ethtool GRO / CKO) tonight, but do you think this is the main root of the problem?!

I mean, I'm seeing two distinct problems here:

1- Slow connectivity to the External network, plus SSH lag all over the cloud (everything that passes through L3 / Namespace is problematic), and;

2- Communication between two Instances on different hypervisors (i.e. maybe it is related to this GRO / CKO thing).

So, two different problems, right?!

Thanks!
Thiago
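
As a reference for the "ethtool GRO / CKO" tests mentioned above, here is a rough Python sketch of how the toggling could be scripted. It is only an illustration, not something from the thread: the interface name eth0 is a placeholder, the commands need root, and only standard ethtool options are used (-k to show offload state, -K to change a feature).

    import subprocess

    IFACE = "eth0"  # placeholder; substitute the real interface name

    def show_offloads(iface=IFACE):
        # "ethtool -k <iface>" prints the current offload settings (GRO, TSO, rx/tx checksumming, ...)
        return subprocess.run(["ethtool", "-k", iface],
                              capture_output=True, text=True, check=True).stdout

    def set_feature(feature, state, iface=IFACE):
        # e.g. set_feature("gro", "off") runs "ethtool -K eth0 gro off" (requires root)
        subprocess.run(["ethtool", "-K", iface, feature, state], check=True)

    if __name__ == "__main__":
        print(show_offloads())
        set_feature("gro", "off")  # disable Generic Receive Offload
        set_feature("rx", "off")   # disable receive checksum offload
        set_feature("tx", "off")   # disable transmit checksum offload
        print(show_offloads())
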
On 25 October 2013 18:56, Rick Jones <rick.jon...@hp.com> wrote:

> > Listen, maybe this sounds too dumb on my part but, it is the first time I'm talking about this stuff (like "NIC peer-into GRE"?, or GRO / CKO...
>
> No worries.
>
> So, a slightly brief history of stateless offloads in NICs. It may be too basic, and I may get some details wrong, but it should give the gist.
>
> Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token Ring fans can keep quiet :). Systems got faster than 10 Mbit/s. By a fair margin. 100BT came out, and it wasn't all that long before systems were faster than that, but things like interrupt rates were starting to become an issue for performance, so 100BT NICs started implementing interrupt-avoidance heuristics. The next bump in network speed, to 1000 Mbit/s, managed to get well out ahead of the systems. All this time, while the link speeds were increasing, the IEEE was doing little to nothing to make sending and receiving Ethernet traffic any easier on the end stations (e.g. increasing the MTU). It was taking just as many CPU cycles to send/receive a frame over 1000BT as it did over 100BT as it did over 10BT.
>
> <insert segue about how FDDI was doing things to make life easier, as well as what the FDDI NIC vendors were doing to enable copy-free networking, here>
>
> So the Ethernet NIC vendors started getting creative and started borrowing some techniques from FDDI. The base of it all is CKO - ChecKsum Offload - offloading the calculation of the TCP and UDP checksums. In broad handwaving terms, for inbound packets, the NIC is either made smart enough to recognize an incoming frame as a TCP segment (or UDP datagram), or it performs the Internet Checksum across the entire frame and leaves it to the driver to fix up. For outbound traffic, the stack, via the driver, tells the NIC a starting value (perhaps), where to start computing the checksum, how far to go, and where to stick it...
>
> So, we can save the CPU cycles used calculating/verifying the checksums. In rough terms, in the presence of copies, that is perhaps a 10% or 15% saving. Systems still needed more. It was just as many trips up and down the protocol stack in the host to send a MB of data as it was before - the IEEE hanging on to the 1500 byte MTU. So, some NIC vendors came up with Jumbo Frames - I think the first may have been Alteon, with their AceNICs and switches. A 9000 byte MTU allows one to send bulk data across the network in ~1/6 the number of trips up and down the protocol stack. But that has problems - in particular, you have to have support for Jumbo Frames from end to end.
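
To make the CKO description a couple of paragraphs up more concrete, here is a minimal Python sketch (not from the thread) of the Internet Checksum the NIC computes on the host's behalf: a 16-bit one's-complement sum in the style of RFC 1071. It is simplified - real TCP/UDP checksums also cover a pseudo-header, which is omitted here.

    def internet_checksum(data: bytes) -> int:
        """16-bit one's-complement sum over the data (RFC 1071 style, pseudo-header omitted)."""
        if len(data) % 2:
            data += b"\x00"                              # pad odd-length data with a zero byte
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]        # add the next 16-bit word
            total = (total & 0xFFFF) + (total >> 16)     # fold any carry back into the low 16 bits
        return ~total & 0xFFFF                           # one's complement of the sum

    if __name__ == "__main__":
        print(hex(internet_checksum(b"an example payload")))

Offloading this loop (and the matching verification on receive) is all that CKO does, which is consistent with the modest 10-15% saving quoted above.
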
> So someone, I don't recall who, had the flash of inspiration - What If... the NIC could perform the TCP segmentation on behalf of the stack? When sending a big chunk of data over TCP in one direction, the only things which change from TCP segment to TCP segment are the sequence number and the checksum <insert some handwaving about the IP datagram ID here>. The NIC already knows how to compute the checksum, so let's teach it how to very simply increment the TCP sequence number. Now we can give it A Lot of Data (tm) in one trip down the protocol stack and save even more CPU cycles than Jumbo Frames. Now the NIC has to know a little bit more about the traffic - it has to know that it is TCP, so it can know where the TCP sequence number goes. We also tell it the MSS to use when it is doing the segmentation on our behalf. Thus was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames".
>
> That works pretty well for servers at the time - they tend to send more data than they receive. The clients receiving the data don't need to be able to keep up at 1000 Mbit/s, and the server can be sending to multiple clients. However, we get another order-of-magnitude bump in link speeds, to 10000 Mbit/s. Now people need/want to receive at the higher speeds too. So some 10 Gbit/s NIC vendors come up with the mirror image of TSO and call it LRO - Large Receive Offload. The LRO NIC will coalesce several consecutive TCP segments into one uber-segment and hand that to the host. There are some "issues" with LRO though - for example when a system is acting as a router - so in Linux, and perhaps other stacks, LRO is taken out of the hands of the NIC and given to the stack in the form of "GRO" - Generic Receive Offload. GRO operates above the NIC/driver, but below IP. It detects the consecutive segments and coalesces them before passing them further up the stack. It becomes possible to receive data at link-rate over 10 GbE. All is happiness and joy.
>
> OK, so now we have all these "stateless" offloads that know about the basic traffic flow. They are all built on the foundation of CKO. They are all dealing with *un*encapsulated traffic. (They also don't do anything for small packets.)
>
> Now, toss in some encapsulation. Take your pick; in the abstract it doesn't really matter which, I suspect, at least for a little longer. What is arriving at the NIC on inbound is no longer a TCP segment in an IP datagram in an Ethernet frame - it is all of that wrapped up in the encapsulation protocol. Unless the NIC knows about the encapsulation protocol, all the NIC knows is that it has some slightly alien packet. It will probably know it is IP, but it won't know more than that.
>
> It could, perhaps, simply compute an Internet Checksum across the entire IP datagram and leave it to the driver to fix up. It could simply punt and not perform any CKO at all. But CKO is the foundation of the stateless offloads. So, certainly no LRO, and (I think, but could be wrong) no GRO. (At least not until the Linux stack learns how to look beyond the encapsulation headers.)
>
> Similarly, consider the outbound path. We could perhaps change the constants we tell the NIC for doing CKO, but unless it knows about the encapsulation protocol, we cannot ask it to do the TCP segmentation of TSO - it would have to start replicating not only the TCP and IP headers, but also the headers of the encapsulation protocol. So, there goes TSO.
>
> In essence, using an encapsulation protocol takes us all the way back to the days of 100BT insofar as stateless offloads are concerned. Perhaps to the early days of 1000BT.
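
As a toy model of the TSO behaviour described above (illustrative only, not driver code): the host hands down one large chunk plus an MSS, and the "NIC" emits MSS-sized segments, bumping only the TCP sequence number for each one. The per-segment checksum (as in the earlier sketch) and the IP ID handling are left out, and the 1448-byte MSS in the example is just a typical value for a 1500-byte MTU.

    def tso_segment(payload: bytes, mss: int, initial_seq: int):
        """Split one large send into MSS-sized segments, as a TSO-capable NIC would."""
        segments = []
        seq = initial_seq
        for off in range(0, len(payload), mss):
            chunk = payload[off:off + mss]
            segments.append((seq, chunk))            # real hardware also rewrites IP ID and checksums
            seq = (seq + len(chunk)) & 0xFFFFFFFF    # TCP sequence numbers wrap at 2**32
        return segments

    if __name__ == "__main__":
        segs = tso_segment(b"x" * 20000, mss=1448, initial_seq=1000)
        print(len(segs), "segments, first sequence numbers:", [s for s, _ in segs[:3]])

The outbound point above is that once the payload is wrapped in GRE, the NIC would also have to replicate the outer headers for every segment it emits, which is exactly what a GRE-unaware NIC cannot do.
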
> We do have a bit more CPU grunt these days, but for the last several years that has come primarily in the form of more cores per processor, not in the form of processors with higher and higher frequencies. In broad handwaving terms, single-threaded performance is not growing all that much, if at all.
>
> That is why we now have things like multiple queues per NIC port and Receive Side Scaling (RSS), or Receive Packet Steering/Receive Flow Steering in Linux (or Inbound Packet Scheduling/Thread Optimized Packet Scheduling in HP-UX, etc. etc.). RSS works by having the NIC compute a hash over selected headers of the arriving packet - perhaps the source and destination MAC addresses, perhaps the source and destination IP addresses, and perhaps the source and destination TCP ports. But now the arriving traffic is all wrapped up in this encapsulation protocol that the NIC might not know about. Over what should the NIC compute the hash with which to pick the queue that then picks the CPU to interrupt? It may just punt and send all the traffic up one queue.
>
> There are similar sorts of hashes being computed at either end of a bond/aggregate/trunk. And the switches or bonding drivers making those calculations may not know about the encapsulation protocol, so they may not be able to spread traffic across multiple links. The information they used to use is now hidden from them by the encapsulation protocol.
>
> That, then, is what I was getting at when talking about NICs peering into GRE.
>
> rick jones
> All I want for Christmas is a 32 bit VLAN ID and NICs and switches which understand it... :)
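
To illustrate the RSS point above, here is a hypothetical Python sketch - zlib.crc32 merely stands in for whatever hash a real NIC implements (often a Toeplitz hash), and the addresses are made up. When the NIC can see the inner 5-tuple, distinct flows spread across the receive queues; once everything travels inside one GRE tunnel between the same two hypervisor addresses, every flow hashes to the same queue (and the same CPU).

    import zlib

    N_QUEUES = 8

    def pick_queue(src_ip, dst_ip, src_port, dst_port, n_queues=N_QUEUES):
        # Stand-in for the NIC's receive hash over the headers it can parse.
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
        return zlib.crc32(key) % n_queues

    # Un-encapsulated: distinct inner flows generally land on different queues.
    flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 22) for i in range(6)]
    print([pick_queue(*f) for f in flows])

    # GRE-encapsulated: the NIC only sees the outer tunnel endpoints (and no
    # TCP/UDP ports), so the hash input is identical for every inner flow.
    OUTER = ("192.168.1.10", "192.168.1.20", 0, 0)   # hypothetical hypervisor addresses
    print([pick_queue(*OUTER) for _ in flows])

The same collapse applies to the hashes a bonding driver or switch uses to pick a link in an aggregate, which is the second half of the paragraph above.
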
_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to     : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack