> Listen, maybe this sounds too dumb from my part but, it is the first
> time I'm talking about this stuff (like "NIC peer-into GRE"?, or GRO
> / CKO...
No worries. So, a slightly brief history of stateless offloads in NICs. It may be too basic, and I may get some details wrong, but it should give the gist.

Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token Ring fans can keep quiet :). Systems got faster than 10 Mbit/s. By a fair margin. 100BT came out, and it wasn't all that long before systems were faster than that too, but things like interrupt rates were starting to become an issue for performance, so 100BT NICs started implementing interrupt-avoidance heuristics.

The next bump in network speed, to 1000 Mbit/s, managed to get well out ahead of the systems. All this time, while the link speeds were increasing, the IEEE was doing little to nothing to make sending and receiving Ethernet traffic any easier on the end stations (eg increasing the MTU). It was taking just as many CPU cycles to send/receive a frame over 1000BT as it did over 100BT as it did over 10BT. <insert segue about how FDDI was doing things to make life easier, as well as what the FDDI NIC vendors were doing to enable copy-free networking, here> So the Ethernet NIC vendors started getting creative and borrowed some techniques from FDDI.

The base of it all is CKO - ChecKsum Offload - offloading the calculation of the TCP and UDP checksums. In broad handwaving terms, for inbound packets the NIC is made either smart enough to recognize an incoming frame as a TCP segment (or UDP datagram), or it performs the Internet Checksum across the entire frame and leaves it to the driver to fix up. For outbound traffic the stack, via the driver, tells the NIC a starting value (perhaps), where to start computing the checksum, how far to go, and where to stick it... So we can save the CPU cycles used calculating/verifying the checksums. In rough terms, in the presence of copies, that is perhaps a 10% or 15% savings. (There is a little sketch of that checksum calculation below.)

Systems still needed more. It took just as many trips up and down the protocol stack in the host to send a MB of data as it did before - the IEEE hanging on to the 1500 byte MTU. So some NIC vendors came up with Jumbo Frames - I think the first may have been Alteon, with their AceNICs and switches. A 9000 byte MTU allows one to send bulk data across the network in ~1/6 the number of trips up and down the protocol stack. But that has problems - in particular, you have to have support for Jumbo Frames from end to end.

So someone, I don't recall who, had the flash of inspiration - What If... the NIC could perform the TCP segmentation on behalf of the stack? When sending a big chunk of data over TCP in one direction, the only things which change from TCP segment to TCP segment are the sequence number and the checksum <insert some handwaving about the IP datagram ID here>. The NIC already knows how to compute the checksum, so let's teach it how to very simply increment the TCP sequence number. Now we can give it A Lot of Data (tm) in one trip down the protocol stack and save even more CPU cycles than Jumbo Frames does. The NIC does have to know a little bit more about the traffic - it has to know that it is TCP, so it knows where the TCP sequence number goes. We also tell it the MSS to use when it is doing the segmentation on our behalf. Thus was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames". (That too is sketched below.)

That worked pretty well for the servers of the time - they tended to send more data than they received. The clients receiving the data didn't need to keep up at 1000 Mbit/s, and the server could be sending to multiple clients.
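To make the CKO bit concrete, here is a minimal sketch in C of the Internet Checksum (RFC 1071) that we are asking the NIC to compute on our behalf - the function name and arguments are mine for illustration, not any driver's actual API:

    /* Minimal sketch of the Internet Checksum (RFC 1071), the
     * calculation CKO moves from the CPU to the NIC.  "buf"/"len"
     * describe the bytes to sum; names here are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    uint16_t internet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        /* Sum the data as 16-bit words in network byte order. */
        while (len > 1) {
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                      /* odd trailing byte, zero-padded */
            sum += (uint32_t)p[0] << 8;

        /* Fold the carries back in (end-around carry). */
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;        /* one's complement of the sum */
    }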
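And in the same handwaving spirit, a toy of what TSO asks the hardware to do - carve one big send into MSS-sized segments, bumping only the sequence number each time. Real hardware would also replicate the TCP/IP headers, fix up lengths and the IP ID, and compute the checksums; everything here is made up for illustration:

    /* Toy TSO: split one large send into MSS-sized TCP segments.
     * The sequence number is the only TCP field that must change
     * from segment to segment (checksums get recomputed, of course). */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    void tso_segment(uint32_t start_seq, size_t total_len, size_t mss)
    {
        uint32_t seq = start_seq;
        size_t remaining = total_len;

        while (remaining > 0) {
            size_t payload = remaining < mss ? remaining : mss;
            /* A real NIC would emit headers + payload here. */
            printf("segment: seq=%u len=%zu\n", (unsigned)seq, payload);
            seq += payload;        /* advance by the bytes consumed */
            remaining -= payload;
        }
    }

    int main(void)
    {
        /* eg a 4344 byte send with a 1448 byte MSS -> 3 segments */
        tso_segment(1000000, 4344, 1448);
        return 0;
    }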
However, then we got another order-of-magnitude bump in link speed, to 10000 Mbit/s. Now people needed/wanted to receive at the higher speeds too. So some 10 Gbit/s NIC vendors came up with the mirror image of TSO and called it LRO - Large Receive Offload. An LRO NIC will coalesce several consecutive TCP segments into one uber-segment and hand that to the host. There are some "issues" with LRO though - for example when a system is acting as a router - so in Linux, and perhaps other stacks, LRO was taken out of the hands of the NIC and given to the stack in the form of GRO - Generic Receive Offload. GRO operates above the NIC/driver, but below IP. It detects the consecutive segments and coalesces them before passing them further up the stack (the basic "are these consecutive?" test is sketched below). It becomes possible to receive data at link-rate over 10 GbE. All is happiness and joy.

OK, so now we have all these "stateless" offloads that know about the basic traffic flow. They are all built on the foundation of CKO. They are all dealing with *un*encapsulated traffic. (They also don't do anything for small packets.)

Now, toss in some encapsulation. Take your pick - in the abstract I suspect it doesn't really matter which, at least for a little while longer. What is arriving at the NIC inbound is no longer a TCP segment in an IP datagram in an Ethernet frame; it is all of that wrapped up in the encapsulation protocol (GRE's modest header is shown below as an example). Unless the NIC knows about the encapsulation protocol, all the NIC knows is that it has some slightly alien packet. It will probably know it is IP, but it won't know more than that. It could, perhaps, simply compute an Internet Checksum across the entire IP datagram and leave it to the driver to fix up. Or it could simply punt and not perform any CKO at all. But CKO is the foundation of the stateless offloads. So, certainly no LRO, and (I think, but could be wrong) no GRO. (At least not until the Linux stack learns how to look beyond the encapsulation headers.)

Similarly, consider the outbound path. We could perhaps change the constants we tell the NIC for doing CKO, but unless it knows about the encapsulation protocol we cannot ask it to do the TCP segmentation of TSO - it would have to replicate not only the TCP and IP headers, but also the headers of the encapsulation protocol. So, there goes TSO.

In essence, using an encapsulation protocol takes us all the way back to the days of 100BT insofar as stateless offloads are concerned. Perhaps to the early days of 1000BT. We do have a bit more CPU grunt these days, but for the last several years that has come primarily in the form of more cores per processor, not in the form of processors with higher and higher frequencies. In broad handwaving terms, single-threaded performance is not growing all that much. If at all.

That is why we now have things like multiple queues per NIC port and Receive Side Scaling (RSS), or Receive Packet Steering/Receive Flow Steering in Linux (or Inbound Packet Scheduling/Thread Optimized Packet Scheduling in HP-UX, etc etc). RSS works by having the NIC compute a hash over selected headers of the arriving packet - perhaps the source and destination MAC addresses, perhaps the source and destination IP addresses, and perhaps the source and destination TCP ports (a toy version of that queue selection is also sketched below). But now the arriving traffic is all wrapped up in this encapsulation protocol that the NIC might not know about. Over what should the NIC compute the hash with which to pick the queue, which then picks the CPU to interrupt? It may just punt and send all the traffic up one queue.
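For the GRO/LRO coalescing, the heart of it is a "same flow, and starts exactly where the held data ends" test. A sketch, with a flow structure I invented for the purpose rather than anything lifted from the Linux sources:

    /* Sketch of the consecutiveness test behind LRO/GRO coalescing.
     * Struct and field names are invented for illustration. */
    #include <stdint.h>
    #include <stdbool.h>

    struct tcp_flow_key {
        uint32_t saddr, daddr;    /* IP source/destination addresses */
        uint16_t sport, dport;    /* TCP source/destination ports */
    };

    struct gro_candidate {
        struct tcp_flow_key key;
        uint32_t seq;             /* first sequence number in segment */
        uint32_t payload_len;
    };

    static bool same_flow(const struct tcp_flow_key *a,
                          const struct tcp_flow_key *b)
    {
        return a->saddr == b->saddr && a->daddr == b->daddr &&
               a->sport == b->sport && a->dport == b->dport;
    }

    /* "next" can be appended to "held" only if it belongs to the same
     * flow and begins exactly where the held data ends (unsigned
     * arithmetic handles sequence-number wrap for free). */
    bool can_coalesce(const struct gro_candidate *held,
                      const struct gro_candidate *next)
    {
        return same_flow(&held->key, &next->key) &&
               next->seq == held->seq + held->payload_len;
    }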
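As for what the encapsulated traffic looks like to a NIC that does not speak the encapsulation, take GRE as the example. The basic header per RFC 2784 is just this (the struct is mine, the field layout is the RFC's):

    /* Basic GRE header, per RFC 2784 (no optional fields present). */
    #include <stdint.h>

    struct gre_base_hdr {
        uint16_t flags_version;   /* C bit, reserved bits, version */
        uint16_t protocol;        /* EtherType of the inner payload */
    };

    /* On the wire:
     *   Ethernet | outer IP | GRE | inner IP | TCP | data
     * An offload engine that only parses as far as the outer IP header
     * never finds the inner TCP segment it would need for CKO/TSO/LRO. */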
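And for RSS, the shape of the queue-selection idea. Real NICs typically use a Toeplitz hash with a programmable key and an indirection table, so treat this trivial mix as illustration only:

    /* Illustrative RSS-style receive queue selection: hash selected
     * header fields, use the hash to pick a queue (and thus a CPU). */
    #include <stdint.h>

    struct five_tuple {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    unsigned int pick_rx_queue(const struct five_tuple *ft,
                               unsigned int nqueues)
    {
        uint32_t h = ft->saddr ^ ft->daddr ^
                     ((uint32_t)ft->sport << 16 | ft->dport) ^ ft->proto;
        h ^= h >> 16;             /* cheap mixing, stand-in for Toeplitz */
        h *= 0x45d9f3b;           /* arbitrary odd multiplier */
        h ^= h >> 16;
        return h % nqueues;
        /* With encapsulation the NIC may only see the outer headers;
         * hash those and every tunneled flow lands on one queue. */
    }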
There are similar sorts of hashes being computed at either end of a bond/aggregate/trunk. And the switches or bonding drivers making those calculations may not know about the encapsulation protocol, so they may not be able to spread traffic across multiple links - the information they used to use is now hidden from them by the encapsulation protocol. That, then, is what I was getting at when talking about NICs peering into GRE.

rick jones
All I want for Christmas is a 32 bit VLAN ID and NICs and switches which understand it... :)