On Tue, 24 Jan 2017 20:09:07 +0000 "Wiles, Keith" <keith.wi...@intel.com> wrote:
> > On Jan 24, 2017, at 12:45 PM, Ananyev, Konstantin <konstantin.anan...@intel.com> wrote:
> > 
> >> On Tuesday, January 24, 2017 2:49 PM, Wiles, Keith <keith.wi...@intel.com> wrote:
> >> 
> >>> On Jan 24, 2017, at 3:33 AM, Ananyev, Konstantin <konstantin.anan...@intel.com> wrote:
> >>> 
> >>>> On Tuesday, January 24, 2017 5:26 AM, Wiles, Keith <keith.wi...@intel.com> wrote:
> >>>> 
> >>>>> On Jan 23, 2017, at 6:43 PM, Ananyev, Konstantin <konstantin.anan...@intel.com> wrote:
> >>>>> 
> >>>>>> On Monday, January 23, 2017 9:53 PM, Wiles, Keith <keith.wi...@intel.com> wrote:
> >>>>>> 
> >>>>>>> On Mon, 23 Jan 2017 21:03:12 +0800
> >>>>>>> Jiayu Hu <jiayu...@intel.com> wrote:
> >>>>>>> 
> >>>>>>>> With the support of hardware segmentation techniques in DPDK, the
> >>>>>>>> networking stack overhead on the send side of applications that
> >>>>>>>> directly leverage DPDK has been greatly reduced. But on the
> >>>>>>>> receive side, large numbers of segmented packets seriously burden
> >>>>>>>> the networking stack of applications. Generic Receive Offload
> >>>>>>>> (GRO) is a widely used method to solve this receive-side issue:
> >>>>>>>> it gains performance by reducing the number of packets the
> >>>>>>>> networking stack has to process. But currently, DPDK doesn't
> >>>>>>>> support GRO. Therefore, we propose to add GRO support to DPDK,
> >>>>>>>> and this RFC explains the basic DPDK GRO design.
> >>>>>>>> 
> >>>>>>>> DPDK GRO is a SW-based packet assembly library, which provides
> >>>>>>>> GRO for a number of protocols. In DPDK GRO, packets are merged
> >>>>>>>> after they are received from drivers and before they are
> >>>>>>>> returned to applications.
> >>>>>>>> 
> >>>>>>>> In DPDK, GRO is a capability of NIC drivers. Whether GRO is
> >>>>>>>> supported, and which GRO types are supported, is up to each NIC
> >>>>>>>> driver; different drivers may support different GRO types. By
> >>>>>>>> default, drivers enable all supported GRO types. Applications
> >>>>>>>> can query the GRO types supported by each driver and control
> >>>>>>>> which GRO types are applied. For example, if ixgbe supports TCP
> >>>>>>>> and UDP GRO but the application only needs TCP GRO, the
> >>>>>>>> application can disable ixgbe UDP GRO.
> >>>>>>>> 
> >>>>>>>> To support GRO, a driver should provide a way to tell
> >>>>>>>> applications what GRO types it supports, and should provide a
> >>>>>>>> GRO function, which is in charge of assembling packets. Since
> >>>>>>>> different drivers may support different GRO types, their GRO
> >>>>>>>> functions may differ. Applications don't need any extra
> >>>>>>>> operations to enable GRO. But if some GRO types are not needed,
> >>>>>>>> applications can use an API, like rte_eth_gro_disable_protocols,
> >>>>>>>> to disable them, and can later re-enable the disabled ones.
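
A sketch of how an application might use the rte_eth_gro_disable_protocols
API named above. The RFC does not define its signature, so the prototype
and the flag names below are only guesses:

	/* Hypothetical GRO type flags and prototype -- not an existing API. */
	#define GRO_TCP_IPV4	(1ULL << 0)
	#define GRO_UDP_IPV4	(1ULL << 1)

	int rte_eth_gro_disable_protocols(uint8_t port_id, uint64_t gro_types);

	/* Keep TCP GRO, but turn off UDP GRO on this port: */
	ret = rte_eth_gro_disable_protocols(port_id, GRO_UDP_IPV4);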
> >>>>>>>> 
> >>>>>>>> The GRO function processes a batch of packets in each
> >>>>>>>> invocation. Which GRO types are applied is up to the
> >>>>>>>> application, and the number of packets to merge depends on the
> >>>>>>>> application and on network conditions. Specifically, the
> >>>>>>>> application determines the maximum number of packets the GRO
> >>>>>>>> function may process, but how many are actually processed
> >>>>>>>> depends on how many packets are available to receive. For
> >>>>>>>> example, if the receive-side application asks the GRO function
> >>>>>>>> to process 64 packets but the sender only sends 40, the GRO
> >>>>>>>> function returns after processing 40 packets. To reassemble the
> >>>>>>>> given packets, the GRO function performs an "assembly
> >>>>>>>> procedure" on each packet. Supposing the GRO function is going
> >>>>>>>> to process packetX, it does the following two things:
> >>>>>>>> 	a. Find an L4 assembly function according to the packet type
> >>>>>>>> 	   of packetX. An L4 assembly function is in charge of
> >>>>>>>> 	   merging packets of a specific type. For example, the TCPv4
> >>>>>>>> 	   assembly function merges packets whose L3 is IPv4 and
> >>>>>>>> 	   whose L4 is TCP. Each L4 assembly function has a packet
> >>>>>>>> 	   array, which keeps the packets that could not be merged.
> >>>>>>>> 	   Initially, the packet array is empty;
> >>>>>>>> 	b. The L4 assembly function traverses its own packet array to
> >>>>>>>> 	   find a mergeable packet (comparing Ethernet, IP and L4
> >>>>>>>> 	   header fields). If it finds one, it merges it with packetX
> >>>>>>>> 	   by chaining them together; if not, it allocates a new
> >>>>>>>> 	   array element to store packetX and updates the element
> >>>>>>>> 	   count of the array.
> >>>>>>>> After performing the assembly procedure on all packets, the GRO
> >>>>>>>> function combines the results of all packet arrays and returns
> >>>>>>>> the resulting packets to the application.
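
In code, the per-packet dispatch described above might look roughly like
this. All gro_* names are illustrative, not an existing API; only struct
rte_mbuf and its packet_type field are real DPDK:

	#include <rte_mbuf.h>

	struct gro_pkt_array;	/* per-L4-type array of stored packets */

	/* Merges pkt into an entry of arr, or stores it as a new entry. */
	typedef void (*gro_assemble_t)(struct rte_mbuf *pkt,
				       struct gro_pkt_array *arr);

	gro_assemble_t gro_lookup_fn(uint32_t ptype);		/* step a */
	struct gro_pkt_array *gro_fn_array(gro_assemble_t f);	/* f's array */
	uint16_t gro_flush(struct rte_mbuf **out);	/* combine all arrays */

	static uint16_t
	gro_reassemble_burst(struct rte_mbuf **pkts, uint16_t nb_pkts)
	{
		uint16_t i;

		for (i = 0; i < nb_pkts; i++) {
			/* a. pick the L4 assembly function by packet type */
			gro_assemble_t f = gro_lookup_fn(pkts[i]->packet_type);

			/* b. merge into a stored packet, or store as new */
			if (f != NULL)
				f(pkts[i], gro_fn_array(f));
		}
		/* hand merged (and unmergeable) packets back to the caller */
		return gro_flush(pkts);
	}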
> >>>>>>>> 
> >>>>>>>> There are many ways to implement the above design in DPDK. One
> >>>>>>>> of them is:
> >>>>>>>> 	a. Drivers tell applications what GRO types are supported via
> >>>>>>>> 	   dev->dev_ops->dev_infos_get;
> >>>>>>>> 	b. At initialization, drivers register their own GRO function
> >>>>>>>> 	   as an RX callback, which is invoked inside
> >>>>>>>> 	   rte_eth_rx_burst. The GRO function should be named like
> >>>>>>>> 	   xxx_gro_receive (e.g. ixgbe_gro_receive). Currently, the
> >>>>>>>> 	   RX callback can only process the packets returned by
> >>>>>>>> 	   dev->rx_pkt_burst each time, and the maximum number of
> >>>>>>>> 	   packets dev->rx_pkt_burst returns is determined by each
> >>>>>>>> 	   driver; applications cannot influence it. Therefore, to
> >>>>>>>> 	   implement the above GRO design, we have to modify the
> >>>>>>>> 	   current RX implementation to make the driver return as
> >>>>>>>> 	   many packets as possible, until the packet number meets
> >>>>>>>> 	   the application's demand or there are no more packets to
> >>>>>>>> 	   receive. This modification is also proposed in patch:
> >>>>>>>> 	   http://dpdk.org/ml/archives/dev/2017-January/055887.html;
> >>>>>>>> 	c. The GRO types to apply and the maximum number of packets
> >>>>>>>> 	   to merge are passed by resetting the RX callback
> >>>>>>>> 	   parameters, which can be done by invoking
> >>>>>>>> 	   rte_eth_rx_callback;
> >>>>>>>> 	d. Simply, we could just store packet addresses in the packet
> >>>>>>>> 	   array and fetch a packet via its address whenever we check
> >>>>>>>> 	   an element. However, this simple design is not efficient
> >>>>>>>> 	   enough: every check of a packet costs a pointer
> >>>>>>>> 	   dereference, which tends to cause a cache miss. A better
> >>>>>>>> 	   way is to store some rules in each array element. The
> >>>>>>>> 	   rules must be prerequisites for merging two packets, like
> >>>>>>>> 	   the sequence number of TCP packets. We first compare the
> >>>>>>>> 	   rules, then retrieve the packet only if the rules match.
> >>>>>>>> 	   If storing the full rules makes the packet array structure
> >>>>>>>> 	   cache-unfriendly, we can store a fixed-length signature of
> >>>>>>>> 	   the rules instead; for example, a signature calculated by
> >>>>>>>> 	   XORing the IP addresses. Both designs avoid unnecessary
> >>>>>>>> 	   pointer dereferences.
> >>>>>>> 
> >>>>>>> Since DPDK does burst mode already, GRO is a lot less relevant.
> >>>>>>> GRO in Linux was invented because there is no burst mode in the
> >>>>>>> receive API.
> >>>>>>> 
> >>>>>>> If you look at VPP in FD.io you will see they already do
> >>>>>>> aggregation and steering at a higher level in the stack.
> >>>>>>> 
> >>>>>>> The point of GRO is that it is generic: no driver changes are
> >>>>>>> necessary. Your proposal would add a lot of overhead, and would
> >>>>>>> force drivers to be aware of higher-level flows.
> >>>>>> 
> >>>>>> NACK
> >>>>>> 
> >>>>>> The design is not super clear to me here, and we need to
> >>>>>> understand the impact on DPDK, on performance and on the
> >>>>>> application. I would like a design that is transparent to the
> >>>>>> application and has as little impact on performance as possible.
> >>>>>> 
> >>>>>> Let's discuss this, as I am not sure my previous concerns were
> >>>>>> addressed in this RFC.
> >>>>> 
> >>>>> I would agree that the design looks overcomplicated and strange:
> >>>>> if GRO can be (and is supposed to be) done fully in SW, why do we
> >>>>> need to modify PMDs at all? Why can't it just be a standalone DPDK
> >>>>> library that users can call at their convenience?
> >>>>> I'd suggest starting with a simple and widespread case (TCP?) and
> >>>>> implementing a library for it first: something similar to what we
> >>>>> have for IP reassembly.
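
The IP reassembly analogy suggests a shape like the following for such a
standalone library. All rte_gro_* names here are hypothetical; only
librte_ip_frag exists today:

	#include <rte_mbuf.h>

	struct rte_gro_tbl;	/* flow table keyed by addresses and ports */

	struct rte_gro_tbl *rte_gro_tbl_create(uint32_t max_flows,
					       int socket_id);

	/* Try to merge mb into a tracked flow. Returns a merged packet
	 * when one is ready to hand up, or NULL if mb was absorbed into
	 * the table, much like rte_ipv4_frag_reassemble_packet(). */
	struct rte_mbuf *rte_gro_tcp4_reassemble(struct rte_gro_tbl *tbl,
						 struct rte_mbuf *mb);

The application would call it explicitly on each received burst, exactly
as it calls the IP reassembly library today, with no PMD changes.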
> >>>> 
> >>>> The reason this should not be a library the application calls is to
> >>>> allow a transparent design for both HW and SW support of this
> >>>> feature. With the SW version, the application should not need to
> >>>> know (other than for performance) that GRO is being done for this
> >>>> port.
> >>> 
> >>> Why is that?
> >>> Let's say we have an IP reassembly library that is called explicitly
> >>> by the application. I think for L4 grouping we can do the same.
> >>> After all, it is a pure SW feature, so to me it makes sense to let
> >>> the application decide when and where to call it.
> >>> Again, it would allow people to develop and use it without any
> >>> modifications to current PMDs.
> >> 
> >> I guess I did not make it clear: we need to support HW and this SW
> >> version transparently, just as we handle other features in HW/SW
> >> under a generic API for DPDK.
> > 
> > Ok, I probably wasn't very clear too.
> > What I meant:
> > Let's try to implement GRO (in SW) as a standalone DPDK library, with
> > a clean & simple interface, and see how fast and useful it would be.
> > We can refer to it as step 1.
> > When (if) we have step 1 in place, then we can start thinking about
> > adding a combined HW/SW solution for it (step 2). I think at that
> > stage it would be much clearer whether there is any point in it at
> > all, and if yes, how it should be done:
> > - changes at the rte_ethdev layer, the PMD layer, or both
> > - whether changes to the rte_ethdev API would be needed, and if yes,
> >   which ones,
> > etc.
> > 
> > From my perspective, without step 1 in place, there is not much point
> > in approaching step 2.
> 
> Currently I believe they have a SW library version of the code, but I
> think we need to look at the design in that form. At this time the
> current design and code are not what I would expect for the transparent
> version: too many interactions with the application, and separate Rx/Tx
> functions were being used (if I remember correctly).
> 
> > BTW, any particular HW you have in mind?
> > Currently, as far as I can see, LRO (HW) is supported only by ixgbe
> > and probably by virtual PMDs (virtio/vmxnet3). Even for ixgbe there
> > are plenty of limitations: SRIOV mode should be off, HW CRC stripping
> > should be off, etc.
> > So my guess is that right now step 1 is much more useful and feasible.
> > 
> >>>> As I was told, the Linux kernel hides this feature and makes it
> >>>> transparent.
> >>> 
> >>> Yes, but DPDK does a lot of things in a different way.
> >>> So it doesn't look like a compelling reason to me :)
> >> 
> >> Just looking at different options here, and it is a compelling reason
> >> to me, as it ensures the design can be transparent to the
> >> application. Having the application in an NFV decide on HW or SW or
> >> both is not a good place to put that logic IMO.
> > 
> > Actually, could you provide an example of a Linux NIC driver that uses
> > HW offloads (and which ones) to implement GRO? I presume some might
> > use HW-generated hashes, but apart from that, where does HW perform
> > the actual packet grouping? From what I've seen, the Intel ones rely
> > on a SW implementation for that.
> > But I am not a Linux/GRO expert, so feel free to correct me here.
> > Konstantin
> 
> Regards,
> Keith

Linux uses a push (rather than DPDK's pull) model for packet receiving.
The Linux driver pushes packets into GRO by calling napi_gro_receive.
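Roughly, the push model in a Linux driver's NAPI poll routine looks like
this (abridged; drv_next_rx_skb is a stand-in for the driver's real
RX-descriptor handling):

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	static struct sk_buff *drv_next_rx_skb(struct napi_struct *napi);

	static int drv_poll(struct napi_struct *napi, int budget)
	{
		int done = 0;

		while (done < budget) {
			struct sk_buff *skb = drv_next_rx_skb(napi);

			if (!skb)
				break;
			/* push each packet up one at a time; GRO merges it
			 * before the rest of the stack ever sees it */
			napi_gro_receive(napi, skb);
			done++;
		}
		if (done < budget)
			napi_complete_done(napi, done);
		return done;
	}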
Since DPDK is a pull model, the API would be simpler. It could be as
simple as:

	nb = rte_eth_rx_burst(port, 0, rx_pkts, N);
	nb = rte_rx_gro(port, rx_pkts, gro_pkts, nb);

I agree with others: look at the IP reassembly library as an example.

Also, GRO does not make sense for applications that already do the same
vector flow processing, like VPP, which is one reason it should be
optional.
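
Spelled out as a full receive loop, with rte_rx_gro still hypothetical
(rte_eth_rx_burst is the only real API here, and app_process stands in
for whatever the application does next):

	#include <rte_ethdev.h>
	#include <rte_mbuf.h>

	#define BURST 64

	/* Hypothetical: an explicit, optional GRO pass over a burst. */
	uint16_t rte_rx_gro(uint8_t port, struct rte_mbuf **rx_pkts,
			    struct rte_mbuf **gro_pkts, uint16_t nb);

	static void rx_loop(uint8_t port, uint16_t queue)
	{
		struct rte_mbuf *rx_pkts[BURST];
		struct rte_mbuf *gro_pkts[BURST];
		uint16_t nb;

		for (;;) {
			nb = rte_eth_rx_burst(port, queue, rx_pkts, BURST);
			if (nb == 0)
				continue;
			/* merge before handing packets to the app's stack;
			 * an app like VPP would simply skip this call */
			nb = rte_rx_gro(port, rx_pkts, gro_pkts, nb);
			app_process(gro_pkts, nb);
		}
	}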