> -----Original Message----- > From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com] > Sent: Friday, November 3, 2017 5:35 AM > To: Björn Töpel <bjorn.to...@gmail.com> > Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H > <alexander.h.du...@intel.com>; Alexander Duyck > <alexander.du...@gmail.com>; John Fastabend > <john.fastab...@gmail.com>; Alexei Starovoitov <a...@fb.com>; Jesper > Dangaard Brouer <bro...@redhat.com>; michael.lundkv...@ericsson.com; > ravineet.si...@ericsson.com; Daniel Borkmann <dan...@iogearbox.net>; > Network Development <netdev@vger.kernel.org>; Topel, Bjorn > <bjorn.to...@intel.com>; Brandeburg, Jesse > <jesse.brandeb...@intel.com>; Singhai, Anjali <anjali.sing...@intel.com>; > Rosen, Rami <rami.ro...@intel.com>; Shaw, Jeffrey B > <jeffrey.b.s...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; Zhang, > Qi Z <qi.z.zh...@intel.com> > Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support > > On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com> > wrote: > > From: Björn Töpel <bjorn.to...@intel.com> > > > > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are > > optimized for high performance packet processing and zero-copy > > semantics. Throughput improvements can be up to 40x compared to V2 > and > > V3 for the micro benchmarks included. Would be great to get your > > feedback on it. > > > > The main difference between V4 and V2/V3 is that TX and RX descriptors > > are separated from packet buffers. > > Cool feature. I'm looking forward to the netdev talk. Aside from the inline > comments in the patches, a few architecture questions.
Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up and discuss your comments in further detail. Some initial thoughts below. > Is TX support needed? Existing PACKET_TX_RING already sends out packets > without copying directly from the tx_ring. Indirection through a descriptor > ring is not helpful on TX if all packets still have to come from a > pre-registered > packet pool. The patch set adds a lot of tx-only code and is complex enough > without it. That is correct, but what if the packet you are going to transmit came in from the receive path and is already in the packet buffer? This might happen if the application is examining/sniffing packets then sending them out, or doing some modification to them. In that case we avoid a copy in V4 since the packet is already in the packet buffer. With V2 and V3, a copy from the RX ring to the TX ring would be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases performance quite a lot. > Can you use the existing PACKET_V2 format for the packet pool? The > v4 format is nearly the same as V2. Using the same version might avoid some > code duplication and simplify upgrading existing legacy code. > Instead of continuing to add new versions whose behavior is implicit, > perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2. Interesting idea that I think is worth thinking more about. One problem though with the V2 ring format model, and the current V4 format too by the way, when applied to a user-space allocated memory is that they are symmetric, i.e. that user space and kernel space have to produce and consume the same amount of entries (within the length of the descriptor area). User space sends down a buffer entry that the kernel fills in for RX for example. Symmetric queues do not work when you have a shared packet buffer between two processes. (This is not a requirement, but someone might do a mmap with MAP_SHARED for the packet buffer and then fork of a child that then inherits this packet buffer.) One of the processes might just receive packets, while the other one is transmitting. Or if you have a veth link pair between two processes and they have been set up to share packet buffer area. With a symmetric queue you have to copy even if they share the same packet buffer, but with an asymmetric queue, you do not and the driver only needs to copy the packet buffer id between the TX desc ring of the sender to the RX desc ring of the receiver, not the data. I think this gives an indication that we need a new structure. Anyway, I like your idea and I think it is worth thinking more about it. Let us have a discussion about this at Netdev, if you are there. > Finally, is it necessary to define a new descriptor ring format? Same for the > packet array and frame set. The kernel already has a few, such as virtio for > the first, skb_array/ptr_ring, even linux list for the second. These > containers > add a lot of new boilerplate code. If new formats are absolutely necessary, at > least we should consider making them generic (like skb_array and ptr_ring). > But I'd like to understand first why, e.g., virtio cannot be used. Agree with you. Good if we can use something existing. The descriptor format of V4 was based on one of the first Virtio 1.1 proposal by Michael Tsirkin (tools/virtio/ringtest/ring.c). Then we have diverged somewhat due to performance reasons and Virtio 1.1 has done the same but in another direction. We should take a look at the latest Virtio 1.1 proposal again and see what it offers. The reason we did not go with Virtio 0.9 was for performance. Too many indirections, something that the people behind Virtio 1.1 had identified too. With ptr_ring, how do we deal with the pointers in the structure as this now has to go to user-space? In any way, we would like to have a ring structure that is asymmetric for the reasons above. Other than that, we would not mind using anything as long as it is fast. If it already exists, perfect. /Magnus