On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.to...@gmail.com> wrote:
> From: Björn Töpel <bjorn.to...@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics.

Overall, this looks really nice!

> In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path.

Please remove references to RFC when resending to bpf-next.

> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of a
> number of equally sized frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data.
> References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device,

The setup involves a lot of system calls. You may want to require the
caller to take these in a well-defined order, and the same for
destruction. Arbitrary order leads to a state explosion in paths
through the code. With AF_PACKET we've had to fix quite a few bugs due
to unexpected states of the socket, e.g., on teardown, and it is too
late now to restrict the number of states.
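
To make the ordering suggestion concrete, here is a rough sketch of
what I have in mind. It is plain user-space C, not kernel code, and
every name in it (xsk_state, xsk_op, xsk_step) is hypothetical, as is
the particular order it enforces (UMEM registration, then rings, then
bind). The point is only that each call is checked against a single
state variable, so every other ordering is rejected up front instead
of producing a new code path.

```c
#include <errno.h>

/* Hypothetical setup states; none of these names come from the patch set. */
enum xsk_state {
	XSK_NEW,	/* socket() returned, nothing configured yet */
	XSK_UMEM_REGD,	/* UMEM registered (XDP_UMEM_REG) */
	XSK_RINGS_SET,	/* at least one of the RX/TX rings is sized */
	XSK_BOUND,	/* bind() done, socket is live */
	XSK_CLOSED,	/* torn down, no further operations allowed */
};

/* Hypothetical operations, one per setup/teardown call. */
enum xsk_op { OP_UMEM_REG, OP_RING, OP_BIND, OP_CLOSE };

/*
 * Advance *st if op is legal in the current state and return 0;
 * otherwise leave *st untouched and return -EINVAL.
 */
static int xsk_step(enum xsk_state *st, enum xsk_op op)
{
	switch (op) {
	case OP_UMEM_REG:
		/* assumed rule: register the UMEM exactly once, first */
		if (*st != XSK_NEW)
			return -EINVAL;
		*st = XSK_UMEM_REGD;
		return 0;
	case OP_RING:
		/* RX and TX rings may both be set, but only before bind() */
		if (*st != XSK_UMEM_REGD && *st != XSK_RINGS_SET)
			return -EINVAL;
		*st = XSK_RINGS_SET;
		return 0;
	case OP_BIND:
		/* needs at least one ring, per the "mandatory" rule above */
		if (*st != XSK_RINGS_SET)
			return -EINVAL;
		*st = XSK_BOUND;
		return 0;
	case OP_CLOSE:
		/* teardown is allowed from any state, but only once */
		if (*st == XSK_CLOSED)
			return -EINVAL;
		*st = XSK_CLOSED;
		return 0;
	}
	return -EINVAL;
}
```

With something along these lines, a bind() before any ring is set up,
a second XDP_UMEM_REG, or a ring setsockopt after bind() would all
fail with -EINVAL, and teardown has exactly one entry point per state.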