On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.to...@gmail.com> wrote:
> From: Björn Töpel <bjorn.to...@intel.com>
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow-up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All three of these areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example, all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
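
To make the flow described above concrete, two small sketches follow. They
are based only on this cover letter; the real constants, struct layouts and
helper declaration live in include/uapi/linux/if_xdp.h,
samples/bpf/xdpsock_kern.c and bpf_helpers.h in the patch set, so every name
below that the text does not mention is a placeholder.

The XDP program side is the one-liner quoted above, wrapped in the usual
boilerplate (section name and includes assumed):

  /* Redirect every packet on the bound queue to the AF_XDP socket. */
  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("xdp")
  int xdpsk_prog(struct xdp_md *ctx)
  {
          return bpf_xdpsk_redirect();
  }

  char _license[] SEC("license") = "GPL";

The user-space side then registers memory and rings and binds to a
device/queue pair, roughly like this (error handling omitted; all struct
layouts and constant values are invented for illustration):

  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <net/if.h>

  /* Placeholders standing in for the RFC's if_xdp.h/socket.h additions. */
  #define AF_XDP      44    /* illustrative value only */
  #define SOL_XDP     283   /* illustrative value only */
  #define XDP_MEM_REG 1     /* illustrative value only */
  #define XDP_RX_RING 2     /* illustrative value only */
  #define XDP_TX_RING 3     /* illustrative value only */

  struct xdp_mem_reg {              /* hypothetical layout */
          void *addr;               /* start of the packet buffer area */
          size_t len;               /* total size of the area */
          unsigned int frame_size;  /* size of each packet buffer */
  };

  struct sockaddr_xdp {             /* hypothetical layout */
          unsigned short sxdp_family;
          unsigned int sxdp_ifindex;
          unsigned int sxdp_queue_id;
  };

  int main(void)
  {
          int fd = socket(AF_XDP, SOCK_RAW, 0);   /* socket type assumed */
          int entries = 1024;
          size_t size = 1024 * 2048;
          struct xdp_mem_reg mr;
          struct sockaddr_xdp addr;
          void *bufs;

          /* 1. Allocate the packet buffer area and register it. */
          posix_memalign(&bufs, getpagesize(), size);
          mr.addr = bufs;
          mr.len = size;
          mr.frame_size = 2048;
          setsockopt(fd, SOL_XDP, XDP_MEM_REG, &mr, sizeof(mr));

          /* 2. Create the RX and TX descriptor rings shared with the kernel. */
          setsockopt(fd, SOL_XDP, XDP_RX_RING, &entries, sizeof(entries));
          setsockopt(fd, SOL_XDP, XDP_TX_RING, &entries, sizeof(entries));

          /* 3. Bind to a device and queue id; traffic flows only after this
           *    (and after the XDP program above is attached to the device).
           */
          memset(&addr, 0, sizeof(addr));
          addr.sxdp_family = AF_XDP;
          addr.sxdp_ifindex = if_nametoindex("p3p2");
          addr.sxdp_queue_id = 16;
          bind(fd, (struct sockaddr *)&addr, sizeof(addr));

          return 0;
  }
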
>
> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator, as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or if XDP_SKB is explicitly chosen when loading the
> XDP program, XDP_SKB mode is employed. It uses SKBs together with the
> generic XDP support and copies out the data to user space; a fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user-space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these, and it
> is the kernel driver's responsibility to translate these HW-specific
> descriptors to the HW-agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode, though, requires some
> extensions to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver-specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is, in the RX case, mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we ran into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the XDP socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and when they have been transmitted it uses another
> callback (tx_completion) to signal completion of packets. These
> callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
> command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
> and thus does not clash with the XDP_REDIRECT use of
> ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
> (without ZC) is currently not supported for TX. Please have a look at
> the challenges section for further discussion.
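
To make the TX plumbing above a bit easier to picture, here is a
declaration-level sketch. Only the names ndo_xdp_xmit_xsk, get_tx_packet,
tx_completion and XDP_REGISTER_XSK come from the text; the struct, the
parameters and the signatures are guesses, and the real definitions are in
the include/linux/netdevice.h and net/xdp changes listed in the diffstat
below.

  #include <linux/netdevice.h>

  /* Callbacks the XDP socket code hands to the driver when a socket is
   * registered for a dev/queue pair via ndo_bpf with the new
   * XDP_REGISTER_XSK command. Signatures are illustrative only.
   */
  struct xsk_tx_ops {
          /* Driver pulls the next packet to send from the socket's TX
           * ring; returns false when the ring is empty.
           */
          bool (*get_tx_packet)(struct net_device *dev, u32 queue_id,
                                dma_addr_t *dma, void **data, u32 *len);
          /* Driver reports that 'npackets' previously pulled packets have
           * been transmitted, so their buffers can be completed back to
           * user space.
           */
          void (*tx_completion)(struct net_device *dev, u32 queue_id,
                                u32 npackets);
  };

  /* The new NDO: asynchronous and bound to an explicit queue id, used
   * only by the XDP socket code, so it does not collide with the
   * XDP_REDIRECT path's use of ndo_xdp_xmit.
   */
  int (*ndo_xdp_xmit_xsk)(struct net_device *dev, u32 queue_id);

The open questions in the challenges section further down are precisely
about whether this extra NDO and these callbacks can be folded back into
ndo_xdp_xmit/ndo_xdp_flush.
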
>
> The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
> so the user needs to steer the traffic to the zero-copy enabled queue
> pair. Which queue to use is up to the user.
>
> For an untrusted application, HW packet steering to a specific queue
> pair (the one associated with the application) is a requirement, as
> the application would otherwise be able to see other user space
> processes' packets. If the HW cannot support the required packet
> steering, XDP_DRV or XDP_SKB mode has to be used, as these modes do
> not expose the NIC's packet buffer to user space; the packets are
> copied into user space from the NIC's packet buffer in the kernel.
>
> There is an xdpsock benchmarking/test application included. Say that
> you would like your UDP traffic from port 4242 to end up in queue 16,
> which we will enable AF_XDP on. Here, we use ethtool for this:
>
> ethtool -N p3p2 rx-flow-hash udp4 fn
> ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>         action 16
>
> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>
> samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual-socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores, which gives a total of 28, but only two cores are used in these
> experiments: one for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192 MB; with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> XDP baseline numbers without this RFC:
> xdp_rxq_info --action XDP_DROP   31.3 Mpps
> xdp_rxq_info --action XDP_TX     16.7 Mpps
>
> XDP performance with this RFC, i.e. with the buffer allocator:
> XDP_DROP   21.0 Mpps
> XDP_TX     11.9 Mpps
>
> AF_PACKET V4 performance from previous RFC on 4.14-rc7 (all in Mpps):
> Benchmark   V2     V3     V4     V4+ZC
> rxdrop      0.67   0.73   0.74   33.7
> txpush      0.98   0.98   0.91   19.6
> l2fwd       0.66   0.71   0.67   15.5
>
> AF_XDP performance (all in Mpps):
> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV_ZC
> rxdrop      3.3       11.6      16.9
> txpush      2.2       NA*       21.8
> l2fwd       1.7       NA*       10.4

Hi Bjorn,

This is very impressive work, thank you for contributing it! For these
benchmarks, how are the AF_PACKET and AF_XDP numbers to be compared? For
instance, is rxdrop comparable to XDP_DROP and "xdp_rxq_info --action
XDP_DROP"? Given your explanation below, I believe they are, but it might
be better to make that clear up front.

Tom

> * NA since there is no XDP_DRV mode (without ZC) for TX in this RFC,
>   see challenges below.
>
> If we start by comparing XDP_SKB performance with copy mode in
> AF_PACKET V4, we can see that AF_XDP delivers 3-5 times the
> throughput, which is positive. We are also happy with the XDP_DRV
> performance that provides 11.6 Mpps for rxdrop and should work on any
> driver implementing full XDP support. Now to the problematic part:
> XDP_DRV_ZC. The txpush (TX only) benchmark shows decent results at
> 21.8 Mpps and is better than it was for V4, even though we have spent
> no time optimizing the code in AF_XDP. (We did that in AF_PACKET V4.)
> But the RX performance is sliced in half, which is not good. The
> reason for this is, for the most part, the new buffer allocator, which
> is used for RX ZC only (at this point, see the todo section). If you
> take a look at the XDP baseline numbers, introducing the buffer pool
> allocator drops the performance by around 30% or 10 Mpps, which is
> obviously not acceptable. We clearly need to give this code some
> overdue performance love. But the outstanding question is how much
> overhead it will produce in the end and whether this will be
> acceptable. Another thing to note is that V4 provided 33.7 Mpps for
> rxdrop, but with AF_XDP we are quite unlikely to get above the
> XDP_DROP number of 31.3, since we are reusing the XDP infrastructure
> and driver code on the RX side. So in the end, the AF_XDP XDP_DRV_ZC
> numbers will likely be lower than the V4 ZC numbers.
>
> We based this patch set on net-next commit 91e6dd828425 ("ipmr: Fix
> ptrdiff_t print formatting").
>
> Challenges: areas we would really appreciate your help on and that we
> are having substantial problems with.
>
> * We would like to, if possible, use ndo_xdp_xmit and ndo_xdp_flush
>   instead of introducing another NDO in the form of
>   ndo_xdp_xmit_xsk. The first reason we have not been able to
>   accomplish this is that if both paths use ndo_xdp_xmit, they will
>   create a race, as ndo_xdp_xmit currently does not contain any
>   locking. How to implement some type of mutual exclusion here without
>   resorting to slowing down the NDO with a lock? The second problem is
>   that the XDP_REDIRECT code implicitly assumes that core id = queue
>   id. AF_XDP, on the other hand, explicitly specifies a queue id that
>   has nothing to do with the core id (the application can run on any
>   core). How to reconcile these two views in one NDO? If these two
>   problems can be solved, then we would still need to introduce a
>   completion callback and a get_packet callback, but this seems to be
>   less challenging. This would also make it possible to run TX in the
>   XDP_DRV mode (with the default page allocator).
>
> * What should the buffer allocator look like and how do we make it
>   generic enough so it can be used by all NIC vendors? It would be
>   great if you could take a look at it and come up with
>   suggestions. As you can see from the change log, it took some effort
>   to rewire the i40e code to use the buff_pool, and we expect the same
>   to be true for many other NICs. Ideas on how to introduce multiple
>   allocators into XDP in a less intrusive way would be highly
>   appreciated. Another question is how to create a buffer pool that
>   gives rise to very little overhead. We do not know if the current
>   one can be optimized to have an acceptable overhead, as we have not
>   started any optimization effort. But we will give it a try during
>   the next week or so to see where it leads.
>
> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know it has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free? Hopefully, the same solution can be used here
>   as for the first problem item in this section.
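
On the question above of what the buffer allocator should look like: the
concrete proposal is include/linux/buff_pool.h in this series (136 lines
per the diffstat). Just to anchor the discussion, an interface of roughly
this shape -- all names and signatures below are illustrative, not taken
from the patch -- would let a driver post RX buffers without knowing
whether they come from the page allocator, the user-space zero-copy area
or a driver-internal recycler:

  #include <linux/types.h>

  struct buff_pool;

  /* Illustrative ops table; the driver funnels every RX buffer through
   * these so the backing allocator (page allocator for XDP_DRV,
   * user-space zero-copy area for XDP_DRV_ZC, or a driver cache) is
   * interchangeable.
   */
  struct buff_pool_ops {
          int (*alloc)(struct buff_pool *pool, unsigned long *handle);
          void (*free)(struct buff_pool *pool, unsigned long handle);
          /* What the driver needs to post an RX descriptor for the buffer
           * and to build an xdp_buff when the packet arrives.
           */
          dma_addr_t (*buf_dma)(struct buff_pool *pool, unsigned long handle);
          void *(*buf_data)(struct buff_pool *pool, unsigned long handle);
          unsigned int (*buf_size)(struct buff_pool *pool);
  };

  struct buff_pool {
          void *pool_private;               /* allocator-specific state */
          const struct buff_pool_ops *ops;  /* one of the backends above */
  };

The per-buffer indirect calls in a scheme like this are presumably part of
the overhead discussed above.
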
>
> Caveats with this RFC. In contrast to the last section, we believe we
> have solutions for these, but we did not have time to fix them. We
> chose to show you all the code sooner rather than later, even though
> everything does not work. Sorry.
>
> * This RFC is more immature (read: has more bugs) than the AF_PACKET
>   V4 RFC. Some known ones are mentioned here, others are unknown.
>
> * We have done absolutely no optimization to this RFC. There is
>   (hopefully) some substantial low-hanging fruit that we could fix
>   once we get to this, to improve XDP_DRV_ZC performance to levels
>   that we are not ashamed of and also bring the i40e driver back to
>   the same performance levels it had before our changes, which is a
>   must.
>
> * There is a race in the TX XSK clean-up code in the i40e driver that
>   triggers a WARN_ON_ONCE. Clearly a bug that needs to be fixed. It
>   can be triggered by performing ifdown/ifup when the application is
>   running, or by changing the number of queues of the device
>   underneath the hood of the application. As a workaround, please
>   refrain from doing these two things without restarting the
>   application, as not all buffers will be returned in the TX
>   path. This bug can also be triggered when killing the application,
>   but it has no negative effect in this case as the process will never
>   execute again.
>
> * Before this RFC, ndo_xdp_xmit triggered by an XDP_REDIRECT to a NIC
>   never modified the page count, so the redirect code could assume
>   that the page would still be valid after the NDO call. With the
>   introduction of the xsk_rcv path that is called as a result of an
>   XDP_REDIRECT to an AF_XDP socket, the page count will be decreased
>   if the page is copied out to user space, since we have no use for it
>   anymore. Our somewhat blunt solution to this is to make sure in the
>   i40e driver that the refcount is never under two. Note, though, that
>   with the introduction of the buffer pool, this problem
>   disappears. This also means that XDP_DRV will not work out of the
>   box with a Niantic NIC, since it also needs this modification to
>   work. One question that we have is: what should the semantics of
>   ndo_xdp_xmit be? Can we always assume that the page count will never
>   be changed by all possible netdevices that implement this NDO, or
>   should we remove this assumption to gain more device implementation
>   flexibility?
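
Regarding the ndo_xdp_xmit semantics question just above (and the
free/completion question in the challenges section): purely to picture the
kind of hook being asked about -- nothing below is from the series, and
every name is invented -- the redirect target could hand the transmitting
driver a frame with an owner-supplied destructor instead of letting the
driver assume page ownership:

  #include <linux/netdevice.h>

  /* Invented for illustration only: a frame whose owner decides how it is
   * freed. The transmitting driver would call the destructor on
   * completion (or error) instead of page_frag_free()/put_page(), so a
   * zero-copy buffer could be returned to its pool rather than hitting a
   * page free, and the page-count assumption would no longer matter.
   */
  struct xdp_xmit_frame {
          void *data;
          u16 len;
          void (*destructor)(struct xdp_xmit_frame *frame);
          void *owner;   /* e.g. the originating buff_pool or XDP socket */
  };

  int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_xmit_frame *frame);
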
>
> To do:
>
> * Optimize performance. No optimization whatsoever was performed on
>   this RFC, in contrast to the previous one for AF_PACKET V4.
>
> * Kernel load module support.
>
> * Polling has not been implemented yet.
>
> * Optimize the user space sample application. It is simple but naive
>   at this point. The one for AF_PACKET V4 had a number of
>   optimizations that have not been introduced in the AF_XDP version.
>
> * Implement a way to pick the XDP_DRV mode even if XDP_DRV_ZC is
>   available. Would be nice to have for the sample application too.
>
> * Introduce a notifier chain for queue changes (caused by ethtool, for
>   example). This would get rid of the error callback that we have at
>   this point.
>
> * Use one NAPI context for RX and another one for TX in i40e. This
>   would make it possible to run RX on one core and TX on another for
>   better performance. Today, they need to share a single core since
>   they share a NAPI context.
>
> * Get rid of packet arrays (PA) and convert them to the buffer pool
>   allocator by transferring the necessary PA functionality into the
>   buffer pool. This has only been done for RX in ZC mode, while all
>   the other modes are still using packet arrays. Clearly, having two
>   structures with largely the same information is not a good thing.
>
> * Support for AF_XDP sockets without an XDP program loaded. In this
>   case all the traffic on a queue should go up to user space.
>
> * Support shared packet buffers.
>
> * Support for packets spanning multiple frames.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (16):
>   xsk: AF_XDP sockets buildable skeleton
>   xsk: add user memory registration sockopt
>   xsk: added XDP_{R,T}X_RING sockopt and supporting structures
>   bpf: added bpf_xdpsk_redirect
>   net: wire up xsk support in the XDP_REDIRECT path
>   i40e: add support for XDP_REDIRECT
>   samples/bpf: added xdpsock program
>   xsk: add iterator functions to xsk_ring
>   i40e: introduce external allocator support
>   i40e: implemented page recycling buff_pool
>   i40e: start using recycling buff_pool
>   i40e: separated buff_pool interface from i40e implementaion
>   xsk: introduce xsk_buff_pool
>   xdp: added buff_pool support to struct xdp_buff
>   xsk: add support for zero copy Rx
>   i40e: implement xsk sub-commands in ndo_bpf for zero copy Rx
>
> Magnus Karlsson (8):
>   xsk: add bind support and introduce Rx functionality
>   xsk: introduce Tx functionality
>   netdevice: added XDP_{UN,}REGISTER_XSK command to ndo_bpf
>   netdevice: added ndo for transmitting a packet from an XDP socket
>   xsk: add support for zero copy Tx
>   i40e: introduced a clean_tx callback function
>   i40e: introduced Tx completion callbacks
>   i40e: Tx support for zero copy allocator
>
>  drivers/net/ethernet/intel/i40e/Makefile          |    3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h            |   24 +
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.c  |  580 +++++++++++
>  drivers/net/ethernet/intel/i40e/i40e_buff_pool.h  |   15 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c    |    1 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c       |  541 +++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c       |  906 +++++++++--------
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h       |  119 ++-
>  include/linux/buff_pool.h                         |  136 +++
>  include/linux/filter.h                            |    3 +-
>  include/linux/netdevice.h                         |   25 +
>  include/linux/socket.h                            |    5 +-
>  include/net/xdp.h                                 |    1 +
>  include/net/xdp_sock.h                            |   60 ++
>  include/uapi/linux/bpf.h                          |    6 +-
>  include/uapi/linux/if_xdp.h                       |   72 ++
>  net/Kconfig                                       |    1 +
>  net/Makefile                                      |    1 +
>  net/core/dev.c                                    |   28 +-
>  net/core/filter.c                                 |   88 +-
>  net/core/sock.c                                   |   12 +-
>  net/xdp/Kconfig                                   |    7 +
>  net/xdp/Makefile                                  |    1 +
>  net/xdp/xsk.c                                     | 1142 ++++++++++++++++++++++
>  net/xdp/xsk.h                                     |   31 +
>  net/xdp/xsk_buff.h                                |  161 +++
>  net/xdp/xsk_buff_pool.c                           |  225 +++++
>  net/xdp/xsk_buff_pool.h                           |   17 +
>  net/xdp/xsk_packet_array.c                        |   62 ++
>  net/xdp/xsk_packet_array.h                        |  399 ++++++++
>  net/xdp/xsk_ring.c                                |   61 ++
>  net/xdp/xsk_ring.h                                |  419 ++++++++
>  net/xdp/xsk_user_queue.h                          |   24 +
>  samples/bpf/Makefile                              |    4 +
>  samples/bpf/xdpsock_kern.c                        |   11 +
>  samples/bpf/xdpsock_queue.h                       |   62 ++
>  samples/bpf/xdpsock_user.c                        |  642 ++++++++++++
>  security/selinux/hooks.c                          |    4 +-
>  security/selinux/include/classmap.h               |    4 +-
>  tools/testing/selftests/bpf/bpf_helpers.h         |    2 +
>  40 files changed, 5408 insertions(+), 497 deletions(-)
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.c
>  create mode 100644 drivers/net/ethernet/intel/i40e/i40e_buff_pool.h
>  create mode 100644 include/linux/buff_pool.h
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk.h
>  create mode 100644 net/xdp/xsk_buff.h
>  create mode 100644 net/xdp/xsk_buff_pool.c
>  create mode 100644 net/xdp/xsk_buff_pool.h
>  create mode 100644 net/xdp/xsk_packet_array.c
>  create mode 100644 net/xdp/xsk_packet_array.h
>  create mode 100644 net/xdp/xsk_ring.c
>  create mode 100644 net/xdp/xsk_ring.h
>  create mode 100644 net/xdp/xsk_user_queue.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_queue.h
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1