This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames. When the kernel is compiled with CONFIG_RETPOLINE, every indirect function pointer call hurts performance, and for XDP this has a huge negative performance impact.
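
For reference, a rough sketch of the signature change (simplified; see the
last patch for the actual definition in include/linux/netdevice.h):

	/* Before: one indirect call per frame */
	int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);

	/* After: one indirect call per bulk; returns the number of
	 * frames transmitted, so the caller can account for drops.
	 */
	int (*ndo_xdp_xmit)(struct net_device *dev, int n,
			    struct xdp_frame **frames);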
This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but also prepares for further optimizations. The DMA API's use of indirect function pointer calls is the primary source of the regression; using bulking calls towards the DMA API (via the scatter-gather calls) is left for a follow-up patchset.

The other advantage of this API change is that drivers can more easily amortize the cost of any sync/locking scheme over the bulk of packets; see the sketch after the diffstat below.

The assumption of the current API is that the driver implementing the NDO will also allocate a dedicated XDP TX queue for every CPU in the system, which is not always possible or practical to configure. E.g. ixgbe cannot load an XDP program on a machine with more than 96 CPUs, due to limited hardware TX queues. E.g. virtio_net is hard to configure, as it requires manually increasing the number of queues. E.g. the tun driver chooses to take a producer lock per XDP frame, on a queue selected via smp_processor_id() modulo the available queues.

---

Jesper Dangaard Brouer (4):
      bpf: devmap introduce dev_map_enqueue
      bpf: devmap prepare xdp frames for bulking
      xdp: add tracepoint for devmap like cpumap have
      xdp: change ndo_xdp_xmit API to support bulking


 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   26 ++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++-
 drivers/net/tun.c                             |   37 ++++---
 drivers/net/virtio_net.c                      |   66 +++++++++---
 include/linux/bpf.h                           |   15 ++-
 include/linux/netdevice.h                     |   14 ++-
 include/net/xdp.h                             |    1 
 include/trace/events/xdp.h                    |   50 +++++++++
 kernel/bpf/devmap.c                           |  134 ++++++++++++++++++++++++-
 net/core/filter.c                             |   19 +---
 net/core/xdp.c                                |   14 ++-
 samples/bpf/xdp_monitor_kern.c                |   49 +++++++++
 samples/bpf/xdp_monitor_user.c                |   69 +++++++++++++
 14 files changed, 436 insertions(+), 81 deletions(-)
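
As mentioned above, bulking lets a driver pay its locking and doorbell cost
once per bulk instead of once per frame. A hypothetical sketch of a driver's
NDO (the mydrv_* names, queue selection, and helpers are purely illustrative
and not part of this patchset; xdp_return_frame is the real helper from
include/net/xdp.h):

	static int mydrv_xdp_xmit(struct net_device *dev, int n,
				  struct xdp_frame **frames)
	{
		struct mydrv_txq *txq = mydrv_select_txq(dev);
		int i, drops = 0;

		spin_lock(&txq->lock);	/* one lock acquisition per bulk */
		for (i = 0; i < n; i++) {
			if (mydrv_queue_frame(txq, frames[i]) < 0) {
				/* ring full: free frame, count as drop */
				xdp_return_frame(frames[i]);
				drops++;
			}
		}
		mydrv_kick_hw(txq);	/* single doorbell write per bulk */
		spin_unlock(&txq->lock);

		return n - drops;
	}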