On Mon, 1 Oct 2018 14:20:21 +0300 Ilias Apalodimas <ilias.apalodi...@linaro.org> wrote:
> On Mon, Oct 01, 2018 at 01:03:13PM +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 1 Oct 2018 12:56:58 +0300
> > Ilias Apalodimas <ilias.apalodi...@linaro.org> wrote:
> >
> > > > > #2: You have allocations on the XDP fast-path.
> > > > >
> > > > > The REAL secret behind the XDP performance is to avoid allocations
> > > > > on the fast-path.  While I just told you to use the page-allocator
> > > > > and order-0 pages, this will actually kill performance.  Thus, to
> > > > > make this fast, you need a driver-local recycle scheme that avoids
> > > > > going through the page allocator, which makes XDP_DROP and XDP_TX
> > > > > extremely fast.  For the XDP_REDIRECT action (which you seem to be
> > > > > interested in, as this is needed for AF_XDP), there is an
> > > > > xdp_return_frame() API that can make this fast.
> > > >
> > > > I had an initial implementation that did exactly that (that's why the
> > > > dma_sync_single_for_cpu() -> dma_unmap_single_attrs() is there). In
> > > > the case of AF_XDP isn't that introducing a 'bottleneck' though? I
> > > > mean you'll feed fresh buffers back to the hardware only when your
> > > > packets have been processed by your userspace application.
> > >
> > > Just a clarification here. This is the case if ZC is implemented. In
> > > my case the buffers will be 'ok' to be passed back to the hardware
> > > once the userspace payload has been copied by xdp_do_redirect().
> >
> > Thanks for clarifying. But no, this is not introducing a 'bottleneck'
> > for AF_XDP.
> >
> > For (1) the copy-mode-AF_XDP the frame (as you noticed) is "freed" or
> > "returned" very quickly after it is copied. The code is a bit hard to
> > follow, but in __xsk_rcv() it calls xdp_return_buff() after the memcpy.
> > Thus, the frame can be kept DMA mapped and reused in the RX-ring
> > quickly.
>
> Ok, makes sense. I'll send a v4 with page reuse, while using your API
> for page allocation.

Sounds good, BUT do notice that using the bare page_pool will/should give
you increased XDP performance, but might slow down normal network stack
delivery, because netstack will not call xdp_return_frame() and instead
falls back to returning the pages through the page allocator.

I'm very interested in knowing what performance increase you see with
XDP_DROP, with just a "bare" page_pool implementation.

The mlx5 driver does not see this netstack slowdown, because it has a
hybrid approach of maintaining a recycle ring for frames going into
netstack, by bumping the refcnt.  I think Tariq is cleaning this up.

The mlx5 code is hard to follow... in mlx5e_xdp_handle()[1] the refcnt==1
and a bit is set.  And in [2] the refcnt is bumped via page_ref_inc(), and
the bit is caught in [3].  (This really needs to be cleaned up and
generalized.)

[1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c#L83-L88
    https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L952-L959

[2] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1015-L1025

[3] https://github.com/torvalds/linux/blob/v4.18/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L1094-L1098

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
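For readers unfamiliar with the page_pool + xdp_return_frame() pattern
discussed above, a minimal sketch of the registration side might look
roughly like the snippet below.  The mydrv_* names, struct layout and
parameters are made up for illustration and error handling is abbreviated;
only page_pool_create(), page_pool_dev_alloc_pages(), page_pool_destroy()
and xdp_rxq_info_reg_mem_model() are the real kernel APIs referred to in
the mail.

#include <linux/err.h>
#include <net/page_pool.h>
#include <net/xdp.h>

struct mydrv_rxq {
	struct page_pool *page_pool;
	struct xdp_rxq_info xdp_rxq;	/* assumed already xdp_rxq_info_reg()'ed */
};

static int mydrv_rxq_pool_setup(struct mydrv_rxq *rxq, struct device *dev,
				unsigned int pool_size)
{
	struct page_pool_params pp_params = {
		.order		= 0,		/* order-0 pages, as discussed */
		.pool_size	= pool_size,	/* e.g. the RX ring size */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
	};
	int err;

	rxq->page_pool = page_pool_create(&pp_params);
	if (IS_ERR(rxq->page_pool))
		return PTR_ERR(rxq->page_pool);

	/* Register the pool as this RX queue's memory model, so that
	 * xdp_return_frame() (XDP_DROP, XDP_REDIRECT completions, ...)
	 * recycles pages into the pool instead of the page allocator.
	 * Pages that end up as SKBs still go through the normal page
	 * allocator, which is the netstack slow-down mentioned above.
	 */
	err = xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq, MEM_TYPE_PAGE_POOL,
					 rxq->page_pool);
	if (err) {
		page_pool_destroy(rxq->page_pool);
		return err;
	}
	return 0;
}

/* RX refill: get a (possibly recycled) page from the pool. */
static struct page *mydrv_rx_alloc_page(struct mydrv_rxq *rxq)
{
	return page_pool_dev_alloc_pages(rxq->page_pool);
}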
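The mlx5-style hybrid recycle trick ("refcnt is bumped, a bit is set, and
the bit is caught later") can be illustrated very loosely as below.  This
is not the mlx5 code; all names are invented, and the real driver keeps
this state in its per-descriptor info and recycle ring.  Only
page_ref_inc() and page_ref_count() are real kernel helpers.

#include <linux/mm.h>		/* page_ref_inc(), page_ref_count() */

struct mydrv_rx_frame {
	struct page *page;
	bool xdp_owned;		/* the "bit": XDP (e.g. XDP_TX) kept the page */
};

/* Frame is handed to the netstack as an SKB: bump the refcount so the
 * driver keeps its own reference and can try to recycle the page locally
 * once the SKB has been freed.
 */
static void mydrv_frame_to_netstack(struct mydrv_rx_frame *frame)
{
	page_ref_inc(frame->page);
}

/* At RX-ring refill time: the page may be reused in place only if the
 * driver's reference is the last one left and XDP does not still own it.
 */
static bool mydrv_frame_can_recycle(const struct mydrv_rx_frame *frame)
{
	return !frame->xdp_owned && page_ref_count(frame->page) == 1;
}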