On Mon, 22 Mar 2021 18:30:55 +0000 Alexander Lobakin <aloba...@pm.me> wrote:
> As per discussion in the Page Pool bulk allocator thread [0],
> there are two functions in the Page Pool core code that are marked
> 'noinline'. The reason for this is not so clear, and even if it
> was done to reduce hotpath overhead, in fact it only makes things
> worse.
> As both of these functions are called only once through the
> code, they could be inlined/folded into the non-static entry point.
> However, the 'noinline' marks effectively prevent that and
> induce totally unneeded fragmentation (baseline -> after removal):
>
> add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
> Function                         old     new   delta
> page_pool_alloc_pages            100    1124   +1024
> page_pool_dma_map                164       -    -164
> page_pool_refill_alloc_cache     332       -    -332
> __page_pool_alloc_pages_slow     600       -    -600
>
> (taken from Mel's branch, hence the factored-out page_pool_dma_map())

I see that the refactor of page_pool_dma_map() caused it to be uninlined; that was a mistake. Thanks for highlighting that again, as I had forgotten about it (even though I think Alex Duyck did point it out earlier).

I am considering whether we should allow the compiler to inline page_pool_refill_alloc_cache() and __page_pool_alloc_pages_slow() for the sake of performance, at the cost of losing the ability to diagnose the behavior from a perf report. Mind that page_pool avoids stats for the sake of performance, but these noinline marks make it possible to diagnose the behavior anyway.

> 1124 is a normal hotpath frame size, but these jumps between the tiny
> page_pool_alloc_pages(), page_pool_refill_alloc_cache() and
> __page_pool_alloc_pages_slow() are really redundant and harmful
> for performance.

Well, I disagree. (This is a NACK.)

If pages were recycled, the code would never have to visit __page_pool_alloc_pages_slow(). And today, without the bulk page-alloc (which we are working on adding together with Mel), we have to visit __page_pool_alloc_pages_slow() every time; that is a bad design, but I'm trying to fix it.
Matteo is working on recycling here [1]:

[1] https://lore.kernel.org/netdev/20210322170301.26017-1-mcr...@linux.microsoft.com/

It would be really great if you could try out his patchset, as it will help your driver avoid the slow path of the page_pool.

Given you are very detail-oriented, I do want to point out that Matteo's patchset is only the first step: to really improve performance for page_pool, we also need to bulk-return these page_pool pages (that requires some restructuring of the core code, which would be confusing at this point).

> This simple removal of the 'noinline' keywords bumps the throughput
> on XDP_PASS + napi_build_skb() + napi_gro_receive() by 25+ Mbps
> for a 1G embedded NIC.
>
> [0] https://lore.kernel.org/netdev/20210317222506.1266004-1-aloba...@pm.me
>
> Signed-off-by: Alexander Lobakin <aloba...@pm.me>
> ---
>  net/core/page_pool.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ad8b0707af04..589e4df6ef2b 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -102,7 +102,6 @@ EXPORT_SYMBOL(page_pool_create);
>
>  static void page_pool_return_page(struct page_pool *pool, struct page *page);
>
> -noinline
>  static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
>  {
>  	struct ptr_ring *r = &pool->ring;
> @@ -181,7 +180,6 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
>  }
>
>  /* slow path */
> -noinline
>  static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
>  						 gfp_t _gfp)
>  {
> --
> 2.31.0
>

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer