On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
> On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> wrote:
> ... ...
> > Section copied out:
> >
> >   mlx5e_poll_tx_cq
> >   |
> >    --16.34%--napi_consume_skb
> >              |
> >              |--12.65%--__free_pages_ok
> >              |          |
> >              |           --11.86%--free_one_page
> >              |                     |
> >              |                     |--10.10%--queued_spin_lock_slowpath
> >              |                     |
> >              |                      --0.65%--_raw_spin_lock
>
> This callchain looks like it is freeing pages of a higher order than
> 0: __free_pages_ok() is only called for pages whose order is greater
> than 0.
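For reference, the split Aaron is pointing at is this dispatch in
mm/page_alloc.c (a simplified sketch from around v4.19; details
elided):

/* Simplified from mm/page_alloc.c, ~v4.19. */
void __free_pages(struct page *page, unsigned int order)
{
	if (put_page_testzero(page)) {
		if (order == 0)
			/* per-cpu free list; usually no zone->lock */
			free_unref_page(page);
		else
			/* free_one_page() under zone->lock */
			__free_pages_ok(page, order);
	}
}

So if __free_pages_ok() is this hot, the TX completion path really is
freeing order > 0 pages.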
mlx5 RX uses only order-0 pages, so I don't know where these
high-order TX SKBs are coming from..

> >              |
> >              |--1.55%--page_frag_free
> >              |
> >               --1.44%--skb_release_data
> >
> > Let me explain what (I think) happens. The mlx5 driver RX-page
> > recycle mechanism is not effective in this workload, so pages have
> > to go through the page allocator. The lock contention happens
> > during the mlx5 DMA TX completion cycle, and the page allocator
> > cannot keep up at these speeds.
> >
> > One solution is to extend the page allocator with a bulk free API.
> > (This has been on my TODO list for a long time, but I don't have a
> > micro-benchmark that tricks the driver page-recycle into failing.)
> > It should fit nicely, as I can see that kmem_cache_free_bulk()
> > does get activated (bulk freeing SKBs), which means that DMA TX
> > completion does see a bulk of packets.
> >
> > We can (and should) also improve the page recycle scheme in the
> > driver. After LPC, I have a project with Tariq and Ilias (Cc'ed)
> > to improve the page_pool, and we will attempt to generalize this,
> > for both high-end mlx5 and more low-end ARM64 boards (macchiatobin
> > and espressobin).
> >
> > The MM people are working in parallel to improve the performance
> > of order-0 page returns. Thus, the explicit page bulk free API
> > might actually become less important. I actually think (Cc'ed)
> > Aaron has a patchset he would like you to test, which removes the
> > (zone->)lock you hit in free_one_page().
>
> Thanks Jesper.
>
> Yes, the said patchset is in this branch:
> https://github.com/aaronlu/linux no_merge_cluster_alloc_4.19-rc5
>
> But as I said above, I think the lock contention here is for
> order > 0 pages, so my current patchset will not help here,
> unfortunately.
>
> BTW, Mel Gorman has suggested an alternative way to improve the
> page allocator's scalability and I'm working on it right now; it
> will improve the page allocator's scalability for pages of all
> orders. I might be able to post it some time next week and will CC
> all of you when it's ready.
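To make the bulk free idea above concrete, here is a rough sketch of
what such an API could look like. To be clear, this is hypothetical:
no such API exists in the tree, the name free_pages_bulk() and its
signature are made up, and it reuses __free_one_page(), the static
buddy-insert helper inside mm/page_alloc.c. The point is just that
zone->lock is taken once per batch of TX completions instead of once
per page:

/*
 * HYPOTHETICAL sketch -- no such API exists; name and signature are
 * made up for illustration.  Would live in mm/page_alloc.c, next to
 * __free_one_page().  All pages are assumed to have the same order
 * and come from the same zone.
 */
void free_pages_bulk(struct page **pages, unsigned int count,
		     unsigned int order)
{
	struct zone *zone = page_zone(pages[0]);
	unsigned long flags;
	unsigned int i;

	/* One lock round-trip per batch, not per page. */
	spin_lock_irqsave(&zone->lock, flags);
	for (i = 0; i < count; i++) {
		unsigned long pfn = page_to_pfn(pages[i]);

		/* Return the page to the buddy lists; lock already held. */
		__free_one_page(pages[i], pfn, zone, order,
				get_pfnblock_migratetype(pages[i], pfn));
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}

A real version would also have to cope with batches that mix zones,
orders and migratetypes; this sketch assumes a uniform batch.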
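On the driver-side recycle scheme: the page_pool API Jesper mentions
already exists in the tree (include/net/page_pool.h, since ~v4.18),
even if the generalization work is still ahead. Roughly, a driver
uses it like this. This is a sketch against the ~v4.18 header; field
names and signatures have shifted between kernel versions, and
rx_pool_init(), rx_refill() and rx_done() are made-up wrappers:

/*
 * Sketch of page_pool usage, per include/net/page_pool.h around
 * v4.18 -- treat as illustrative only.
 */
#include <net/page_pool.h>

static struct page_pool *rx_pool;

int rx_pool_init(struct device *dev)
{
	struct page_pool_params pp_params = {
		.order		= 0,			/* order-0 pages only */
		.flags		= PP_FLAG_DMA_MAP,	/* pool maps pages for DMA */
		.pool_size	= 1024,			/* size of the recycle cache */
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	rx_pool = page_pool_create(&pp_params);
	return IS_ERR(rx_pool) ? PTR_ERR(rx_pool) : 0;
}

/* RX refill: pages come from the pool's recycle cache when possible. */
struct page *rx_refill(void)
{
	return page_pool_dev_alloc_pages(rx_pool);
}

/* Completion: hand the page back for recycling, not to the allocator. */
void rx_done(struct page *page)
{
	page_pool_put_page(rx_pool, page);
}

Because returned pages go back into the pool's cache instead of the
page allocator, the hot path avoids zone->lock entirely as long as
recycling keeps up.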