From: Ilias Apalodimas <ilias.apalodi...@linaro.org>
Date: Wed, 24 Mar 2021 09:50:38 +0200

> Hi Alexander,

Hi!

> On Tue, Mar 23, 2021 at 08:03:46PM +0000, Alexander Lobakin wrote:
> > From: Ilias Apalodimas <ilias.apalodi...@linaro.org>
> > Date: Tue, 23 Mar 2021 19:01:52 +0200
> >
> > > On Tue, Mar 23, 2021 at 04:55:31PM +0000, Alexander Lobakin wrote:

[...]

> > > > > > > > Thanks for the testing!
> > > > > > > > Any chance you can get a perf measurement on this?
> > > > > > >
> > > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > >
> > > > > > Yea,
> > > > > > As hinted below, I am just trying to figure out if on
> > > > > > Alexander's platform the cost of syncing is bigger than
> > > > > > free-allocate. I remember one armv7 where that was the case.
> > > > > >
> > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > > >
> > > > > > > (+1 this is an important question)
> > > >
> > > > Sure, I'll drop perf tools to my test env and share the results,
> > > > maybe tomorrow or in a few days.
> >
> > Oh we-e-e-ell...
> > Looks like I've been fooled by I-cache misses or smth like that.
> > That happens sometimes, not only on my machines, and not only on
> > MIPS if I'm not mistaken.
> > Sorry for confusing you guys.
> >
> > I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> > CONFIG_PERF_EVENTS for perf tools.
> > The only difference in code is that I rebased onto Mel's
> > mm-bulk-rebase-v6r4.
> >
> > (lunar is my WIP NIC driver)
> >
> > 1. 5.12-rc3 baseline:
> >
> > TCP: 566 Mbps
> > UDP: 615 Mbps
> >
> > perf top:
> >  4.44%  [lunar]         [k] lunar_rx_poll_page_pool
> >  3.56%  [kernel]        [k] r4k_wait_irqoff
> >  2.89%  [kernel]        [k] free_unref_page
> >  2.57%  [kernel]        [k] dma_map_page_attrs
> >  2.32%  [kernel]        [k] get_page_from_freelist
> >  2.28%  [lunar]         [k] lunar_start_xmit
> >  1.82%  [kernel]        [k] __copy_user
> >  1.75%  [kernel]        [k] dev_gro_receive
> >  1.52%  [kernel]        [k] cpuidle_enter_state_coupled
> >  1.46%  [kernel]        [k] tcp_gro_receive
> >  1.35%  [kernel]        [k] __rmemcpy
> >  1.33%  [nf_conntrack]  [k] nf_conntrack_tcp_packet
> >  1.30%  [kernel]        [k] __dev_queue_xmit
> >  1.22%  [kernel]        [k] pfifo_fast_dequeue
> >  1.17%  [kernel]        [k] skb_release_data
> >  1.17%  [kernel]        [k] skb_segment
> >
> > free_unref_page() and get_page_from_freelist() consume a lot.
> >
> > 2. 5.12-rc3 + Page Pool recycling by Matteo:
> >
> > TCP: 589 Mbps
> > UDP: 633 Mbps
> >
> > perf top:
> >  4.27%  [lunar]         [k] lunar_rx_poll_page_pool
> >  2.68%  [lunar]         [k] lunar_start_xmit
> >  2.41%  [kernel]        [k] dma_map_page_attrs
> >  1.92%  [kernel]        [k] r4k_wait_irqoff
> >  1.89%  [kernel]        [k] __copy_user
> >  1.62%  [kernel]        [k] dev_gro_receive
> >  1.51%  [kernel]        [k] cpuidle_enter_state_coupled
> >  1.44%  [kernel]        [k] tcp_gro_receive
> >  1.40%  [kernel]        [k] __rmemcpy
> >  1.38%  [nf_conntrack]  [k] nf_conntrack_tcp_packet
> >  1.37%  [kernel]        [k] free_unref_page
> >  1.35%  [kernel]        [k] __dev_queue_xmit
> >  1.30%  [kernel]        [k] skb_segment
> >  1.28%  [kernel]        [k] get_page_from_freelist
> >  1.27%  [kernel]        [k] r4k_dma_cache_inv
> >
> > +20 Mbps increase on both TCP and UDP. free_unref_page() and
> > get_page_from_freelist() dropped down the list significantly.
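The driver-side opt-in to the recycling path is small, by the way.
A minimal sketch of the idea -- not lunar's actual Rx code, the
lunar_rx_queue layout is a placeholder, and skb_mark_for_recycle()'s
exact signature varied between revisions of the series:

	static struct sk_buff *lunar_rx_build_skb(struct lunar_rx_queue *rxq)
	{
		struct sk_buff *skb;
		struct page *page;

		/* Pages come from the queue's page_pool rather than
		 * straight from the buddy allocator.
		 */
		page = page_pool_dev_alloc_pages(rxq->page_pool);
		if (unlikely(!page))
			return NULL;

		/* ... sync the buffer for the CPU, let HW fill it ... */

		skb = build_skb(page_address(page), PAGE_SIZE);
		if (unlikely(!skb))
			return NULL;

		/* Flag the skb so that on free its pages go back into
		 * the page_pool instead of through free_unref_page().
		 */
		skb_mark_for_recycle(skb);

		return skb;
	}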
> > 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> >
> > TCP: 596 Mbps
> > UDP: 641 Mbps
> >
> > perf top:
> >  4.38%  [lunar]         [k] lunar_rx_poll_page_pool
> >  3.34%  [kernel]        [k] r4k_wait_irqoff
> >  3.14%  [kernel]        [k] dma_map_page_attrs
> >  2.49%  [lunar]         [k] lunar_start_xmit
> >  1.85%  [kernel]        [k] dev_gro_receive
> >  1.76%  [kernel]        [k] free_unref_page
> >  1.76%  [kernel]        [k] __copy_user
> >  1.65%  [kernel]        [k] inet_gro_receive
> >  1.57%  [kernel]        [k] tcp_gro_receive
> >  1.48%  [kernel]        [k] cpuidle_enter_state_coupled
> >  1.43%  [nf_conntrack]  [k] nf_conntrack_tcp_packet
> >  1.42%  [kernel]        [k] __rmemcpy
> >  1.25%  [kernel]        [k] skb_segment
> >  1.21%  [kernel]        [k] r4k_dma_cache_inv
> >
> > +10 Mbps on top of recycling.
> > get_page_from_freelist() is gone.
> > NAPI polling, the CPU idle loop (r4k_wait_irqoff) and the DMA mapping
> > routine became the top consumers.

> Again, thanks for the extensive testing.
> I assume you don't use page pool to map the buffers, right?
> Because if the mapping is preserved, the only thing you have to do is
> sync it after the packet reception.

No, I use Page Pool for both DMA mapping and syncing for device.
The reason why DMA mapping takes a lot of CPU is that I test NATing,
so the NIC first receives the frames and then xmits them with modified
headers -> this DMA map overhead is from lunar_start_xmit(), not the
Rx path. The actual Rx syncing is r4k_dma_cache_inv(), and it's not
that expensive.
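Setting the pool up for that is just a couple of flags. Roughly -- a
simplified sketch rather than the exact lunar code, with the LUNAR_*
constants as placeholders:

	struct page_pool_params pp_params = {
		/* page_pool owns both the DMA mapping and the
		 * for-device sync of pages returned to the pool
		 */
		.flags		= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.order		= 0,
		.pool_size	= LUNAR_RX_RING_SIZE,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,		/* the NIC's struct device */
		.dma_dir	= DMA_FROM_DEVICE,
		/* sync only the HW-writable area, skipping the headroom */
		.offset		= LUNAR_RX_HEADROOM,
		.max_len	= LUNAR_RX_BUF_LEN,
	};
	struct page_pool *pool = page_pool_create(&pp_params);

	if (IS_ERR(pool))
		return PTR_ERR(pool);

With PP_FLAG_DMA_SYNC_DEV the pool syncs max_len bytes past offset for
the device whenever a page comes back to it, so the Rx path only has
to sync for CPU (the r4k_dma_cache_inv() above). And since the Rx
mapping is DMA_FROM_DEVICE, xmitting the same buffer still takes a
separate mapping -- that's the dma_map_page_attrs() cost sitting in
lunar_start_xmit().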
> > 4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
> > removing 'noinline' from net/core/page_pool.c etc.
> >
> > ...make no sense anymore.
> > I see Mel took Jesper's patch to make __rmqueue_pcplist() inline into
> > mm-bulk-rebase-v6r5, not sure if it's really needed now.
> >
> > So I'm really glad we sorted things out and I can see the real
> > performance improvements from both recycling and bulk allocations.

> Those will probably be even bigger with an io(s)mmu present.

Sure, DMA mapping is way more expensive through IOMMUs. I don't have
one on my boards, so I can't collect any useful info.

> [...]
>
> Cheers
> /Ilias

Thanks,
Al