Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <vla...@mellanox.com> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad

Would you mind sharing a little more information with me:
1) output of /sys/kernel/debug/percpu_stats before and after a run
2) a full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(), but my
immediate guess is that it is due to constantly breaking the hint.

>
> Hi Vlad
>
> I guess this is more a question for per-cpu allocator experts / maintainers ?
>
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
>

The alignment request seems reasonable. But as Tejun mentioned in a reply
to this, the overhead of forced alignment would be both in percpu memory
itself and in allocation time due to the stricter requirement.

Thanks,
Dennis
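
As a rough illustration of the memory side of the alignment overhead
mentioned above, here is a toy sketch. It is not the percpu allocator and
says nothing about allocation time; it is a simple bump allocator, and the
names toy_region, toy_alloc and padding_for are made up for this example.
It assumes 16-byte stats objects interleaved with unrelated 8-byte
allocations, which is only one possible allocation pattern.

```c
#include <stddef.h>

/* Round off up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t off, size_t align)
{
	return (off + align - 1) & ~(align - 1);
}

/*
 * Toy bump allocator over a single region (hypothetical, for illustration).
 * It only tracks the next free offset and the bytes skipped for alignment.
 */
struct toy_region {
	size_t next;    /* next free offset */
	size_t padding; /* bytes skipped to satisfy alignment requests */
};

/* Hand out size bytes at the next offset that satisfies align. */
size_t toy_alloc(struct toy_region *r, size_t size, size_t align)
{
	size_t start = align_up(r->next, align);

	r->padding += start - r->next;
	r->next = start + size;
	return start;
}

/*
 * Interleave an 8-byte object (8-byte aligned) with a 16-byte object
 * aligned to stats_align, and return the total padding after rounds pairs.
 */
size_t padding_for(size_t stats_align, int rounds)
{
	struct toy_region r = { 0, 0 };

	for (int i = 0; i < rounds; i++) {
		toy_alloc(&r, 8, 8);
		toy_alloc(&r, 16, stats_align);
	}
	return r.padding;
}
```

In this toy model, padding_for(8, 1000) returns 0 while padding_for(16, 1000)
returns 8000, i.e. 8 bytes lost per 16-byte object. The real allocator can
reuse such holes for later small allocations, so treat this as an upper-bound
style intuition for this pattern rather than a measurement.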