Re: tc filter insertion rate degradation

Dennis Zhou Thu, 24 Jan 2019 09:22:05 -0800

Hi Vlad and Eric,

On Tue, Jan 22, 2019 at 09:33:10AM -0800, Eric Dumazet wrote:
> On Mon, Jan 21, 2019 at 3:24 AM Vlad Buslov <[email protected]> wrote:
> >
> > Hi Eric,
> >
> > I've been investigating significant tc filter insertion rate degradation
> > and it seems it is caused by your commit 001c96db0181 ("net: align
> > gnet_stats_basic_cpu struct"). With this commit insertion rate is
> > reduced from ~65k rules/sec to ~43k rules/sec when inserting 1m rules
> > from file in tc batch mode on my machine.
> >
> > Tc perf profile indicates that pcpu allocator now consumes 2x CPU:
> >
> > 1) Before:
> >
> > Samples: 63K of event 'cycles:ppp', Event count (approx.): 48796480071
> >   Children      Self  Co  Shared Object     Symbol
> > +   21.19%     3.38%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +    3.45%     0.25%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > 2) After:
> >
> > Samples: 92K of event 'cycles:ppp', Event count (approx.): 71446806550
> >   Children      Self  Co  Shared Object     Symbol
> > +   44.67%     3.99%  tc  [kernel.vmlinux]  [k] pcpu_alloc
> > +   19.25%     0.22%  tc  [kernel.vmlinux]  [k] pcpu_alloc_area
> >
> > It seems that it takes much more work for pcpu allocator to perform
> > allocation with new stricter alignment requirements. Not sure if it is
> > expected behavior or not in this case.
> >
> > Regards,
> > Vlad


Would you mind sharing a little more information with me:
1) the output of /sys/kernel/debug/percpu_stats before and after a run
2) the full perf output
3) a reproducer

I'm a little surprised we're spending time in pcpu_alloc_area(), but my
immediate guess is that we are constantly breaking the hint.

> 
> Hi Vlad
> 
> I guess this is more a question for per-cpu allocator experts / maintainers ?
> 
> 16-bytes alignment for 16-bytes objects sound quite reasonable [1]
> 

The alignment request seems reasonable. But as Tejun mentioned in a
reply to this, the overhead of forced alignment would be both in percpu
memory itself and in allocation time due to the stricter requirement.

Thanks,
Dennis
