On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
> On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
>
> > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
> >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
> >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
> >>>> On 11 Aug 2018, at 21:06, David Miller wrote:
> >>>>
> >>>>> From: Jakub Kicinski <jakub.kicin...@netronome.com>
> >>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
> >>>>>
> >>>>>> It is not immediately clear why this is needed. The memory and
> >>>>>> the updating of two sets of counters won't come for free, so
> >>>>>> perhaps a stronger justification than troubleshooting is due? :S
> >>>>>>
> >>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd
> >>>>>> know that traffic hits the SW datapath, plus the rules which are
> >>>>>> in_hw will most likely not match as of today for flower
> >>>>>> (assuming correctness).
> >>>>
> >>>> I strongly believe that these counters are a requirement for a
> >>>> mixed software/hardware (flow) based forwarding environment. The
> >>>> global counters will not help much here, as you might have chosen
> >>>> to have certain traffic forwarded by software.
> >>>>
> >>>> These counters are probably the only option you have to figure
> >>>> out why forwarding is not as fast as expected and whether the TC
> >>>> offload NIC is to blame.
> >>>
> >>> The suggested debugging flow would be:
> >>>  (1) check that the global fallback counters are incrementing;
> >>>  (2) find a flow with high stats but no in_hw flag set.
> >>>
> >>> The in_hw indication should be sufficient in most cases (unless
> >>> there are shared blocks between netdevs of different ASICs...).
> >>
> >> I guess the aim is to find misbehaving hardware, i.e. hardware with
> >> the in_hw flag set but flows still coming to the kernel.
> >
> > For misbehaving hardware in_hw will indeed not work. Whether we need
> > these extra always-on stats for such a use case could be debated :)
> >
> >>>>>> I'm slightly concerned about the potential performance impact;
> >>>>>> would you be able to share some numbers for a non-trivial
> >>>>>> number of flows (100k active?)?
> >>>>>
> >>>>> Agreed, features used for diagnostics cannot have a harmful
> >>>>> penalty for fast path performance.
> >>>>
> >>>> Fast path performance is not affected, as these counters are not
> >>>> incremented there. They are only incremented by the NIC driver
> >>>> when it gathers its statistics from hardware.
> >>>
> >>> Not by much, though you are adding state to performance-critical
> >>> structures for what is effectively a debugging purpose.
> >>>
> >>> I was mostly talking about the HW offload stat updates (sorry for
> >>> not being clear).
> >>>
> >>> We can have some hundreds of thousands of active offloaded flows,
> >>> each of them can have multiple actions, and stats have to be
> >>> updated multiple times per second and dumped probably around once
> >>> a second, too. On a busy system the stats will get evicted from
> >>> cache between each round.
> >>>
> >>> But I'm speculating; let's see if I can get some numbers on it (if
> >>> you could get some too, that would be great!).
> >>
> >> I’ll try to measure some of this later this week/early next week.
> >
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded.
> > The update function does not show up in traces at all. It seems
> > that under stress (generated with stress-ng) the thread dumping the
> > stats in userspace (in OvS it would be the revalidator) actually
> > consumes less CPU in __gnet_stats_copy_basic (0.4% less for ~2.0%
> > total).
> >
> > Would this match your results? I'm not sure why dumping would be
> > faster with your change...
>
> I tested with OVS and https://github.com/chaudron/ovs_perf using 300K
> TC rules installed in HW.
>
> For __gnet_stats_copy_basic() being faster I have (had) a theory:
> this function is now called twice, and I assumed the first call would
> pull the data into the cache, making the second call faster.
>
> Sampling a lot of perf data, I get an average of 1115ns with the base
> kernel and 954ns with the fix applied, so about 14% faster.
>
> I thought I would also perf tcf_action_copy_stats(), as it is the
> place updating the additional counter. But even in this case, I see
> better performance with the patch applied.
>
> On average 13581ns with the fix vs roughly 13900ns with the base
> kernel, so about 2.3% faster.
>
> I guess the changes to the tc_action structure resulted in better
> cache alignment.
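
If anyone wants to double-check the cache-alignment guess, pahole on a
built kernel (e.g. pahole -C tc_action vmlinux, before and after the
patch) will show exactly which 64-byte line each member of struct
tc_action lands on. A quick userspace way to eyeball the same idea is
to print member offsets against the cache-line size; the struct below
is a made-up stand-in for illustration only, not the real tc_action
layout:

/*
 * Stand-in only: the fields are invented and do NOT mirror the real
 * struct tc_action.  The point is just the technique of checking
 * which 64-byte cache line each member falls on.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct fake_tc_action {
        void            *ops;
        uint32_t        index;
        int             refcnt;
        int             bindcnt;
        char            lock[4];        /* placeholder for a per-action lock */
        char            filler[40];     /* stand-in for the rest of the state */
        uint64_t        sw_bytes;       /* existing SW counters */
        uint64_t        sw_packets;
        uint64_t        hw_bytes;       /* hypothetical new HW counters */
        uint64_t        hw_packets;
        uint64_t        lastuse;
};

#define SHOW(type, field)                                               \
        printf("%-12s offset %3zu  cache line %zu\n", #field,           \
               offsetof(type, field), offsetof(type, field) / 64)

int main(void)
{
        SHOW(struct fake_tc_action, ops);
        SHOW(struct fake_tc_action, sw_bytes);
        SHOW(struct fake_tc_action, sw_packets);
        SHOW(struct fake_tc_action, hw_bytes);
        SHOW(struct fake_tc_action, hw_packets);
        SHOW(struct fake_tc_action, lastuse);
        printf("sizeof(struct fake_tc_action) = %zu\n",
               sizeof(struct fake_tc_action));
        return 0;
}
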
Interesting that you could reproduce the speed-up too! +1 for the
guess. It seems my caution about slowing down SW paths to support HW
offload landed on a very unfortunate patch :)
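
On the earlier "called twice, the first call warms the cache" theory
for __gnet_stats_copy_basic(): that effect is easy to sanity-check
outside the kernel with a toy model. The sketch below is entirely my
own stand-in (the struct, the entry count and the crude cache eviction
have nothing to do with the kernel code or your setup); it evicts an
array of byte/packet counters from cache, sums it twice and reports
the per-entry cost of each pass. How big the cold/warm gap comes out
depends on the cache hierarchy of the test box, so treat it as a
sanity check rather than a reproduction of the numbers above.

/*
 * Toy model of "the same counters are read twice and the first read
 * warms the cache".  Nothing below is kernel code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NENTRIES        (50 * 1000)             /* ~3 MB, should fit in L3 */
#define EVICT_BYTES     (64 * 1024 * 1024)      /* should exceed the LLC   */

struct fake_stats {
        uint64_t bytes;
        uint64_t packets;
        char     pad[48];       /* one entry per 64-byte cache line */
};

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static uint64_t sum_pass(const struct fake_stats *s, size_t n)
{
        uint64_t sum = 0;

        for (size_t i = 0; i < n; i++)
                sum += s[i].bytes + s[i].packets;
        return sum;
}

int main(void)
{
        struct fake_stats *stats = malloc(NENTRIES * sizeof(*stats));
        char *evict = malloc(EVICT_BYTES);
        uint64_t t0, t1, t2, sum;

        if (!stats || !evict)
                return 1;

        for (size_t i = 0; i < NENTRIES; i++) { /* populate the counters */
                stats[i].bytes = i;
                stats[i].packets = i;
        }
        memset(evict, 1, EVICT_BYTES);  /* crude: push stats out of cache */

        t0 = now_ns();
        sum = sum_pass(stats, NENTRIES);        /* cold, like the 1st copy */
        t1 = now_ns();
        sum += sum_pass(stats, NENTRIES);       /* warm, like the 2nd copy */
        t2 = now_ns();

        printf("first  pass: %.2f ns/entry\n", (double)(t1 - t0) / NENTRIES);
        printf("second pass: %.2f ns/entry (checksum %llu)\n",
               (double)(t2 - t1) / NENTRIES, (unsigned long long)sum);
        free(evict);
        free(stats);
        return 0;
}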