On Wed, 29 Aug 2018 11:43:47 +0200, Eelco Chaudron wrote:
> On 23 Aug 2018, at 20:14, Jakub Kicinski wrote:
>
> > On Mon, 20 Aug 2018 16:03:40 +0200, Eelco Chaudron wrote:
> >> On 17 Aug 2018, at 13:27, Jakub Kicinski wrote:
> >>> On Thu, 16 Aug 2018 14:02:44 +0200, Eelco Chaudron wrote:
> >>>> On 11 Aug 2018, at 21:06, David Miller wrote:
> >>>>
> >>>>> From: Jakub Kicinski <jakub.kicin...@netronome.com>
> >>>>> Date: Thu, 9 Aug 2018 20:26:08 -0700
> >>>>>
> >>>>>> It is not immediately clear why this is needed. The memory and
> >>>>>> the updating of two sets of counters won't come for free, so
> >>>>>> perhaps a stronger justification than troubleshooting is due? :S
> >>>>>>
> >>>>>> Netdev has counters for fallback vs forwarded traffic, so you'd
> >>>>>> know that traffic hits the SW datapath, plus the rules which are
> >>>>>> in_hw will most likely not match as of today for flower
> >>>>>> (assuming correctness).
> >>>>
> >>>> I strongly believe that these counters are a requirement for a
> >>>> mixed software/hardware (flow) based forwarding environment. The
> >>>> global counters will not help much here, as you might have chosen
> >>>> to have certain traffic forwarded by software.
> >>>>
> >>>> These counters are probably the only option you have to figure
> >>>> out why forwarding is not as fast as expected and whether the TC
> >>>> offload NIC is to blame.
> >>>
> >>> The suggested debugging flow would be:
> >>>  (1) check that the global fallback counters are incrementing;
> >>>  (2) find a flow with high stats but no in_hw flag set.
> >>>
> >>> The in_hw indication should be sufficient in most cases (unless
> >>> there are shared blocks between netdevs of different ASICs...).
> >>
> >> I guess the aim is to find misbehaving hardware, i.e. hardware with
> >> the in_hw flag set but flows still coming to the kernel.
> >
> > For misbehaving hardware in_hw will indeed not work. Whether we need
> > these extra always-on stats for such a use case could be debated :)
> >
> >>>>>> I'm slightly concerned about the potential performance impact;
> >>>>>> would you be able to share some numbers for a non-trivial
> >>>>>> number of flows (100k active?)?
> >>>>>
> >>>>> Agreed, features used for diagnostics cannot have a harmful
> >>>>> penalty for fast path performance.
> >>>>
> >>>> Fast path performance is not affected, as these counters are not
> >>>> incremented there. They are only incremented by the NIC driver
> >>>> when it gathers its statistics from hardware.
> >>>
> >>> Not by much, though you are adding state to performance-critical
> >>> structures for what is effectively a debugging purpose.
> >>>
> >>> I was mostly talking about the HW offload stat updates (sorry for
> >>> not being clear).
> >>>
> >>> We can have some hundreds of thousands of active offloaded flows,
> >>> each of them can have multiple actions, and stats have to be
> >>> updated multiple times per second and dumped probably around once
> >>> a second, too. On a busy system the stats will get evicted from
> >>> cache between each round.
> >>>
> >>> But I'm speculating; let's see if I can get some numbers on it (if
> >>> you could get some too, that would be great!).
> >>
> >> I’ll try to measure some of this later this week/early next week.
> >
> > I asked Louis to run some tests while I'm travelling, and he reports
> > that my worry about reporting the extra stats was unfounded.
> > The update function does not show up in traces at all. It seems
> > that under stress (generated with stress-ng) the thread dumping the
> > stats in userspace (in OvS it would be the revalidator) actually
> > consumes less CPU in __gnet_stats_copy_basic (0.4% less for ~2.0%
> > total).
> >
> > Would this match your results? I'm not sure why dumping would be
> > faster with your change...
>
> I tested with OVS and https://github.com/chaudron/ovs_perf using 300K
> TC rules installed in HW.
>
> For __gnet_stats_copy_basic() being faster I have (had) a theory:
> this function is now called twice, and I assumed the first call would
> pull the data into the cache, making the second call faster.
>
> Sampling a lot of perf data, I get an average of 1115ns with the base
> kernel and 954ns with the fix applied, so about 14% faster.
>
> I thought I would also perf tcf_action_copy_stats(), as it is the
> place updating the additional counter. But even in this case, I see
> better performance with the patch applied.
>
> On average 13581ns with the fix vs roughly 13900ns with the base
> kernel, so about 2.3% faster.
>
> I guess the changes to the tc_action structure resulted in better
> cache alignment.
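
If anyone wants to double-check the cache-alignment guess, pahole on a
built kernel (e.g. pahole -C tc_action vmlinux, before and after the
patch) will show exactly which 64-byte line each member of struct
tc_action lands on. A quick userspace way to eyeball the same idea is
to print member offsets against the cache-line size; the struct below
is a made-up stand-in for illustration only, not the real tc_action
layout:

/*
 * Stand-in only: the fields are invented and do NOT mirror the real
 * struct tc_action.  The point is just the technique of checking
 * which 64-byte cache line each member falls on.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct fake_tc_action {
        void            *ops;
        uint32_t        index;
        int             refcnt;
        int             bindcnt;
        char            lock[4];        /* placeholder for a per-action lock */
        char            filler[40];     /* stand-in for the rest of the state */
        uint64_t        sw_bytes;       /* existing SW counters */
        uint64_t        sw_packets;
        uint64_t        hw_bytes;       /* hypothetical new HW counters */
        uint64_t        hw_packets;
        uint64_t        lastuse;
};

#define SHOW(type, field)                                               \
        printf("%-12s offset %3zu  cache line %zu\n", #field,           \
               offsetof(type, field), offsetof(type, field) / 64)

int main(void)
{
        SHOW(struct fake_tc_action, ops);
        SHOW(struct fake_tc_action, sw_bytes);
        SHOW(struct fake_tc_action, sw_packets);
        SHOW(struct fake_tc_action, hw_bytes);
        SHOW(struct fake_tc_action, hw_packets);
        SHOW(struct fake_tc_action, lastuse);
        printf("sizeof(struct fake_tc_action) = %zu\n",
               sizeof(struct fake_tc_action));
        return 0;
}
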
Interesting that you could reproduce the speed-up too! +1 for the
guess. It seems my caution about slowing down SW paths to support HW
offload landed on a very unfortunate patch :)
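
On the earlier "called twice, the first call warms the cache" theory
for __gnet_stats_copy_basic(): that effect is easy to sanity-check
outside the kernel with a toy model. The sketch below is entirely my
own stand-in (the struct, the entry count and the crude cache eviction
have nothing to do with the kernel code or your setup); it evicts an
array of byte/packet counters from cache, sums it twice and reports
the per-entry cost of each pass. How big the cold/warm gap comes out
depends on the cache hierarchy of the test box, so treat it as a
sanity check rather than a reproduction of the numbers above.

/*
 * Toy model of "the same counters are read twice and the first read
 * warms the cache".  Nothing below is kernel code.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NENTRIES        (50 * 1000)             /* ~3 MB, should fit in L3 */
#define EVICT_BYTES     (64 * 1024 * 1024)      /* should exceed the LLC   */

struct fake_stats {
        uint64_t bytes;
        uint64_t packets;
        char     pad[48];       /* one entry per 64-byte cache line */
};

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static uint64_t sum_pass(const struct fake_stats *s, size_t n)
{
        uint64_t sum = 0;

        for (size_t i = 0; i < n; i++)
                sum += s[i].bytes + s[i].packets;
        return sum;
}

int main(void)
{
        struct fake_stats *stats = malloc(NENTRIES * sizeof(*stats));
        char *evict = malloc(EVICT_BYTES);
        uint64_t t0, t1, t2, sum;

        if (!stats || !evict)
                return 1;

        for (size_t i = 0; i < NENTRIES; i++) { /* populate the counters */
                stats[i].bytes = i;
                stats[i].packets = i;
        }
        memset(evict, 1, EVICT_BYTES);  /* crude: push stats out of cache */

        t0 = now_ns();
        sum = sum_pass(stats, NENTRIES);        /* cold, like the 1st copy */
        t1 = now_ns();
        sum += sum_pass(stats, NENTRIES);       /* warm, like the 2nd copy */
        t2 = now_ns();

        printf("first  pass: %.2f ns/entry\n", (double)(t1 - t0) / NENTRIES);
        printf("second pass: %.2f ns/entry (checksum %llu)\n",
               (double)(t2 - t1) / NENTRIES, (unsigned long long)sum);
        free(evict);
        free(stats);
        return 0;
}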