On Wed, Jan 30, 2013 at 3:14 AM, Rajahalme, Jarno (NSN - FI/Espoo)
<jarno.rajaha...@nsn.com> wrote:
>
> On Jan 30, 2013, at 2:59 , ext Jesse Gross wrote:
>
>> On Tue, Jan 29, 2013 at 7:10 AM, Rajahalme, Jarno (NSN - FI/Espoo)
>> <jarno.rajaha...@nsn.com> wrote:
>>>
>>> On Jan 24, 2013, at 19:41 , ext Jesse Gross wrote:
>>>
>>>> On Thu, Jan 24, 2013 at 7:34 AM, Jarno Rajahalme
>>>> <jarno.rajaha...@nsn.com> wrote:
>>>>>
>>>>> On Jan 23, 2013, at 19:30 , ext Jesse Gross wrote:
>>>>>
>>>>>> On Tue, Jan 22, 2013 at 9:48 PM, Jarno Rajahalme
>>>>>> <jarno.rajaha...@nsn.com> wrote:
>>>>>>> Add OVS_PACKET_ATTR_KEY_INFO to relieve userspace from re-computing
>>>>>>> data already computed within the kernel datapath. In the typical
>>>>>>> case of an upcall with perfect key fitness between kernel and
>>>>>>> userspace this eliminates flow_extract() and flow_hash() calls in
>>>>>>> handle_miss_upcalls().
>>>>>>>
>>>>>>> Additional bookkeeping within the kernel datapath is minimal.
>>>>>>> Kernel flow insertion also saves one flow key hash computation.
>>>>>>>
>>>>>>> Removed setting the packet's l7 pointer for ICMP packets, as this
>>>>>>> was never used.
>>>>>>>
>>>>>>> Signed-off-by: Jarno Rajahalme <jarno.rajaha...@nsn.com>
>>>>>>> ---
>>>>>>>
>>>>>>> This likely requires some discussion, but it took a while for me to
>>>>>>> understand why each packet miss upcall would require flow_extract()
>>>>>>> right after the flow key has been obtained from odp attributes.
>>>>>>
>>>>>> Do you have any performance numbers to share? Since this is an
>>>>>> optimization it's important to understand if the benefit is worth the
>>>>>> extra complexity.
>>>>>
>>>>> Not yet, but would be happy to. Any hints toward the best way of
>>>>> obtaining meaningful numbers for something like this?
>>>>
>>>> This is a flow setup optimization, so usually something like netperf
>>>> TCP_CRR would be a good way to stress that.
>>>>
>>>> However, Ben mentioned to me that he had tried eliminating the
>>>> flow_extract() call from userspace in the past and the results were
>>>> disappointing.
>>>
>>> I made a simple test, where there is only one flow entry "in_port=LOCAL
>>> actions=drop", and only the local port is configured. One process sends
>>> UDP packets with different source/destination port combinations in a
>>> loop. OVS then tries to cope with the load. During the test both
>>> processes run near 100% CPU utilization in a virtual machine on a
>>> dual-core laptop.
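[Editor's note: a minimal generator along the lines Jarno describes might look
like the sketch below. It floods a bridge-local address with small UDP packets
whose destination port changes on every send, recreating the socket
periodically so the source port varies as well. The target address 10.0.0.1,
the payload, and the loop bounds are illustrative assumptions; the actual test
program is not shown in the thread, whose quoted results continue after the
sketch.]

/*
 * Editor's sketch, not the original test program: flood the OVS bridge's
 * local address with UDP packets whose destination port keeps changing,
 * so that (nearly) every packet misses the kernel flow table and causes
 * an upcall.  The source port is varied by re-creating the socket for
 * each pass over the port range.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in dst;
    const char payload[] = "flow-miss";
    long sent = 0;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr("10.0.0.1");    /* address on the OVS bridge (assumed) */

    while (sent < 10100000) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);  /* new socket => new source port */
        for (int dport = 1024; dport < 65536 && sent < 10100000; dport++) {
            dst.sin_port = htons(dport);
            sendto(sock, payload, sizeof payload, 0,
                   (struct sockaddr *) &dst, sizeof dst);
            sent++;
        }
        close(sock);
    }
    return 0;
}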
>>> On each round 10100000 packets were generated:
>>>
>>> OFPST_PORT reply (xid=0x2): 1 ports
>>>   port LOCAL: rx pkts=10100006, bytes=464600468, drop=0, errs=0, frame=0, over=0, crc=0
>>>               tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>>>
>>> With current master, 19.35% of packets on average get processed by the flow:
>>>
>>> Round 1:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=29.124s, table=0, n_packets=1959794, n_bytes=90150548, idle_age=4, in_port=LOCAL actions=drop
>>>
>>> Round 2:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=63.534s, table=0, n_packets=1932785, n_bytes=88908158, idle_age=37, in_port=LOCAL actions=drop
>>>
>>> Round 3:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=33.394s, table=0, n_packets=1972389, n_bytes=90729894, idle_age=8, in_port=LOCAL actions=drop
>>>
>>> With the proposed change, 20.2% of packets on average get processed by the flow:
>>>
>>> Round 4:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=31.96s, table=0, n_packets=2042759, n_bytes=93966914, idle_age=4, in_port=LOCAL actions=drop
>>>
>>> Round 5:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=38.6s, table=0, n_packets=2040224, n_bytes=93850372, idle_age=8, in_port=LOCAL actions=drop
>>>
>>> Round 6:
>>> NXST_FLOW reply (xid=0x4):
>>>  cookie=0x0, duration=35.661s, table=0, n_packets=2039595, n_bytes=93821418, idle_age=3, in_port=LOCAL actions=drop
>>>
>>> So there is a consistent benefit, but it is not very large. Seemingly
>>> flow_extract() and flow_hash() represent only a small portion of the
>>> OVS flow setup CPU use.
>>
>> Thanks for testing this out to get some concrete numbers.
>>
>> One thing that comes to mind is whether we actually use the layer
>> pointers in the packet all that often. It seems to me that in cases
>> where we are actually able to set up a kernel flow, userspace should be
>> able to do all of its work by only looking at that flow. The other
>> cases should be rare, like if userspace is directly consuming the
>> packet or if the fit is not perfect. If that's the case, could we get
>> a similar benefit without touching the userspace/kernel interface by
>> only initializing the pointers on demand?
>
> Otherwise there is not much use of the layer pointers, but
> dpif_flow_stats_extract() does call packet_get_tcp_flags(), which needs
> the layer pointers. dpif_flow_stats_extract() is called on each miss
> packet (but the collected tcp_flags are used only with facets, for
> NetFlow and facet->tcp_flags). The other uses are the controller action
> and special processing (stp/cfm/lacp).
>
> So it seems that deferred layer pointer setting could be effective only
> if the stats extraction were changed to skip tcp_flags in the cases
> where they are not going to be used (i.e. when not making facets, as
> might be expected under heavy load?).
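[Editor's note: the idea being discussed, computing the TCP flags only when
somebody actually consumes them, could look roughly like the sketch below. All
names here (miss_ctx, miss_tcp_flags) are invented for illustration; in OVS the
real entry points are dpif_flow_stats_extract() and packet_get_tcp_flags(), and
the layer pointers are normally set by flow_extract().]

/*
 * Illustrative-only sketch of lazily computed TCP flags: the parse runs
 * the first time the flags are actually consumed (facets, NetFlow,
 * NXAST_FIN_TIMEOUT), so a pure drop-under-load path never pays for it.
 * All names are hypothetical, not the real OVS symbols.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct miss_ctx {
    const uint8_t *l4;          /* start of the TCP header, if any */
    size_t l4_len;              /* bytes available at l4 */
    bool tcp_flags_valid;       /* set once the flags have been computed */
    uint8_t tcp_flags;
};

uint8_t miss_tcp_flags(struct miss_ctx *ctx)
{
    if (!ctx->tcp_flags_valid) {
        /* In real code this is where the layer pointers would be set up
         * and packet_get_tcp_flags() called; here we just read the flag
         * byte at offset 13 of the TCP header. */
        ctx->tcp_flags = (ctx->l4 && ctx->l4_len >= 14) ? ctx->l4[13] : 0;
        ctx->tcp_flags_valid = true;
    }
    return ctx->tcp_flags;
}

int main(void)
{
    uint8_t tcp_header[20] = { [13] = 0x12 };   /* SYN|ACK flag byte */
    struct miss_ctx ctx = { .l4 = tcp_header, .l4_len = sizeof tcp_header };

    return miss_tcp_flags(&ctx) == 0x12 ? 0 : 1;
}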
That's a good point about the TCP flags. I think it's actually not right
that we don't use the TCP flags when we don't create facets under load.
Not creating facets is an optimization, and that shouldn't affect the
results that are seen through NetFlow. I suspect that there are quite a
lot of cases where NetFlow isn't used (NXAST_FIN_TIMEOUT is the other
case that comes to mind that uses the flags), so it might be worthwhile
to turn off collection when those features aren't in use.

> Also, I noticed that the facets hmap uses the hash from flow_miss, and
> according to the comments that should be flow_hash(flow, 0). However,
> the only place the hash originates in the code is at handle_miss_upcalls(),
> so again, there is no harm in using a kernel-provided hash instead. But
> the comments need to be updated if this change is adopted.

It would probably be good to benchmark the results of these two changes
independently to see which is more significant.

In general, I'm nervous about putting optimizations into the
userspace/kernel interface because it exposes implementation
information. Keeping a neutral format can actually make it easier to
optimize in the future. In particular, I know that Ben is thinking about
grouping together flows at a coarser granularity than the kernel uses,
which would make passing the hash up not that useful.
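[Editor's note: the point about the facets hmap hash boils down to an
invariant: whatever hash value is used at insertion time must also be used at
lookup time, whether it was computed in userspace as flow_hash(flow, 0) or
passed up from the kernel. The self-contained toy table below illustrates that
invariant; it is not the OVS hmap implementation.]

/*
 * Toy hash table, illustrative only: the only requirement on the hash is
 * that insertion and lookup agree on it.  A userspace-computed hash and a
 * kernel-provided one are interchangeable as long as both paths use the
 * same value for the same flow.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define N_BUCKETS 256

struct entry {
    uint32_t hash;              /* full hash, kept for quick comparison */
    char key[32];
    struct entry *next;
};

static struct entry *buckets[N_BUCKETS];

static void insert(struct entry *e, uint32_t hash)
{
    e->hash = hash;
    e->next = buckets[hash % N_BUCKETS];
    buckets[hash % N_BUCKETS] = e;
}

static struct entry *lookup(const char *key, uint32_t hash)
{
    struct entry *e;

    for (e = buckets[hash % N_BUCKETS]; e; e = e->next) {
        if (e->hash == hash && !strcmp(e->key, key)) {
            return e;
        }
    }
    return NULL;
}

int main(void)
{
    struct entry e = { .hash = 0, .key = "udp,tp_dst=80" };
    uint32_t hash = 0x1234abcd;     /* could equally be a kernel-provided value */

    insert(&e, hash);
    printf("%s\n", lookup("udp,tp_dst=80", hash) ? "found" : "missed");
    return 0;
}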