On Jan 30, 2013, at 2:59 , ext Jesse Gross wrote:

> On Tue, Jan 29, 2013 at 7:10 AM, Rajahalme, Jarno (NSN - FI/Espoo)
> <jarno.rajaha...@nsn.com> wrote:
>>
>> On Jan 24, 2013, at 19:41 , ext Jesse Gross wrote:
>>
>>> On Thu, Jan 24, 2013 at 7:34 AM, Jarno Rajahalme
>>> <jarno.rajaha...@nsn.com> wrote:
>>>>
>>>> On Jan 23, 2013, at 19:30 , ext Jesse Gross wrote:
>>>>
>>>>> On Tue, Jan 22, 2013 at 9:48 PM, Jarno Rajahalme
>>>>> <jarno.rajaha...@nsn.com> wrote:
>>>>>> Add OVS_PACKET_ATTR_KEY_INFO to relieve userspace from re-computing
>>>>>> data already computed within the kernel datapath.  In the typical
>>>>>> case of an upcall with perfect key fitness between kernel and
>>>>>> userspace this eliminates the flow_extract() and flow_hash() calls
>>>>>> in handle_miss_upcalls().
>>>>>>
>>>>>> Additional bookkeeping within the kernel datapath is minimal.
>>>>>> Kernel flow insertion also saves one flow key hash computation.
>>>>>>
>>>>>> Removed setting the packet's l7 pointer for ICMP packets, as it was
>>>>>> never used.
>>>>>>
>>>>>> Signed-off-by: Jarno Rajahalme <jarno.rajaha...@nsn.com>
>>>>>> ---
>>>>>>
>>>>>> This likely requires some discussion, but it took a while for me to
>>>>>> understand why each packet miss upcall would require flow_extract()
>>>>>> right after the flow key has been obtained from the ODP attributes.
>>>>>
>>>>> Do you have any performance numbers to share?  Since this is an
>>>>> optimization it's important to understand if the benefit is worth
>>>>> the extra complexity.
>>>>
>>>> Not yet, but I would be happy to.  Any hints on the best way of
>>>> obtaining meaningful numbers for something like this?
>>>
>>> This is a flow setup optimization, so usually something like netperf
>>> TCP_CRR would be a good way to stress that.
>>>
>>> However, Ben mentioned to me that he had tried eliminating the
>>> flow_extract() call from userspace in the past and the results were
>>> disappointing.
>>
>> I made a simple test, where there is only one flow entry "in_port=LOCAL
>> actions=drop", and only the local port is configured.  One process
>> sends UDP packets with different source/destination port combinations
>> in a loop.  OVS then tries to cope with the load.  During the test
>> both processes run near 100% CPU utilization in a virtual machine on a
>> dual-core laptop.
>> On each round, 10100000 packets were generated:
>>
>> OFPST_PORT reply (xid=0x2): 1 ports
>>   port LOCAL: rx pkts=10100006, bytes=464600468, drop=0, errs=0, frame=0, over=0, crc=0
>>            tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>>
>> With current master, 19.35% of the packets on average get processed by
>> the flow:
>>
>> Round 1:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=29.124s, table=0, n_packets=1959794, n_bytes=90150548, idle_age=4, in_port=LOCAL actions=drop
>>
>> Round 2:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=63.534s, table=0, n_packets=1932785, n_bytes=88908158, idle_age=37, in_port=LOCAL actions=drop
>>
>> Round 3:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=33.394s, table=0, n_packets=1972389, n_bytes=90729894, idle_age=8, in_port=LOCAL actions=drop
>>
>> With the proposed change, 20.2% of the packets on average get processed
>> by the flow:
>>
>> Round 4:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=31.96s, table=0, n_packets=2042759, n_bytes=93966914, idle_age=4, in_port=LOCAL actions=drop
>>
>> Round 5:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=38.6s, table=0, n_packets=2040224, n_bytes=93850372, idle_age=8, in_port=LOCAL actions=drop
>>
>> Round 6:
>> NXST_FLOW reply (xid=0x4):
>>  cookie=0x0, duration=35.661s, table=0, n_packets=2039595, n_bytes=93821418, idle_age=3, in_port=LOCAL actions=drop
>>
>> So there is a consistent benefit, but it is not very large.  It seems
>> that flow_extract() and flow_hash() represent only a small portion of
>> the OVS flow setup CPU use.
>
> Thanks for testing this out to get some concrete numbers.
>
> One thing that comes to mind is whether we actually use the layer
> pointers in the packet all that often.  It seems to me that in cases
> where we are actually able to set up a kernel flow, userspace should be
> able to do all of its work by only looking at that flow.  The other
> cases should be rare, like if userspace is directly consuming the
> packet or if the fit is not perfect.  If that's the case, could we get
> a similar benefit without touching the userspace/kernel interface by
> only initializing the pointers on demand?
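
If I understand the suggestion correctly, on-demand initialization would
be something along the lines of the sketch below.  The names
pkt_layers_init() and pkt_l4() are made up for illustration and are not
existing OVS code; today flow_extract() fills in the layer pointers for
every miss packet regardless of whether anything reads them.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pkt_layers {
    bool valid;              /* True once the offsets below are computed. */
    size_t l3_ofs;           /* Offset of the IP header, or SIZE_MAX. */
    size_t l4_ofs;           /* Offset of the TCP/UDP header, or SIZE_MAX. */
};

/* One pass over the headers, done only when a caller actually needs a
 * layer pointer, instead of unconditionally in the upcall path.
 * (Minimal Ethernet + IPv4 parsing; VLAN and option handling omitted.) */
static void
pkt_layers_init(const uint8_t *data, size_t len, struct pkt_layers *l)
{
    l->l3_ofs = l->l4_ofs = SIZE_MAX;
    if (len >= 14) {
        l->l3_ofs = 14;                       /* Past the Ethernet header. */
        if (len >= 14 + 20 && (data[14] >> 4) == 4) {
            l->l4_ofs = 14 + (size_t) (data[14] & 0x0f) * 4;
        }
    }
    l->valid = true;
}

/* Returns a pointer to the L4 header, parsing the packet on first use. */
static const uint8_t *
pkt_l4(const uint8_t *data, size_t len, struct pkt_layers *l)
{
    if (!l->valid) {
        pkt_layers_init(data, len, l);
    }
    return l->l4_ofs < len ? data + l->l4_ofs : NULL;
}
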
Beyond the cases you mention, there is not much use of the layer
pointers, but dpif_flow_stats_extract() calls packet_get_tcp_flags(),
which needs them.  dpif_flow_stats_extract() is called on each miss
packet, yet the collected tcp_flags are used only with facets, for
netflow and facet->tcp_flags.  The other uses are the controller action
and special processing (stp/cfm/lacp).  So it seems that deferred layer
pointer setting could be effective only if the stats extraction were
changed to skip tcp_flags in the cases where they are not going to be
used (i.e., when not making facets, as might be expected under heavy
load?).

Also, I noticed that the facets hmap uses the hash from flow_miss, and
according to the comments that should be flow_hash(flow, 0).  However,
the only place the hash originates in the code is handle_miss_upcalls(),
so again, there is no harm in using a kernel-provided hash instead.  But
the comments need to be updated if this change is adopted.

  Jarno
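
To make the tcp_flags point above concrete: skipping the extraction when
no facet will be made could look roughly like the sketch below.  The
'want_tcp_flags' parameter is hypothetical and not the real
dpif_flow_stats_extract() interface; the point is only that when nothing
will consume the flags, the L4 pointer is never dereferenced, so with
deferred layer pointers the packet would not have to be parsed at all on
that path.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct miss_stats {
    uint64_t n_packets;
    uint64_t n_bytes;
    uint8_t  tcp_flags;        /* TCP flag bits seen, or 0. */
};

/* 'l4' points at the TCP header when the flow is TCP, otherwise NULL. */
static void
miss_stats_extract(const uint8_t *l4, size_t l4_len, size_t packet_len,
                   bool want_tcp_flags, struct miss_stats *stats)
{
    stats->n_packets = 1;
    stats->n_bytes = packet_len;
    stats->tcp_flags = 0;
    if (want_tcp_flags && l4 && l4_len >= 14) {
        stats->tcp_flags = l4[13];     /* The 8 TCP flag bits (FIN..CWR). */
    }
}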
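
For reference, in case someone wants to reproduce the numbers above: the
sender is essentially one process cycling through UDP source/destination
port combinations, along the lines of the sketch below.  This is a
reconstruction rather than the exact program; the address, payload and
port ranges are illustrative (picked so that 10100 x 1000 = 10100000
packets go out per round, matching the per-round totals above).

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    const char payload[] = "flow setup test";
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);  /* Bridge local IP. */

    /* Each new (source port, destination port) pair misses in the
     * kernel flow table, so nearly every packet causes an upcall. */
    for (int sport = 10000; sport < 20100; sport++) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in src;

        memset(&src, 0, sizeof src);
        src.sin_family = AF_INET;
        src.sin_port = htons(sport);
        bind(sock, (struct sockaddr *) &src, sizeof src);

        for (int dport = 20000; dport < 21000; dport++) {
            dst.sin_port = htons(dport);
            sendto(sock, payload, sizeof payload, 0,
                   (struct sockaddr *) &dst, sizeof dst);
        }
        close(sock);
    }
    return 0;
}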