On Jan 30, 2013, at 2:59 , ext Jesse Gross wrote:

> On Tue, Jan 29, 2013 at 7:10 AM, Rajahalme, Jarno (NSN - FI/Espoo)
> <jarno.rajaha...@nsn.com> wrote:
>> 
>> On Jan 24, 2013, at 19:41 , ext Jesse Gross wrote:
>> 
>>> On Thu, Jan 24, 2013 at 7:34 AM, Jarno Rajahalme
>>> <jarno.rajaha...@nsn.com> wrote:
>>>> 
>>>> On Jan 23, 2013, at 19:30 , ext Jesse Gross wrote:
>>>> 
>>>>> On Tue, Jan 22, 2013 at 9:48 PM, Jarno Rajahalme
>>>>> <jarno.rajaha...@nsn.com> wrote:
>>>>>> Add OVS_PACKET_ATTR_KEY_INFO to relieve userspace from re-computing
>>>>>> data already computed within the kernel datapath.  In the typical
>>>>>> case of an upcall with perfect key fitness between kernel and
>>>>>> userspace this eliminates flow_extract() and flow_hash() calls in
>>>>>> handle_miss_upcalls().
>>>>>> 
>>>>>> Additional bookkeeping within the kernel datapath is minimal.
>>>>>> Kernel flow insertion also saves one flow key hash computation.
>>>>>> 
>>>>>> Removed setting the packet's l7 pointer for ICMP packets, as this was
>>>>>> never used.
>>>>>> 
>>>>>> Signed-off-by: Jarno Rajahalme <jarno.rajaha...@nsn.com>
>>>>>> ---
>>>>>> 
>>>>>> This likely requires some discussion, but it took a while for me to
>>>>>> understand why each packet miss upcall would require flow_extract()
>>>>>> right after the flow key has been obtained from odp attributes.
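For context, here is a rough sketch of the kind of data such an attribute could
carry.  The struct name and fields are illustrative only, not taken from the
actual patch:

    #include <stdint.h>

    /* Illustrative only: the real OVS_PACKET_ATTR_KEY_INFO layout is
     * whatever the patch defines.  The idea is that the kernel has
     * already parsed the packet and hashed the flow key, so it can pass
     * those results up and let handle_miss_upcalls() skip flow_extract()
     * and flow_hash() when the key fit is perfect. */
    struct key_info_sketch {
        uint32_t flow_hash;   /* Hash the kernel computed over the flow key. */
        uint16_t l3_ofs;      /* Network header offset within the packet. */
        uint16_t l4_ofs;      /* Transport header offset. */
        uint8_t  tcp_flags;   /* TCP flags seen in this packet, if any. */
    };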
>>>>> 
>>>>> Do you have any performance numbers to share?  Since this is an
>>>>> optimization it's important to understand if the benefit is worth the
>>>>> extra complexity.
>>>> 
>>>> Not yet, but I would be happy to. Any hints toward the best way of
>>>> obtaining meaningful numbers for something like this?
>>> 
>>> This is a flow setup optimization, so usually something like netperf
>>> TCP_CRR would be a good way to stress that.
>>> 
>>> However, Ben mentioned to me that he had tried eliminating the
>>> flow_extract() call from userspace in the past and the results were
>>> disappointing.
>> 
>> I made a simple test where there is only one flow entry ("in_port=LOCAL 
>> actions=drop") and only the local port is configured. One process sends UDP 
>> packets with different source/destination port combinations in a loop, and 
>> OVS tries to cope with the load. During the test both processes run near 
>> 100% CPU utilization in a virtual machine on a dual-core laptop. On each 
>> round 10,100,000 packets were generated:
>> 
>> OFPST_PORT reply (xid=0x2): 1 ports
>>  port LOCAL: rx pkts=10100006, bytes=464600468, drop=0, errs=0, frame=0, 
>> over=0, crc=0
>>           tx pkts=0, bytes=0, drop=0, errs=0, coll=0
>> 
>> With current master 19.35% of packets on average get processed by the flow:
>> 
>> Round 1:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=29.124s, table=0, n_packets=1959794, n_bytes=90150548, 
>> idle_age=4, in_port=LOCAL actions=drop
>> 
>> Round 2:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=63.534s, table=0, n_packets=1932785, n_bytes=88908158, 
>> idle_age=37, in_port=LOCAL actions=drop
>> 
>> Round 3:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=33.394s, table=0, n_packets=1972389, n_bytes=90729894, 
>> idle_age=8, in_port=LOCAL actions=drop
>> 
>> 
>> With the proposed change 20.2% of packets on average get processed by the 
>> flow:
>> 
>> Round 4:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=31.96s, table=0, n_packets=2042759, n_bytes=93966914, 
>> idle_age=4, in_port=LOCAL actions=drop
>> 
>> Round 5:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=38.6s, table=0, n_packets=2040224, n_bytes=93850372, 
>> idle_age=8, in_port=LOCAL actions=drop
>> 
>> Round 6:
>> NXST_FLOW reply (xid=0x4):
>> cookie=0x0, duration=35.661s, table=0, n_packets=2039595, n_bytes=93821418, 
>> idle_age=3, in_port=LOCAL actions=drop
>> 
>> 
>> So there is a consistent benefit, but it is not very large. It seems that 
>> flow_extract() and flow_hash() represent only a small portion of the CPU 
>> cost of OVS flow setup.
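The load generator here does not need to be anything fancy.  A minimal sketch
of the idea follows; the addresses and loop bounds below are arbitrary, not the
exact parameters behind the numbers above, and error handling is omitted:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        const char payload[] = "x";    /* Tiny payload; size is not the point. */
        struct sockaddr_in dst;
        int sport, dport;

        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        /* Placeholder address: in the real setup, use an address that is
         * reached through the bridge's LOCAL port. */
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        /* Loop bounds are arbitrary; the actual test generated 10,100,000
         * packets per round. */
        for (sport = 10000; sport < 10100; sport++) {
            int sock = socket(AF_INET, SOCK_DGRAM, 0);
            struct sockaddr_in src;

            memset(&src, 0, sizeof src);
            src.sin_family = AF_INET;
            src.sin_port = htons(sport);
            bind(sock, (struct sockaddr *) &src, sizeof src);

            /* Every source/destination port pair is a new kernel flow,
             * so every packet triggers a miss upcall. */
            for (dport = 1; dport <= 50000; dport++) {
                dst.sin_port = htons(dport);
                sendto(sock, payload, sizeof payload, 0,
                       (struct sockaddr *) &dst, sizeof dst);
            }
            close(sock);
        }
        return 0;
    }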
> 
> Thanks for testing this out to get some concrete numbers.
> 
> One thing that comes to mind is whether we actually use the layer
> pointers in the packet all that often.  It seems to me that in cases
> where we are actually able to set up a kernel flow, userspace should be
> able to do all of its work by only looking at that flow.  The other
> cases should be rare, like if userspace is directly consuming the
> packet or if the fit is not perfect.  If that's the case, could we get
> a similar benefit without touching the userspace/kernel interface by
> only initializing the pointers on demand?
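As a strawman, the on-demand approach could look something like the following.
The struct and function names are made up for illustration; the real OVS code
is organized differently:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: layer pointers are filled in lazily, the first time a
     * caller (such as packet_get_tcp_flags()) actually needs them, instead
     * of unconditionally via flow_extract() for every miss packet. */
    struct miss_packet {
        uint8_t *data;          /* Raw packet bytes. */
        size_t len;
        bool layers_valid;      /* Layer pointers parsed yet? */
        uint8_t *l3;            /* Network header, valid if layers_valid. */
        uint8_t *l4;            /* Transport header, valid if layers_valid. */
    };

    static void
    ensure_layer_pointers(struct miss_packet *mp)
    {
        if (!mp->layers_valid) {
            /* Minimal parse for illustration: assumes untagged Ethernet +
             * IPv4 with no sanity checks.  Real code must handle VLANs,
             * IPv6, options, truncated packets, and so on. */
            mp->l3 = mp->data + 14;                    /* Skip Ethernet header. */
            mp->l4 = mp->l3 + (mp->l3[0] & 0x0f) * 4;  /* IHL is in 32-bit words. */
            mp->layers_valid = true;
        }
    }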

Otherwise the layer pointers are not used much, but dpif_flow_stats_extract()
calls packet_get_tcp_flags(), which needs them.  dpif_flow_stats_extract() is
called for each miss packet (though the collected tcp_flags are used only with
facets, for netflow and facet->tcp_flags).  The other uses are the controller
action and special processing (stp/cfm/lacp).

So it seems that deferred layer pointer setting could be effective only if the
stats extraction were changed to skip tcp_flags in the cases where they are not
going to be used (i.e., when not making facets, as might be expected under
heavy load?).
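Building on the illustrative sketch above, that short-circuit would be roughly
the following (hypothetical names again, and assuming plain IPv4/TCP):

    /* Hypothetical: return the TCP flags only when some facet/netflow
     * record will actually consume them; otherwise skip the layer-pointer
     * work entirely. */
    static uint8_t
    miss_tcp_flags(struct miss_packet *mp, bool will_make_facet)
    {
        if (!will_make_facet) {
            return 0;                /* No facet, no netflow: flags unused. */
        }
        ensure_layer_pointers(mp);
        if (mp->l3[9] != 6) {        /* IPv4 protocol field: 6 is TCP. */
            return 0;
        }
        return mp->l4[13];           /* TCP flags byte of the TCP header. */
    }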

Also, I noticed that the facets hmap uses the hash from flow_miss, which
according to the comments should be flow_hash(flow, 0).  However, the only
place that hash originates in the code is handle_miss_upcalls(), so again
there is no harm in using a kernel-provided hash instead.  The comments would
need to be updated if this change is adopted.

  Jarno
