On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
>     Keep kernel flow stats for each NUMA node rather than each (logical)
>     CPU.  This avoids using the per-CPU allocator, removes most of the
>     kernel-side OVS locking overhead otherwise at the top of perf reports,
>     and allows OVS to scale better with a higher number of threads.
>
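For anyone following the thread, the data-layout change is roughly the
following (a sketch only, not the patch itself; field names and comments
are illustrative):

    /* One stats instance per NUMA node instead of per logical CPU, so
     * an update only contends with other cores on the same node. */
    struct flow_stats {
            u64 packet_count;       /* Number of packets matched. */
            u64 byte_count;         /* Number of bytes matched. */
            unsigned long used;     /* Last used time (in jiffies). */
            spinlock_t lock;        /* Guards the fields above. */
    };

    struct sw_flow {
            /* ... key, mask, actions, etc. ... */
            struct flow_stats __rcu *stats[]; /* One pointer per NUMA
                                               * node; only stats[0] is
                                               * preallocated. */
    };
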
>     With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
>     rate doubles on a server with two hyper-threaded physical CPUs (16
>     logical cores each) compared to the current OVS master.  Tested with
>     a non-trivial flow table with a TCP port match rule forcing all new
>     connections with unique port numbers to OVS userspace.  The IP
>     addresses are still wildcarded, so the kernel flows are not considered
>     exact-match 5-tuple flows.  Flows of this type can be expected to
>     appear in large numbers as a result of the more effective wildcarding
>     made possible by improvements in the OVS userspace flow classifier.
>
>     Perf results for this test (master):
>
>     Events: 305K cycles
>     +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>     +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>     +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>     +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>     +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
>     +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>     +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
>     +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>     +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>     +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
>     +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
>     +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>     +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
>     ...
>
>     And after this patch:
>
>     Events: 356K cycles
>     +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>     +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>     +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>     +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>     +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>     +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>     +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
>     +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>     +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
>     +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
>     +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
>     +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
>     +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
>     +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
>     +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>     ...
>
>     There is a small increase in kernel spinlock overhead due to the same
>     spinlock being shared between multiple cores of the same physical CPU,
>     but that is barely visible in the netperf TCP_CRR test performance
>     (maybe a ~1% performance drop; hard to tell exactly due to variance in
>     the test results) when testing kernel module throughput (with no
>     userspace activity and only a handful of kernel flows).
>
>     On flow setup, a single stats instance is allocated (for NUMA node
>     0).  As CPUs from multiple NUMA nodes start updating stats, new
>     NUMA-node-specific stats instances are allocated.  This allocation on
>     the packet processing code path is made to never sleep or dip into
>     emergency memory pools, minimizing the allocation latency.  If the
>     allocation fails, the existing preallocated stats instance is used.
>     Also, if only CPUs from one NUMA node are updating the preallocated
>     stats instance, no additional stats instances are allocated.  This
>     eliminates the need to preallocate stats instances that will not be
>     used, and also relieves the stats reader of the burden of reading
>     stats that are never used.  Finally, this allocation strategy allows
>     the removal of the existing exact-5-tuple heuristics.
>
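A minimal sketch of that update path, under the assumptions above
(flow_stats_update() and flow_stats_cache are names I made up for
illustration; the actual patch also serializes per-node allocation
under the preallocated instance's lock, which is omitted here):

    static void flow_stats_update(struct sw_flow *flow, unsigned int len)
    {
            int node = numa_node_id();
            struct flow_stats *stats = rcu_dereference(flow->stats[node]);

            if (unlikely(!stats)) {
                    /* GFP_THISNODE keeps the allocation on this node and
                     * __GFP_NOMEMALLOC avoids the emergency reserves;
                     * neither flag allows sleeping. */
                    stats = kmem_cache_alloc_node(flow_stats_cache,
                                                  GFP_THISNODE |
                                                  __GFP_NOMEMALLOC, node);
                    if (stats) {
                            spin_lock_init(&stats->lock);
                            stats->packet_count = 0;
                            stats->byte_count = 0;
                            stats->used = jiffies;
                            rcu_assign_pointer(flow->stats[node], stats);
                    } else {
                            /* Allocation failed: fall back to the
                             * preallocated node-0 instance. */
                            stats = rcu_dereference(flow->stats[0]);
                    }
            }

            spin_lock(&stats->lock);
            stats->used = jiffies;
            stats->packet_count++;
            stats->byte_count += len;
            spin_unlock(&stats->lock);
    }
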
>     Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
Looks good.

Acked-by: Pravin B Shelar <pshe...@nicira.com>

Thanks.