> On Feb 6, 2014, at 6:36 PM, Jesse Gross <je...@nicira.com> wrote:
>
>> On Thu, Feb 6, 2014 at 4:09 PM, Pravin Shelar <pshe...@nicira.com> wrote:
>>> On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
>>> Keep kernel flow stats for each NUMA node rather than for each
>>> (logical) CPU.  This avoids using the per-CPU allocator and removes
>>> most of the kernel-side OVS locking overhead that otherwise sits at
>>> the top of perf reports, and allows OVS to scale better with a higher
>>> number of threads.
>>>
>>> With 9 handlers and 4 revalidators, the netperf TCP_CRR test flow
>>> setup rate doubles on a server with two hyper-threaded physical CPUs
>>> (16 logical cores each) compared to the current OVS master.  Tested
>>> with a non-trivial flow table with a TCP port match rule forcing all
>>> new connections with unique port numbers to OVS userspace.  The IP
>>> addresses are still wildcarded, so the kernel flows are not considered
>>> exact-match 5-tuple flows.  Flows of this type can be expected to
>>> appear in large numbers as the result of the more effective
>>> wildcarding made possible by improvements in the OVS userspace flow
>>> classifier.
>>>
>>> Perf results for this test (master):
>>>
>>> Events: 305K cycles
>>> + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
>>> + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
>>> + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
>>> + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
>>> + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
>>> + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
>>> + 2.03% swapper [kernel.kallsyms] [k] intel_idle
>>> + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
>>> + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
>>> + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
>>> + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
>>> + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
>>> + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
>>> ...
>>>
>>> And after this patch:
>>>
>>> Events: 356K cycles
>>> + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
>>> + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
>>> + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
>>> + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
>>> + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
>>> + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
>>> + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
>>> + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
>>> + 1.47% swapper [kernel.kallsyms] [k] intel_idle
>>> + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
>>> + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
>>> + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
>>> + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
>>> + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
>>> + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
>>> ...
>>>
>>> There is a small increase in kernel spinlock overhead due to the same
>>> spinlock being shared between multiple cores of the same physical
>>> CPU, but it is barely visible in netperf TCP_CRR test performance
>>> (maybe a ~1% drop, hard to tell exactly due to variance in the test
>>> results) when testing kernel module throughput (no userspace
>>> activity, a handful of kernel flows).
>>>
>>> On flow setup, a single stats instance is allocated (for NUMA node
>>> 0).  As CPUs from multiple NUMA nodes start updating stats, new
>>> NUMA-node-specific stats instances are allocated.  This allocation on
>>> the packet processing code path is made to never sleep or dip into
>>> emergency memory pools, minimizing the allocation latency.
>>> If the allocation fails, the existing preallocated stats instance is
>>> used.  Also, if only CPUs from one NUMA node are updating the
>>> preallocated stats instance, no additional stats instances are
>>> allocated.  This eliminates the need to preallocate stats instances
>>> that will not be used, and also relieves the stats reader of the
>>> burden of reading stats that are never used.  Finally, this
>>> allocation strategy allows the removal of the existing exact-5-tuple
>>> heuristics.
>>>
>>> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
>>
>> Looks good.
>>
>> Acked-by: Pravin B Shelar <pshe...@nicira.com>
>
> Jarno, would you mind giving me a chance to look at this again before
> you apply it?  I'll try to do that tomorrow.
Sure :-)

  Jarno

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev