On Thu, Feb 6, 2014 at 4:09 PM, Pravin Shelar <pshe...@nicira.com> wrote:
> On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
>> Keep kernel flow stats for each NUMA node rather than for each
>> (logical) CPU.  This avoids using the per-CPU allocator, removes most
>> of the kernel-side OVS locking overhead otherwise at the top of perf
>> reports, and allows OVS to scale better with a higher number of
>> threads.
>>
>> With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
>> rate doubles on a server with two hyper-threaded physical CPUs (16
>> logical cores each) compared to the current OVS master.  Tested with
>> a non-trivial flow table with a TCP port match rule forcing all new
>> connections with unique port numbers to OVS userspace.  The IP
>> addresses are still wildcarded, so the kernel flows are not
>> considered exact-match 5-tuple flows.  Flows of this type can be
>> expected to appear in large numbers as the result of more effective
>> wildcarding made possible by improvements in the OVS userspace flow
>> classifier.
>>
>> Perf results for this test (master):
>>
>> Events: 305K cycles
>> +   8.43%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>> +   5.64%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>> +   4.75%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>> +   3.32%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>> +   2.61%  ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
>> +   2.19%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>> +   2.03%  swapper       [kernel.kallsyms]   [k] intel_idle
>> +   1.84%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>> +   1.64%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>> +   1.58%  ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
>> +   1.07%  ovs-vswitchd  [kernel.kallsyms]   [k] memset
>> +   1.03%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
>> +   0.92%  swapper       [kernel.kallsyms]   [k] __ticket_spin_lock
>> ...
>>
>> And after this patch:
>>
>> Events: 356K cycles
>> +   6.85%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>> +   4.63%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>> +   3.06%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>> +   2.81%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>> +   2.51%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>> +   2.27%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>> +   1.84%  ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
>> +   1.74%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>> +   1.47%  swapper       [kernel.kallsyms]   [k] intel_idle
>> +   1.34%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
>> +   1.33%  ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
>> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
>> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
>> +   1.09%  ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
>> +   1.01%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
>> ...
>>
>> There is a small increase in kernel spinlock overhead due to the same
>> spinlock being shared between multiple cores of the same physical
>> CPU, but that is barely visible in netperf TCP_CRR performance when
>> testing for kernel module throughput (with no userspace activity and
>> only a handful of kernel flows): maybe a ~1% drop, hard to tell
>> exactly due to variance in the test results.
>>
>> On flow setup, a single stats instance is allocated (for NUMA node
>> 0).  As CPUs from multiple NUMA nodes start updating stats, new
>> NUMA-node-specific stats instances are allocated.  This allocation on
>> the packet processing code path is made to never sleep or dip into
>> emergency memory pools, minimizing the allocation latency.  If the
>> allocation fails, the existing preallocated stats instance is used.
>> Also, if only CPUs from one NUMA node are updating the preallocated
>> stats instance, no additional stats instances are allocated.
>> This eliminates the need to pre-allocate stats instances that will
>> not be used, also relieving the stats reader from the burden of
>> reading stats that are never used.  Finally, this allocation strategy
>> allows the removal of the existing exact-5-tuple heuristics.
>>
>> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
>
> Looks good.
>
> Acked-by: Pravin B Shelar <pshe...@nicira.com>
Jarno, would you mind giving me a chance to look at this again before
you apply it?  I'll try to do that tomorrow.

_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev