Keep kernel flow stats for each NUMA node rather than for each (logical) CPU.
This avoids using the per-CPU allocator, removes most of the kernel-side OVS
locking overhead that otherwise sits at the top of perf reports, and allows
OVS to scale better to a higher number of threads.
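In outline, the per-CPU stats pointer in struct sw_flow becomes a
variable-length array with one slot per possible NUMA node (excerpted from
the diff below):

	struct sw_flow {
		...
		struct flow_stats *stats[];	/* One for each NUMA node.  First one
						 * is allocated at flow creation time,
						 * the rest are allocated on demand
						 * while holding the 'stats[0].lock'. */
	};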
With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup rate
doubles on a server with two hyper-threaded physical CPUs (16 logical cores
each) compared to the current OVS master. Tested with a non-trivial flow
table containing a TCP port match rule that forces all new connections with
unique port numbers to OVS userspace. The IP addresses are still wildcarded,
so the kernel flows are not considered exact-match 5-tuple flows. Flows of
this type can be expected to appear in large numbers as a result of the more
effective wildcarding made possible by improvements in the OVS userspace
flow classifier.

Perf results for this test (master):

Events: 305K cycles
+   8.43%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   5.64%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   4.75%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   3.32%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   2.61%  ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
+   2.19%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.03%  swapper       [kernel.kallsyms]   [k] intel_idle
+   1.84%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   1.64%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.58%  ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
+   1.07%  ovs-vswitchd  [kernel.kallsyms]   [k] memset
+   1.03%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
+   0.92%  swapper       [kernel.kallsyms]   [k] __ticket_spin_lock
...

And after this patch:

Events: 356K cycles
+   6.85%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
+   4.63%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
+   3.06%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
+   2.81%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
+   2.51%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
+   2.27%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
+   1.84%  ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
+   1.74%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
+   1.47%  swapper       [kernel.kallsyms]   [k] intel_idle
+   1.34%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
+   1.33%  ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
+   1.16%  ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
+   1.16%  ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
+   1.09%  ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
+   1.01%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
...

There is a small increase in kernel spinlock overhead due to the same
spinlock being shared between multiple cores of the same physical CPU, but
this is barely visible in netperf TCP_CRR performance (maybe a ~1% drop,
hard to tell exactly due to variance in the test results) when testing
kernel module throughput (no userspace activity, only a handful of kernel
flows).

On flow setup, a single stats instance is allocated (on NUMA node 0). As
CPUs from other NUMA nodes start updating stats, node-specific stats
instances are allocated on demand. This allocation on the packet processing
code path never blocks and never dips into emergency memory pools,
minimizing allocation latency. If the allocation fails, the existing
preallocated stats instance is used instead. Likewise, if only CPUs from one
NUMA node update the preallocated stats instance, no additional stats
instances are allocated. This eliminates the need to preallocate stats
instances that will not be used, and relieves the stats reader from the
burden of reading stats that are never updated.
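To illustrate, the on-demand allocation on the packet path condenses to the
following (excerpted and simplified from ovs_flow_stats_update() in the diff
below; the locking of the preallocated stats[0] and the single-node build
are omitted here):

	/* GFP_THISNODE | __GFP_NOMEMALLOC: never block and never dip into
	 * emergency pools; fall back to the preallocated stats instead. */
	new_stats = kmem_cache_alloc_node(flow_stats_cache,
					  GFP_THISNODE | __GFP_NOMEMALLOC,
					  node);
	if (likely(new_stats)) {
		/* Initialize the node-local instance, then publish it. */
		spin_lock_init(&new_stats->lock);
		smp_wmb();	/* Init must be visible before the pointer. */
		flow->stats[node] = new_stats;
	}
	/* On failure, keep updating the preallocated flow->stats[0]. */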
Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
---
v6: Addressed Jesse's comments, and fixed locking:
 - Assume it is likely we will write to the preallocated stats.
 - Do not add cache line alignment to flow_cache, as it is mostly read.
 - Always block bottom halves (use spin_lock_bh()) as e.g. NODE 1 can keep
   writing on the preallocated (NODE 0) stats, which could otherwise lead
   to a deadlock.
 - Add a write barrier before storing the newly initialized stats pointer,
   as otherwise other CPUs could access the stats (lock) uninitialized.
 - Use NUMA_NO_NODE from /include/linux/numa.h instead of -1 directly.
 - Make the most critical new per-NUMA code compile-time conditional
   (on MAX_NUMNODES) to avoid any overhead on kernels built for a single
   node only.

v5: Addressed more comments by Pravin:
 - Removed prefetching as it may decrease performance when action
   execution takes more time (as with tunnels).
 - Disable preemption when using spin_lock() to avoid potential deadlocks.
 - Mark the flow stats cache as __read_mostly.

v4: Addressed comments by Pravin:
 - Do not use __GFP_NOTRACK.
 - Use spin_lock() instead of spin_lock_bh() when locking stats on
   remote nodes.
 - Be hotplug compatible by using num_possible_nodes() when allocating.
 - Remove unnecessary struct alignment attributes.

 datapath/flow.c       |  157 ++++++++++++++++++++++++++++---------------------
 datapath/flow.h       |    9 ++-
 datapath/flow_table.c |   42 +++++++++----
 datapath/flow_table.h |    2 +
 4 files changed, 132 insertions(+), 78 deletions(-)

diff --git a/datapath/flow.c b/datapath/flow.c
index f40cddb..6ed5250 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -66,9 +66,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
 {
 	struct flow_stats *stats;
 	__be16 tcp_flags = 0;
-
-	stats = this_cpu_ptr(flow->stats);
-
+#if MAX_NUMNODES > 1
+	int node;
+#endif
 	if ((flow->key.eth.type == htons(ETH_P_IP) ||
 	     flow->key.eth.type == htons(ETH_P_IPV6)) &&
 	    flow->key.ip.frag != OVS_FRAG_TYPE_LATER &&
@@ -77,88 +77,113 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
 		tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
 	}
 
+#if !(MAX_NUMNODES > 1)
+	stats = flow->stats[0];
 	spin_lock(&stats->lock);
+#else
+	node = numa_node_id();
+	stats = flow->stats[node];
+
+	/* Check if we already have node-specific stats. */
+	if (likely(stats))
+		spin_lock(&stats->lock);
+	else {
+		stats = flow->stats[0]; /* Pre-allocated. */
+		spin_lock(&stats->lock);
+
+		/* If the current NUMA node is the only writer on the
+		 * pre-allocated stats, keep using them.
+		 * A previous locker may have already allocated the stats,
+		 * so we need to check again.  If the node-specific stats
+		 * were already allocated, we update the pre-allocated stats
+		 * as we have already locked them. */
+		if (unlikely(stats->last_writer != node)
+		    && likely(stats->last_writer != NUMA_NO_NODE)
+		    && likely(!flow->stats[node])) {
+			/* Try to allocate node-specific stats. */
+			struct flow_stats *new_stats;
+
+			new_stats = kmem_cache_alloc_node(flow_stats_cache,
+							  GFP_THISNODE |
+							  __GFP_NOMEMALLOC,
+							  node);
+			if (likely(new_stats)) {
+				new_stats->used = jiffies;
+				new_stats->packet_count = 1;
+				new_stats->byte_count = skb->len;
+				new_stats->tcp_flags = tcp_flags;
+				new_stats->last_writer = node;
+				spin_lock_init(&new_stats->lock);
+				/* Barrier needed to prevent others from
+				 * accessing the lock uninitialized. */
+				smp_wmb();
+				flow->stats[node] = new_stats;
+				goto unlock; /* Unlock the pre-allocated
+					      * stats. */
+			}
+		}
+	}
+#endif
 	stats->used = jiffies;
 	stats->packet_count++;
 	stats->byte_count += skb->len;
 	stats->tcp_flags |= tcp_flags;
+	stats->last_writer = node;
+unlock:
 	spin_unlock(&stats->lock);
 }
 
-static void stats_read(struct flow_stats *stats, bool lock_bh,
-		       struct ovs_flow_stats *ovs_stats,
-		       unsigned long *used, __be16 *tcp_flags)
-{
-	if (lock_bh)
-		spin_lock_bh(&stats->lock);
-	else
-		spin_lock(&stats->lock);
-
-	if (time_after(stats->used, *used))
-		*used = stats->used;
-	*tcp_flags |= stats->tcp_flags;
-	ovs_stats->n_packets += stats->packet_count;
-	ovs_stats->n_bytes += stats->byte_count;
-
-	if (lock_bh)
-		spin_unlock_bh(&stats->lock);
-	else
-		spin_unlock(&stats->lock);
-}
-
 void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
-	int cpu, cur_cpu;
-
-	*used = 0;
-	*tcp_flags = 0;
-	memset(ovs_stats, 0, sizeof(*ovs_stats));
-
-	cur_cpu = get_cpu();
-
-	for_each_possible_cpu(cpu) {
-		struct flow_stats *stats;
-		bool lock_bh;
+#if MAX_NUMNODES > 1
+	int node;
+#endif
+	struct flow_stats *stats;
 
-		stats = per_cpu_ptr(flow->stats, cpu);
-		lock_bh = (cpu == cur_cpu);
-		stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
+	stats = flow->stats[0]; /* Start from the preallocated node. */
+	spin_lock_bh(&stats->lock);
+	*used = stats->used;
+	*tcp_flags = stats->tcp_flags;
+	ovs_stats->n_packets = stats->packet_count;
+	ovs_stats->n_bytes = stats->byte_count;
+	spin_unlock_bh(&stats->lock);
+
+#if MAX_NUMNODES > 1
+	/* Collect stats from other nodes. */
+	for_each_node(node) {
+		if (node == 0)
+			continue; /* Done already. */
+		stats = flow->stats[node];
+		if (stats) {
+			/* Local CPU may write on non-local stats, so we must
+			 * block bottom-halves here. */
+			spin_lock_bh(&stats->lock);
+			if (time_after(stats->used, *used))
+				*used = stats->used;
+			*tcp_flags |= stats->tcp_flags;
+			ovs_stats->n_packets += stats->packet_count;
+			ovs_stats->n_bytes += stats->byte_count;
+			spin_unlock_bh(&stats->lock);
+		}
 	}
-	put_cpu();
-}
-
-static void stats_reset(struct flow_stats *stats, bool lock_bh)
-{
-	if (lock_bh)
-		spin_lock_bh(&stats->lock);
-	else
-		spin_lock(&stats->lock);
-
-	stats->used = 0;
-	stats->packet_count = 0;
-	stats->byte_count = 0;
-	stats->tcp_flags = 0;
-
-	if (lock_bh)
-		spin_unlock_bh(&stats->lock);
-	else
-		spin_unlock(&stats->lock);
+#endif
 }
 
 void ovs_flow_stats_clear(struct sw_flow *flow)
 {
-	int cpu, cur_cpu;
-
-	cur_cpu = get_cpu();
-
-	for_each_possible_cpu(cpu) {
-		bool lock_bh;
+	int node;
+	struct flow_stats *stats;
 
-		lock_bh = (cpu == cur_cpu);
-		stats_reset(per_cpu_ptr(flow->stats, cpu), lock_bh);
+	for_each_node(node) {
+		stats = flow->stats[node];
+		if (stats) {
+			spin_lock_bh(&stats->lock);
+			stats->used = 0;
+			stats->packet_count = 0;
+			stats->byte_count = 0;
+			stats->tcp_flags = 0;
+			spin_unlock_bh(&stats->lock);
+		}
 	}
-	put_cpu();
 }
 
 static int check_header(struct sk_buff *skb, int len)
diff --git a/datapath/flow.h b/datapath/flow.h
index c4de0e6..f6cce35 100644
--- a/datapath/flow.h
+++ b/datapath/flow.h
@@ -155,6 +155,9 @@ struct flow_stats {
 	unsigned long used;		/* Last used time (in jiffies). */
 	spinlock_t lock;		/* Lock for atomic stats update. */
 	__be16 tcp_flags;		/* Union of seen TCP flags. */
+	int last_writer;		/* NUMA-node id of the last writer or
+					 * -1. Meaningful for 'stats[0]' only.
+					 */
 };
 
 struct sw_flow {
@@ -166,7 +169,11 @@ struct sw_flow {
 	struct sw_flow_key unmasked_key;
 	struct sw_flow_mask *mask;
 	struct sw_flow_actions __rcu *sf_acts;
-	struct flow_stats __percpu *stats;
+	struct flow_stats *stats[];	/* One for each NUMA node.  First one
+					 * is allocated at flow creation time,
+					 * the rest are allocated on demand
+					 * while holding the 'stats[0].lock'.
+					 */
 };
 
 struct arp_eth_header {
diff --git a/datapath/flow_table.c b/datapath/flow_table.c
index b7007ee..c04fb4d 100644
--- a/datapath/flow_table.c
+++ b/datapath/flow_table.c
@@ -50,6 +50,7 @@
 #define REHASH_INTERVAL		(10 * 60 * HZ)
 
 static struct kmem_cache *flow_cache;
+struct kmem_cache *flow_stats_cache __read_mostly;
 
 static u16 range_n_bytes(const struct sw_flow_key_range *range)
 {
@@ -77,7 +78,7 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 struct sw_flow *ovs_flow_alloc(void)
 {
 	struct sw_flow *flow;
-	int cpu;
+	int node;
 
 	flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
 	if (!flow)
@@ -86,16 +87,19 @@ struct sw_flow *ovs_flow_alloc(void)
 	flow->sf_acts = NULL;
 	flow->mask = NULL;
 
-	flow->stats = alloc_percpu(struct flow_stats);
-	if (!flow->stats)
+	/* Initialize the default stat node. */
+	flow->stats[0] = kmem_cache_alloc_node(flow_stats_cache,
+					       GFP_KERNEL | __GFP_ZERO, 0);
+	if (!flow->stats[0])
 		goto err;
 
-	for_each_possible_cpu(cpu) {
-		struct flow_stats *cpu_stats;
+	spin_lock_init(&flow->stats[0]->lock);
+	flow->stats[0]->last_writer = NUMA_NO_NODE;
+
+	for_each_node(node)
+		if (node != 0)
+			flow->stats[node] = NULL;
 
-		cpu_stats = per_cpu_ptr(flow->stats, cpu);
-		spin_lock_init(&cpu_stats->lock);
-	}
 	return flow;
 err:
 	kmem_cache_free(flow_cache, flow);
@@ -132,8 +136,12 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
 
 static void flow_free(struct sw_flow *flow)
 {
+	int node;
+
 	kfree((struct sf_flow_acts __force *)flow->sf_acts);
-	free_percpu(flow->stats);
+	for_each_node(node)
+		if (flow->stats[node])
+			kmem_cache_free(flow_stats_cache, flow->stats[node]);
 	kmem_cache_free(flow_cache, flow);
 }
 
@@ -595,16 +603,28 @@ int ovs_flow_init(void)
 	BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
 	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
 
-	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
-				       0, NULL);
+	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
+				       + (num_possible_nodes()
+					  * sizeof(struct flow_stats *)),
+				       0, 0, NULL);
 	if (flow_cache == NULL)
 		return -ENOMEM;
 
+	flow_stats_cache
+		= kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
+				    0, SLAB_HWCACHE_ALIGN, NULL);
+	if (flow_stats_cache == NULL) {
+		kmem_cache_destroy(flow_cache);
+		flow_cache = NULL;
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 /* Uninitializes the flow module. */
 void ovs_flow_exit(void)
 {
+	kmem_cache_destroy(flow_stats_cache);
 	kmem_cache_destroy(flow_cache);
 }
diff --git a/datapath/flow_table.h b/datapath/flow_table.h
index c26c59a..ca8a582 100644
--- a/datapath/flow_table.h
+++ b/datapath/flow_table.h
@@ -52,6 +52,8 @@ struct flow_table {
 	unsigned int count;
 };
 
+extern struct kmem_cache *flow_stats_cache;
+
 int ovs_flow_init(void);
 void ovs_flow_exit(void);
 
-- 
1.7.10.4