On Tue, Feb 11, 2014 at 4:07 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
> Keep kernel flow stats for each NUMA node rather than each (logical)
> CPU.  This avoids using the per-CPU allocator, removes most of the
> kernel-side OVS locking overhead otherwise at the top of perf reports,
> and allows OVS to scale better with a higher number of threads.
>
> With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
> rate doubles on a server with two hyper-threaded physical CPUs (16
> logical cores each) compared to the current OVS master.  Tested with a
> non-trivial flow table with a TCP port match rule forcing all new
> connections with unique port numbers to OVS userspace.  The IP
> addresses are still wildcarded, so the kernel flows are not considered
> exact-match 5-tuple flows.  This type of flow can be expected to
> appear in large numbers as the result of the more effective wildcarding
> made possible by improvements in the OVS userspace flow classifier.
>
> Perf results for this test (master):
>
> Events: 305K cycles
> +   8.43%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   5.64%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   4.75%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   3.32%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   2.61%  ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
> +   2.19%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.03%  swapper       [kernel.kallsyms]   [k] intel_idle
> +   1.84%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   1.64%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.58%  ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
> +   1.07%  ovs-vswitchd  [kernel.kallsyms]   [k] memset
> +   1.03%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
> +   0.92%  swapper       [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> And after this patch:
>
> Events: 356K cycles
> +   6.85%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   4.63%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   3.06%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   2.81%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.51%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   2.27%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.84%  ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
> +   1.74%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   1.47%  swapper       [kernel.kallsyms]   [k] intel_idle
> +   1.34%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
> +   1.33%  ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
> +   1.09%  ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
> +   1.01%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> There is a small increase in kernel spinlock overhead due to the same
> spinlock being shared between multiple cores of the same physical CPU,
> but that is barely visible in the netperf TCP_CRR test performance
> (maybe ~1% performance drop, hard to tell exactly due to variance in
> the test results) when testing kernel module throughput (with no
> userspace activity and only a handful of kernel flows).
>
> On flow setup, a single stats instance is allocated (for NUMA node
> 0).  As CPUs from multiple NUMA nodes start updating stats, new
> NUMA-node-specific stats instances are allocated.  This allocation on
> the packet processing code path is made to never block or look for
> emergency memory pools, minimizing the allocation latency.
> If the allocation fails, the existing preallocated stats instance is
> used.  Also, if only CPUs from one NUMA node are updating the
> preallocated stats instance, no additional stats instances are
> allocated.  This eliminates the need to pre-allocate stats instances
> that will not be used, and also relieves the stats reader from the
> burden of reading stats that are never used.
>
> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
>
> ---
> v6: Addressed Jesse's comments, and fixed locking:
>   - Assume it is likely we will write to the preallocated stats.
>   - Do not add cache line alignment to flow_cache, as it is mostly read.
>   - Always block bottom halves (use spin_lock_bh()), as e.g. NODE 1 can
>     keep writing on the preallocated (NODE 0) stats, which could
>     otherwise lead to a deadlock.
>   - Add a write barrier before storing the newly initialized stats
>     pointer, as otherwise other CPUs could access the stats (lock)
>     uninitialized.
>   - Use NUMA_NO_NODE from include/linux/numa.h instead of -1 directly.
>   - Make the most critical new per-NUMA code compile-time conditional
>     (on MAX_NUMNODES) to avoid any overhead on kernels built for a
>     single node only.
>
> v5: Addressed more comments by Pravin:
>   - Removed prefetching, as it may decrease performance when action
>     execution takes more time (as with tunnels).
>   - Disable preemption when using spin_lock() to avoid potential deadlocks.
>   - Mark the flow stats cache as __read_mostly.
>
> v4: Addressed comments by Pravin:
>   - Do not use __GFP_NOTRACK.
>   - Use spin_lock() instead of spin_lock_bh() when locking stats on
>     remote nodes.
>   - Be hotplug compatible by using num_possible_nodes() when allocating.
>   - Remove unnecessary struct alignment attributes.
>
>  datapath/flow.c       |  157 ++++++++++++++++++++++++++++---------------------
>  datapath/flow.h       |    9 ++-
>  datapath/flow_table.c |   42 +++++++++----
>  datapath/flow_table.h |    2 +
>  4 files changed, 132 insertions(+), 78 deletions(-)
>
> diff --git a/datapath/flow.c b/datapath/flow.c
> index f40cddb..6ed5250 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -66,9 +66,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
>  {
>          struct flow_stats *stats;
>          __be16 tcp_flags = 0;
> -
> -        stats = this_cpu_ptr(flow->stats);
> -
> +#if MAX_NUMNODES > 1
> +        int node;
> +#endif
>          if ((flow->key.eth.type == htons(ETH_P_IP) ||
>               flow->key.eth.type == htons(ETH_P_IPV6)) &&
>              flow->key.ip.frag != OVS_FRAG_TYPE_LATER &&
> @@ -77,88 +77,113 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
>                  tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
>          }
>
> +#if !(MAX_NUMNODES > 1)
> +        stats = flow->stats[0];
>          spin_lock(&stats->lock);
> +#else
> +        node = numa_node_id();
> +        stats = flow->stats[node];
> +        /* Check if already have node-specific stats. */
> +        if (likely(stats))
> +                spin_lock(&stats->lock);
> +        else {
> +                stats = flow->stats[0]; /* Pre-allocated. */
> +                spin_lock(&stats->lock);
> +
> +                /* If the current NUMA-node is the only writer on the
> +                 * pre-allocated stats keep using them.
> +                 * A previous locker may have already allocated the stats,
> +                 * so we need to check again.  If the node-specific stats
> +                 * were already allocated, we update the pre-allocated stats
> +                 * as we have already locked them. */
> +                if (unlikely(stats->last_writer != node)
> +                    && likely(stats->last_writer != NUMA_NO_NODE)
> +                    && likely(!flow->stats[node])) {
> +                        /* Try to allocate node-specific stats. */
> +                        struct flow_stats *new_stats;
> +
> +                        new_stats = kmem_cache_alloc_node(flow_stats_cache,
> +                                                          GFP_THISNODE |
> +                                                          __GFP_NOMEMALLOC,
> +                                                          node);
> +                        if (likely(new_stats)) {
> +                                new_stats->used = jiffies;
> +                                new_stats->packet_count = 1;
> +                                new_stats->byte_count = skb->len;
> +                                new_stats->tcp_flags = tcp_flags;
> +                                new_stats->last_writer = node;
> +                                spin_lock_init(&new_stats->lock);
> +                                /* Barrier needed to prevent others from
> +                                 * accessing the lock uninitialized. */
> +                                smp_wmb();
> +                                flow->stats[node] = new_stats;
> +                                goto unlock; /* Unlock the pre-allocated stats. */
> +                        }
> +                }
> +        }
> +#endif

How much overhead do we have for non-NUMA systems without this
compile-time check?  Since we always allocate the node-0 stats, I do
not think the stats update has that much overhead in that case.
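
To illustrate the question (a sketch only, I have not compiled this):
on a single-node kernel, numa_node_id() should be constant 0 and
flow->stats[0] is always preallocated at flow creation, so without the
#if the start of ovs_flow_stats_update() seems to reduce to one extra,
always-taken branch:

    int node = numa_node_id();      /* Constant 0 without CONFIG_NUMA. */
    struct flow_stats *stats = flow->stats[node];

    if (likely(stats)) {            /* stats[0] is preallocated, so this is
                                     * always true with a single node. */
            spin_lock(&stats->lock);
    } else {
            /* Not reached when there is only one node. */
    }

If that is the whole cost, the compile-time conditional may not be
worth maintaining the separate #if/#else variants.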

>          stats->used = jiffies;
>          stats->packet_count++;
>          stats->byte_count += skb->len;
>          stats->tcp_flags |= tcp_flags;
> +        stats->last_writer = node;
> +unlock:
>          spin_unlock(&stats->lock);
>  }
>
> -static void stats_read(struct flow_stats *stats, bool lock_bh,
> -                       struct ovs_flow_stats *ovs_stats,
> -                       unsigned long *used, __be16 *tcp_flags)
> -{
> -        if (lock_bh)
> -                spin_lock_bh(&stats->lock);
> -        else
> -                spin_lock(&stats->lock);
> -
> -        if (time_after(stats->used, *used))
> -                *used = stats->used;
> -        *tcp_flags |= stats->tcp_flags;
> -        ovs_stats->n_packets += stats->packet_count;
> -        ovs_stats->n_bytes += stats->byte_count;
> -
> -        if (lock_bh)
> -                spin_unlock_bh(&stats->lock);
> -        else
> -                spin_unlock(&stats->lock);
> -}
> -
>  void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
>                          unsigned long *used, __be16 *tcp_flags)
>  {
> -        int cpu, cur_cpu;
> -
> -        *used = 0;
> -        *tcp_flags = 0;
> -        memset(ovs_stats, 0, sizeof(*ovs_stats));
> -
> -        cur_cpu = get_cpu();
> -
> -        for_each_possible_cpu(cpu) {
> -                struct flow_stats *stats;
> -                bool lock_bh;
> +#if MAX_NUMNODES > 1
> +        int node;
> +#endif
> +        struct flow_stats *stats;
>
> -                stats = per_cpu_ptr(flow->stats, cpu);
> -                lock_bh = (cpu == cur_cpu);
> -                stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
> +        stats = flow->stats[0]; /* Start from the preallocated node. */
> +        spin_lock_bh(&stats->lock);
> +        *used = stats->used;
> +        *tcp_flags = stats->tcp_flags;
> +        ovs_stats->n_packets = stats->packet_count;
> +        ovs_stats->n_bytes = stats->byte_count;
> +        spin_unlock_bh(&stats->lock);
> +

Is there a reason for reading the node-0 stats outside of the
for_each_node() loop?
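
Something along these lines seems simpler to me (just a sketch, using
the fields from this patch; it zeroes the outputs first, like the old
per-CPU code did):

    int node;
    struct flow_stats *stats;

    *used = 0;
    *tcp_flags = 0;
    memset(ovs_stats, 0, sizeof(*ovs_stats));

    for_each_node(node) {
            stats = flow->stats[node];
            if (!stats)
                    continue;

            spin_lock_bh(&stats->lock);
            if (time_after(stats->used, *used))
                    *used = stats->used;
            *tcp_flags |= stats->tcp_flags;
            ovs_stats->n_packets += stats->packet_count;
            ovs_stats->n_bytes += stats->byte_count;
            spin_unlock_bh(&stats->lock);
    }

That way the MAX_NUMNODES == 1 case would not need a separate path in
the reader either.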

> +#if MAX_NUMNODES > 1
> +        /* Collect stats from other nodes. */
> +        for_each_node(node) {
> +                if (node == 0)
> +                        continue; /* Done already. */
> +                stats = flow->stats[node];

You need an smp_read_barrier_depends() barrier here, to pair with the
smp_wmb() on the writer side.  As Jesse suggested, we can use a
higher-level API instead, like rcu_dereference()/rcu_assign_pointer().
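
Roughly what that would look like (a sketch only; the stats[] pointers
would also need the __rcu annotation, and the right rcu_dereference()
flavor depends on the locking context of the caller):

    /* Writer side, in ovs_flow_stats_update(), replacing the explicit
     * smp_wmb() and plain pointer assignment: */
    rcu_assign_pointer(flow->stats[node], new_stats);

    /* Reader side, in ovs_flow_stats_get() and ovs_flow_stats_clear(): */
    struct flow_stats *stats = rcu_dereference(flow->stats[node]);

rcu_assign_pointer() provides the write barrier and rcu_dereference()
the dependent-read barrier, so both the explicit smp_wmb() and the
missing smp_read_barrier_depends() go away.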

> +                if (stats) {
> +                        /* Local CPU may write on non-local stats, so we must
> +                         * block bottom-halves here. */
> +                        spin_lock_bh(&stats->lock);
> +                        if (time_after(stats->used, *used))
> +                                *used = stats->used;
> +                        *tcp_flags |= stats->tcp_flags;
> +                        ovs_stats->n_packets += stats->packet_count;
> +                        ovs_stats->n_bytes += stats->byte_count;
> +                        spin_unlock_bh(&stats->lock);
> +                }
>          }
> -        put_cpu();
> -}
> -
> -static void stats_reset(struct flow_stats *stats, bool lock_bh)
> -{
> -        if (lock_bh)
> -                spin_lock_bh(&stats->lock);
> -        else
> -                spin_lock(&stats->lock);
> -
> -        stats->used = 0;
> -        stats->packet_count = 0;
> -        stats->byte_count = 0;
> -        stats->tcp_flags = 0;
> -
> -        if (lock_bh)
> -                spin_unlock_bh(&stats->lock);
> -        else
> -                spin_unlock(&stats->lock);
> +#endif
>  }
>
>  void ovs_flow_stats_clear(struct sw_flow *flow)
>  {
> -        int cpu, cur_cpu;
> -
> -        cur_cpu = get_cpu();
> -
> -        for_each_possible_cpu(cpu) {
> -                bool lock_bh;
> +        int node;
> +        struct flow_stats *stats;
>
> -                lock_bh = (cpu == cur_cpu);
> -                stats_reset(per_cpu_ptr(flow->stats, cpu), lock_bh);
> +        for_each_node(node) {
> +                stats = flow->stats[node];
> +                if (stats) {
> +                        spin_lock_bh(&stats->lock);
> +                        stats->used = 0;
> +                        stats->packet_count = 0;
> +                        stats->byte_count = 0;
> +                        stats->tcp_flags = 0;
> +                        spin_unlock_bh(&stats->lock);
> +                }
>          }
> -        put_cpu();
>  }
>
>  static int check_header(struct sk_buff *skb, int len)
> diff --git a/datapath/flow.h b/datapath/flow.h
> index c4de0e6..f6cce35 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -155,6 +155,9 @@ struct flow_stats {
>          unsigned long used;     /* Last used time (in jiffies). */
>          spinlock_t lock;        /* Lock for atomic stats update. */
>          __be16 tcp_flags;       /* Union of seen TCP flags. */
> +        int last_writer;        /* NUMA-node id of the last writer or
> +                                 * -1.  Meaningful for 'stats[0]' only.
> +                                 */
>  };
>

I did not understand your comment from the last thread about moving it
to struct sw_flow.  Since only flow->stats[0] is shared, why do we need
last_writer in each flow_stats?  We should be able to move it to
struct sw_flow.

>  struct sw_flow {
> @@ -166,7 +169,11 @@ struct sw_flow {
>          struct sw_flow_key unmasked_key;
>          struct sw_flow_mask *mask;
>          struct sw_flow_actions __rcu *sf_acts;
> -        struct flow_stats __percpu *stats;
> +        struct flow_stats *stats[];     /* One for each NUMA node.  First one
> +                                         * is allocated at flow creation time,
> +                                         * the rest are allocated on demand
> +                                         * while holding the 'stats[0].lock'.
> +                                         */
>  };
>
>  struct arp_eth_header {
> diff --git a/datapath/flow_table.c b/datapath/flow_table.c
> index b7007ee..c04fb4d 100644
> --- a/datapath/flow_table.c
> +++ b/datapath/flow_table.c
> @@ -50,6 +50,7 @@
>  #define REHASH_INTERVAL (10 * 60 * HZ)
>
>  static struct kmem_cache *flow_cache;
> +struct kmem_cache *flow_stats_cache __read_mostly;
>
>  static u16 range_n_bytes(const struct sw_flow_key_range *range)
>  {
> @@ -77,7 +78,7 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
>  struct sw_flow *ovs_flow_alloc(void)
>  {
>          struct sw_flow *flow;
> -        int cpu;
> +        int node;
>
>          flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
>          if (!flow)
> @@ -86,16 +87,19 @@ struct sw_flow *ovs_flow_alloc(void)
>          flow->sf_acts = NULL;
>          flow->mask = NULL;
>
> -        flow->stats = alloc_percpu(struct flow_stats);
> -        if (!flow->stats)
> +        /* Initialize the default stat node. */
> +        flow->stats[0] = kmem_cache_alloc_node(flow_stats_cache,
> +                                               GFP_KERNEL | __GFP_ZERO, 0);
> +        if (!flow->stats[0])
>                  goto err;
>
> -        for_each_possible_cpu(cpu) {
> -                struct flow_stats *cpu_stats;
> +        spin_lock_init(&flow->stats[0]->lock);
> +        flow->stats[0]->last_writer = NUMA_NO_NODE;
> +
> +        for_each_node(node)
> +                if (node != 0)
> +                        flow->stats[node] = NULL;
>
> -                cpu_stats = per_cpu_ptr(flow->stats, cpu);
> -                spin_lock_init(&cpu_stats->lock);
> -        }
>          return flow;
>  err:
>          kmem_cache_free(flow_cache, flow);
> @@ -132,8 +136,12 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
>
>  static void flow_free(struct sw_flow *flow)
>  {
> +        int node;
> +
>          kfree((struct sf_flow_acts __force *)flow->sf_acts);
> -        free_percpu(flow->stats);
> +        for_each_node(node)
> +                if (flow->stats[node])
> +                        kmem_cache_free(flow_stats_cache, flow->stats[node]);
>          kmem_cache_free(flow_cache, flow);
>  }
>
> @@ -595,16 +603,28 @@ int ovs_flow_init(void)
>          BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
>          BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
> -        flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
> -                                       0, NULL);
> +        flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> +                                       + (num_possible_nodes()
> +                                          * sizeof(struct flow_stats *)),
> +                                       0, 0, NULL);
>          if (flow_cache == NULL)
>                  return -ENOMEM;
>
> +        flow_stats_cache
> +                = kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
> +                                    0, SLAB_HWCACHE_ALIGN, NULL);
> +        if (flow_stats_cache == NULL) {
> +                kmem_cache_destroy(flow_cache);
> +                flow_cache = NULL;
> +                return -ENOMEM;
> +        }
> +
>          return 0;
>  }
>
>  /* Uninitializes the flow module. */
>  void ovs_flow_exit(void)
>  {
> +        kmem_cache_destroy(flow_stats_cache);
>          kmem_cache_destroy(flow_cache);
>  }
> diff --git a/datapath/flow_table.h b/datapath/flow_table.h
> index c26c59a..ca8a582 100644
> --- a/datapath/flow_table.h
> +++ b/datapath/flow_table.h
> @@ -52,6 +52,8 @@ struct flow_table {
>          unsigned int count;
>  };
>
> +extern struct kmem_cache *flow_stats_cache;
> +
>  int ovs_flow_init(void);
>  void ovs_flow_exit(void);
>
> --
> 1.7.10.4
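
One more note on the last_writer comment above: what I had in mind is
roughly the following (a sketch only, reusing the names from this
patch), so that there is a single last-writer slot per flow instead of
one per flow_stats:

    struct sw_flow {
            ...
            int last_writer;            /* NUMA node of the last CPU that
                                         * updated 'stats[0]', or NUMA_NO_NODE.
                                         * Protected by 'stats[0]->lock'. */
            struct flow_stats *stats[]; /* One for each NUMA node. */
    };

That keeps struct flow_stats free of a field that is only meaningful
for stats[0].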