On Tue, Feb 11, 2014 at 4:07 PM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
> Keep kernel flow stats for each NUMA node rather than each (logical)
> CPU.  This avoids using the per-CPU allocator, removes most of the
> kernel-side OVS locking overhead otherwise at the top of perf reports,
> and allows OVS to scale better with a higher number of threads.
>
> With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
> rate doubles on a server with two hyper-threaded physical CPUs (16
> logical cores each) compared to the current OVS master.  Tested with a
> non-trivial flow table with a TCP port match rule forcing all new
> connections with unique port numbers to OVS userspace.  The IP
> addresses are still wildcarded, so the kernel flows are not considered
> exact-match 5-tuple flows.  This type of flow can be expected to
> appear in large numbers as the result of the more effective wildcarding
> made possible by improvements in the OVS userspace flow classifier.
>
> Perf results for this test (master):
>
> Events: 305K cycles
> +   8.43%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   5.64%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   4.75%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   3.32%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   2.61%  ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
> +   2.19%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.03%  swapper       [kernel.kallsyms]   [k] intel_idle
> +   1.84%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   1.64%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.58%  ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
> +   1.07%  ovs-vswitchd  [kernel.kallsyms]   [k] memset
> +   1.03%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
> +   0.92%  swapper       [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> And after this patch:
>
> Events: 356K cycles
> +   6.85%  ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   4.63%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   3.06%  ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   2.81%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.51%  ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   2.27%  ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.84%  ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
> +   1.74%  ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   1.47%  swapper       [kernel.kallsyms]   [k] intel_idle
> +   1.34%  ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
> +   1.33%  ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
> +   1.16%  ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
> +   1.09%  ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
> +   1.01%  netperf       [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> There is a small increase in kernel spinlock overhead due to the same
> spinlock being shared between multiple cores of the same physical CPU,
> but that is barely visible in the netperf TCP_CRR test performance
> (maybe ~1% performance drop, hard to tell exactly due to variance in
> the test results) when testing kernel module throughput (with no
> userspace activity and only a handful of kernel flows).
>
> On flow setup, a single stats instance is allocated (for NUMA node
> 0).  As CPUs from multiple NUMA nodes start updating stats, new
> NUMA-node-specific stats instances are allocated.  This allocation on
> the packet processing code path is made to never block or look for
> emergency memory pools, minimizing the allocation latency.
> If the allocation fails, the existing preallocated stats instance is
> used.  Also, if only CPUs from one NUMA node are updating the
> preallocated stats instance, no additional stats instances are
> allocated.  This eliminates the need to pre-allocate stats instances
> that will not be used, and also relieves the stats reader from the
> burden of reading stats that are never used.
>
> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
>
> ---
> v6: Addressed Jesse's comments, and fixed locking:
>   - Assume it is likely we will write to the preallocated stats.
>   - Do not add cache line alignment to flow_cache, as it is mostly read.
>   - Always block bottom halves (use spin_lock_bh()), as e.g. NODE 1 can
>     keep writing on the preallocated (NODE 0) stats, which could
>     otherwise lead to a deadlock.
>   - Add a write barrier before storing the newly initialized stats
>     pointer, as otherwise other CPUs could access the stats (lock)
>     uninitialized.
>   - Use NUMA_NO_NODE from include/linux/numa.h instead of -1 directly.
>   - Make the most critical new per-NUMA code compile-time conditional
>     (on MAX_NUMNODES) to avoid any overhead on kernels built for a
>     single node only.
>
> v5: Addressed more comments by Pravin:
>   - Removed prefetching, as it may decrease performance when action
>     execution takes more time (as with tunnels).
>   - Disable preemption when using spin_lock() to avoid potential deadlocks.
>   - Mark the flow stats cache as __read_mostly.
>
> v4: Addressed comments by Pravin:
>   - Do not use __GFP_NOTRACK.
>   - Use spin_lock() instead of spin_lock_bh() when locking stats on
>     remote nodes.
>   - Be hotplug compatible by using num_possible_nodes() when allocating.
>   - Remove unnecessary struct alignment attributes.
>
>  datapath/flow.c       |  157 ++++++++++++++++++++++++++++---------------------
>  datapath/flow.h       |    9 ++-
>  datapath/flow_table.c |   42 +++++++++----
>  datapath/flow_table.h |    2 +
>  4 files changed, 132 insertions(+), 78 deletions(-)
>
> diff --git a/datapath/flow.c b/datapath/flow.c
> index f40cddb..6ed5250 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -66,9 +66,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
>  {
>          struct flow_stats *stats;
>          __be16 tcp_flags = 0;
> -
> -        stats = this_cpu_ptr(flow->stats);
> -
> +#if MAX_NUMNODES > 1
> +        int node;
> +#endif
>          if ((flow->key.eth.type == htons(ETH_P_IP) ||
>               flow->key.eth.type == htons(ETH_P_IPV6)) &&
>              flow->key.ip.frag != OVS_FRAG_TYPE_LATER &&
> @@ -77,88 +77,113 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
>                  tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
>          }
>
> +#if !(MAX_NUMNODES > 1)
> +        stats = flow->stats[0];
>          spin_lock(&stats->lock);
> +#else
> +        node = numa_node_id();
> +        stats = flow->stats[node];
> +        /* Check if already have node-specific stats. */
> +        if (likely(stats))
> +                spin_lock(&stats->lock);
> +        else {
> +                stats = flow->stats[0]; /* Pre-allocated. */
> +                spin_lock(&stats->lock);
> +
> +                /* If the current NUMA-node is the only writer on the
> +                 * pre-allocated stats keep using them.
> +                 * A previous locker may have already allocated the stats,
> +                 * so we need to check again.  If the node-specific stats
> +                 * were already allocated, we update the pre-allocated stats
> +                 * as we have already locked them. */
> +                if (unlikely(stats->last_writer != node)
> +                    && likely(stats->last_writer != NUMA_NO_NODE)
> +                    && likely(!flow->stats[node])) {
> +                        /* Try to allocate node-specific stats. */
> +                        struct flow_stats *new_stats;
> +
> +                        new_stats = kmem_cache_alloc_node(flow_stats_cache,
> +                                                          GFP_THISNODE |
> +                                                          __GFP_NOMEMALLOC,
> +                                                          node);
> +                        if (likely(new_stats)) {
> +                                new_stats->used = jiffies;
> +                                new_stats->packet_count = 1;
> +                                new_stats->byte_count = skb->len;
> +                                new_stats->tcp_flags = tcp_flags;
> +                                new_stats->last_writer = node;
> +                                spin_lock_init(&new_stats->lock);
> +                                /* Barrier needed to prevent others from
> +                                 * accessing the lock uninitialized. */
> +                                smp_wmb();
> +                                flow->stats[node] = new_stats;
> +                                goto unlock; /* Unlock the pre-allocated stats. */
> +                        }
> +                }
> +        }
> +#endif

How much overhead do we have for non-NUMA systems without this
compile-time check?  Since we always allocate the node-0 stats, I do
not think the stats update has that much overhead in that case.
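
To illustrate the question (a sketch only, I have not compiled this):
on a single-node kernel, numa_node_id() should be constant 0 and
flow->stats[0] is always preallocated at flow creation, so without the
#if the start of ovs_flow_stats_update() seems to reduce to one extra,
always-taken branch:

    int node = numa_node_id();      /* Constant 0 without CONFIG_NUMA. */
    struct flow_stats *stats = flow->stats[node];

    if (likely(stats)) {            /* stats[0] is preallocated, so this is
                                     * always true with a single node. */
            spin_lock(&stats->lock);
    } else {
            /* Not reached when there is only one node. */
    }

If that is the whole cost, the compile-time conditional may not be
worth maintaining the separate #if/#else variants.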

>          stats->used = jiffies;
>          stats->packet_count++;
>          stats->byte_count += skb->len;
>          stats->tcp_flags |= tcp_flags;
> +        stats->last_writer = node;
> +unlock:
>          spin_unlock(&stats->lock);
>  }
>
> -static void stats_read(struct flow_stats *stats, bool lock_bh,
> -                       struct ovs_flow_stats *ovs_stats,
> -                       unsigned long *used, __be16 *tcp_flags)
> -{
> -        if (lock_bh)
> -                spin_lock_bh(&stats->lock);
> -        else
> -                spin_lock(&stats->lock);
> -
> -        if (time_after(stats->used, *used))
> -                *used = stats->used;
> -        *tcp_flags |= stats->tcp_flags;
> -        ovs_stats->n_packets += stats->packet_count;
> -        ovs_stats->n_bytes += stats->byte_count;
> -
> -        if (lock_bh)
> -                spin_unlock_bh(&stats->lock);
> -        else
> -                spin_unlock(&stats->lock);
> -}
> -
>  void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
>                          unsigned long *used, __be16 *tcp_flags)
>  {
> -        int cpu, cur_cpu;
> -
> -        *used = 0;
> -        *tcp_flags = 0;
> -        memset(ovs_stats, 0, sizeof(*ovs_stats));
> -
> -        cur_cpu = get_cpu();
> -
> -        for_each_possible_cpu(cpu) {
> -                struct flow_stats *stats;
> -                bool lock_bh;
> +#if MAX_NUMNODES > 1
> +        int node;
> +#endif
> +        struct flow_stats *stats;
>
> -                stats = per_cpu_ptr(flow->stats, cpu);
> -                lock_bh = (cpu == cur_cpu);
> -                stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
> +        stats = flow->stats[0]; /* Start from the preallocated node. */
> +        spin_lock_bh(&stats->lock);
> +        *used = stats->used;
> +        *tcp_flags = stats->tcp_flags;
> +        ovs_stats->n_packets = stats->packet_count;
> +        ovs_stats->n_bytes = stats->byte_count;
> +        spin_unlock_bh(&stats->lock);
> +

Is there a reason for reading the node-0 stats outside of the
for_each_node() loop?
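
Something along these lines seems simpler to me (just a sketch, using
the fields from this patch; it zeroes the outputs first, like the old
per-CPU code did):

    int node;
    struct flow_stats *stats;

    *used = 0;
    *tcp_flags = 0;
    memset(ovs_stats, 0, sizeof(*ovs_stats));

    for_each_node(node) {
            stats = flow->stats[node];
            if (!stats)
                    continue;

            spin_lock_bh(&stats->lock);
            if (time_after(stats->used, *used))
                    *used = stats->used;
            *tcp_flags |= stats->tcp_flags;
            ovs_stats->n_packets += stats->packet_count;
            ovs_stats->n_bytes += stats->byte_count;
            spin_unlock_bh(&stats->lock);
    }

That way the MAX_NUMNODES == 1 case would not need a separate path in
the reader either.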

> +#if MAX_NUMNODES > 1
> +        /* Collect stats from other nodes. */
> +        for_each_node(node) {
> +                if (node == 0)
> +                        continue; /* Done already. */
> +                stats = flow->stats[node];

You need an smp_read_barrier_depends() barrier here, to pair with the
smp_wmb() on the writer side.  As Jesse suggested, we can use a
higher-level API instead, like rcu_dereference()/rcu_assign_pointer().
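
Roughly what that would look like (a sketch only; the stats[] pointers
would also need the __rcu annotation, and the right rcu_dereference()
flavor depends on the locking context of the caller):

    /* Writer side, in ovs_flow_stats_update(), replacing the explicit
     * smp_wmb() and plain pointer assignment: */
    rcu_assign_pointer(flow->stats[node], new_stats);

    /* Reader side, in ovs_flow_stats_get() and ovs_flow_stats_clear(): */
    struct flow_stats *stats = rcu_dereference(flow->stats[node]);

rcu_assign_pointer() provides the write barrier and rcu_dereference()
the dependent-read barrier, so both the explicit smp_wmb() and the
missing smp_read_barrier_depends() go away.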

> +                if (stats) {
> +                        /* Local CPU may write on non-local stats, so we must
> +                         * block bottom-halves here. */
> +                        spin_lock_bh(&stats->lock);
> +                        if (time_after(stats->used, *used))
> +                                *used = stats->used;
> +                        *tcp_flags |= stats->tcp_flags;
> +                        ovs_stats->n_packets += stats->packet_count;
> +                        ovs_stats->n_bytes += stats->byte_count;
> +                        spin_unlock_bh(&stats->lock);
> +                }
>          }
> -        put_cpu();
> -}
> -
> -static void stats_reset(struct flow_stats *stats, bool lock_bh)
> -{
> -        if (lock_bh)
> -                spin_lock_bh(&stats->lock);
> -        else
> -                spin_lock(&stats->lock);
> -
> -        stats->used = 0;
> -        stats->packet_count = 0;
> -        stats->byte_count = 0;
> -        stats->tcp_flags = 0;
> -
> -        if (lock_bh)
> -                spin_unlock_bh(&stats->lock);
> -        else
> -                spin_unlock(&stats->lock);
> +#endif
>  }
>
>  void ovs_flow_stats_clear(struct sw_flow *flow)
>  {
> -        int cpu, cur_cpu;
> -
> -        cur_cpu = get_cpu();
> -
> -        for_each_possible_cpu(cpu) {
> -                bool lock_bh;
> +        int node;
> +        struct flow_stats *stats;
>
> -                lock_bh = (cpu == cur_cpu);
> -                stats_reset(per_cpu_ptr(flow->stats, cpu), lock_bh);
> +        for_each_node(node) {
> +                stats = flow->stats[node];
> +                if (stats) {
> +                        spin_lock_bh(&stats->lock);
> +                        stats->used = 0;
> +                        stats->packet_count = 0;
> +                        stats->byte_count = 0;
> +                        stats->tcp_flags = 0;
> +                        spin_unlock_bh(&stats->lock);
> +                }
>          }
> -        put_cpu();
>  }
>
>  static int check_header(struct sk_buff *skb, int len)
> diff --git a/datapath/flow.h b/datapath/flow.h
> index c4de0e6..f6cce35 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -155,6 +155,9 @@ struct flow_stats {
>          unsigned long used;     /* Last used time (in jiffies). */
>          spinlock_t lock;        /* Lock for atomic stats update. */
>          __be16 tcp_flags;       /* Union of seen TCP flags. */
> +        int last_writer;        /* NUMA-node id of the last writer or
> +                                 * -1.  Meaningful for 'stats[0]' only.
> +                                 */
>  };
>

I did not understand your comment from the last thread about moving it
to struct sw_flow.  Since only flow->stats[0] is shared, why do we need
last_writer in each flow_stats?  We should be able to move it to
struct sw_flow.

>  struct sw_flow {
> @@ -166,7 +169,11 @@ struct sw_flow {
>          struct sw_flow_key unmasked_key;
>          struct sw_flow_mask *mask;
>          struct sw_flow_actions __rcu *sf_acts;
> -        struct flow_stats __percpu *stats;
> +        struct flow_stats *stats[];     /* One for each NUMA node.  First one
> +                                         * is allocated at flow creation time,
> +                                         * the rest are allocated on demand
> +                                         * while holding the 'stats[0].lock'.
> +                                         */
>  };
>
>  struct arp_eth_header {
> diff --git a/datapath/flow_table.c b/datapath/flow_table.c
> index b7007ee..c04fb4d 100644
> --- a/datapath/flow_table.c
> +++ b/datapath/flow_table.c
> @@ -50,6 +50,7 @@
>  #define REHASH_INTERVAL (10 * 60 * HZ)
>
>  static struct kmem_cache *flow_cache;
> +struct kmem_cache *flow_stats_cache __read_mostly;
>
>  static u16 range_n_bytes(const struct sw_flow_key_range *range)
>  {
> @@ -77,7 +78,7 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
>  struct sw_flow *ovs_flow_alloc(void)
>  {
>          struct sw_flow *flow;
> -        int cpu;
> +        int node;
>
>          flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
>          if (!flow)
> @@ -86,16 +87,19 @@ struct sw_flow *ovs_flow_alloc(void)
>          flow->sf_acts = NULL;
>          flow->mask = NULL;
>
> -        flow->stats = alloc_percpu(struct flow_stats);
> -        if (!flow->stats)
> +        /* Initialize the default stat node. */
> +        flow->stats[0] = kmem_cache_alloc_node(flow_stats_cache,
> +                                               GFP_KERNEL | __GFP_ZERO, 0);
> +        if (!flow->stats[0])
>                  goto err;
>
> -        for_each_possible_cpu(cpu) {
> -                struct flow_stats *cpu_stats;
> +        spin_lock_init(&flow->stats[0]->lock);
> +        flow->stats[0]->last_writer = NUMA_NO_NODE;
> +
> +        for_each_node(node)
> +                if (node != 0)
> +                        flow->stats[node] = NULL;
>
> -                cpu_stats = per_cpu_ptr(flow->stats, cpu);
> -                spin_lock_init(&cpu_stats->lock);
> -        }
>          return flow;
>  err:
>          kmem_cache_free(flow_cache, flow);
> @@ -132,8 +136,12 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
>
>  static void flow_free(struct sw_flow *flow)
>  {
> +        int node;
> +
>          kfree((struct sf_flow_acts __force *)flow->sf_acts);
> -        free_percpu(flow->stats);
> +        for_each_node(node)
> +                if (flow->stats[node])
> +                        kmem_cache_free(flow_stats_cache, flow->stats[node]);
>          kmem_cache_free(flow_cache, flow);
>  }
>
> @@ -595,16 +603,28 @@ int ovs_flow_init(void)
>          BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
>          BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
> -        flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
> -                                       0, NULL);
> +        flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> +                                       + (num_possible_nodes()
> +                                          * sizeof(struct flow_stats *)),
> +                                       0, 0, NULL);
>          if (flow_cache == NULL)
>                  return -ENOMEM;
>
> +        flow_stats_cache
> +                = kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
> +                                    0, SLAB_HWCACHE_ALIGN, NULL);
> +        if (flow_stats_cache == NULL) {
> +                kmem_cache_destroy(flow_cache);
> +                flow_cache = NULL;
> +                return -ENOMEM;
> +        }
> +
>          return 0;
>  }
>
>  /* Uninitializes the flow module. */
>  void ovs_flow_exit(void)
>  {
> +        kmem_cache_destroy(flow_stats_cache);
>          kmem_cache_destroy(flow_cache);
>  }
> diff --git a/datapath/flow_table.h b/datapath/flow_table.h
> index c26c59a..ca8a582 100644
> --- a/datapath/flow_table.h
> +++ b/datapath/flow_table.h
> @@ -52,6 +52,8 @@ struct flow_table {
>          unsigned int count;
>  };
>
> +extern struct kmem_cache *flow_stats_cache;
> +
>  int ovs_flow_init(void);
>  void ovs_flow_exit(void);
>
> --
> 1.7.10.4
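
One more note on the last_writer comment above: what I had in mind is
roughly the following (a sketch only, reusing the names from this
patch), so that there is a single last-writer slot per flow instead of
one per flow_stats:

    struct sw_flow {
            ...
            int last_writer;            /* NUMA node of the last CPU that
                                         * updated 'stats[0]', or NUMA_NO_NODE.
                                         * Protected by 'stats[0]->lock'. */
            struct flow_stats *stats[]; /* One for each NUMA node. */
    };

That keeps struct flow_stats free of a field that is only meaningful
for stats[0].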