On Tue, Feb 4, 2014 at 11:03 AM, Jarno Rajahalme <jrajaha...@nicira.com> wrote:
> Keep kernel flow stats for each NUMA node rather than for each
> (logical) CPU.  This avoids using the per-CPU allocator, removes most
> of the kernel-side OVS locking overhead that otherwise sits at the top
> of the perf report, and allows OVS to scale better to higher numbers
> of threads.
>
> With 9 handlers and 4 revalidators, the netperf TCP_CRR test flow
> setup rate doubles on a server with two hyper-threaded physical CPUs
> (16 logical cores each) compared to the current OVS master.  Tested
> with a non-trivial flow table with a TCP port match rule forcing all
> new connections with unique port numbers to OVS userspace.  The IP
> addresses are still wildcarded, so the kernel flows are not considered
> exact-match 5-tuple flows.  Flows of this type can be expected to
> appear in large numbers as a result of the more effective wildcarding
> made possible by improvements in the OVS userspace flow classifier.
>
> Perf results for this test (master):
>
> Events: 305K cycles
> +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
> +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
> +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
> +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
> +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> And after this patch:
>
> Events: 356K cycles
> +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
> +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
> +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
> +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
> +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
> +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
> +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
> +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
> +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
> +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
> +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
> +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
> +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
> +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
> +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
> ...
>
> There is a small increase in kernel spinlock overhead due to the same
> spinlock now being shared between multiple cores of the same physical
> CPU, but it is barely visible in netperf TCP_CRR performance when
> testing kernel module throughput alone (no userspace activity, a
> handful of kernel flows): maybe a ~1% drop, hard to tell exactly due
> to variance in the test results.
>
> The overhead shows up in the percentage of all CPU cycles that 'perf'
> attributes to ovs_flow_stats_update():
>
> per-CPU (master):       0.2 %
> per-NUMA (this patch):  0.7 %
> per-NUMA, no prefetch:  0.9 %
> shared:                 1.9 %
> shared, no prefetch:    2.0 %
>
> For comparison, I have included figures for the case where there is
> only one stats instance per flow ('shared').  In summary, accessing
> memory for shared stats is about 10 times more costly than for per-CPU
> stats.  Per-NUMA stats fall in between, being about 3 times more
> costly than per-CPU stats and one third the cost of shared stats.
> Prefetching makes a small but noticeable contribution to the
> performance of per-NUMA stats.
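
For reference, the prefetch amounts to one call on the packet ingress
path, issued before action execution so that the stats lock's cache
line is warm by the time ovs_flow_stats_update() takes it (see
get_skb_stats() in the patch below):

    spin_lock_prefetch(&flow->stats[node]->lock);
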
>
> On flow setup, a single stats instance is allocated (for NUMA node
> 0).  As CPUs from multiple NUMA nodes start updating stats, new
> NUMA-node-specific stats instances are allocated.  This allocation on
> the packet processing code path never sleeps and never dips into
> emergency memory pools, minimizing the allocation latency.  If the
> allocation fails, the existing preallocated stats instance is used
> instead.  Also, as long as only CPUs from one NUMA node are updating
> the preallocated stats instance, no additional stats instances are
> allocated.  This eliminates the need to pre-allocate stats instances
> that would never be used, and relieves the stats reader of the burden
> of scanning them.  Finally, this allocation strategy allows the
> removal of the existing exact-5-tuple heuristics.
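
Before diving into the diff, the fast path boils down to roughly the
following (a condensed sketch of ovs_flow_stats_update() and
alloc_stats() from the patch below, with the goto labels unrolled):

    void ovs_flow_stats_update(struct sw_flow *flow,
                               const struct flow_stats *stat)
    {
            int node = numa_node_id();
            struct flow_stats *stats = flow->stats[node];

            if (likely(stats)) {
                    spin_lock(&stats->lock);
            } else {
                    stats = flow->stats[0];  /* Preallocated at setup. */
                    spin_lock(&stats->lock);
                    /* Allocate a node-local instance only once a second
                     * NUMA node starts writing; never sleep and never
                     * touch emergency pools (GFP_THISNODE |
                     * __GFP_NOMEMALLOC).  On failure, fall back to
                     * updating stats[0] below. */
                    if (stats->last_writer != node &&
                        stats->last_writer >= 0 &&
                        likely(!flow->stats[node]) &&
                        alloc_stats(flow, stat, node)) {
                            spin_unlock(&stats->lock);
                            return;
                    }
            }
            stats->used = stat->used;
            stats->packet_count += stat->packet_count;
            stats->byte_count += stat->byte_count;
            stats->tcp_flags |= stat->tcp_flags;
            stats->last_writer = node;
            spin_unlock(&stats->lock);
    }
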
>
> Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com>
>
> Summary:
> ---
> v4: Addressed comments by Pravin:
> - Do not use __GFP_NOTRACK
> - Use spin_lock() instead of spin_lock_bh() when locking stats on remote nodes.
> - Be hotplug compatible by using num_possible_nodes() when allocating.
> - Remove unnecessary struct alignment attributes.
>
>  datapath/datapath.c     |   38 ++++++++--
>  datapath/flow.c         |  185 +++++++++++++++++++++++++----------------------
>  datapath/flow.h         |   21 +++---
>  datapath/flow_netlink.c |   58 +--------------
>  datapath/flow_netlink.h |    1 -
>  datapath/flow_table.c   |   57 ++++++++-------
>  datapath/flow_table.h   |    4 +-
>  7 files changed, 179 insertions(+), 185 deletions(-)
>
> diff --git a/datapath/datapath.c b/datapath/datapath.c
> index 5f1b34c..b7485c1 100644
> --- a/datapath/datapath.c
> +++ b/datapath/datapath.c
> @@ -214,6 +214,29 @@ void ovs_dp_detach_port(struct vport *p)
>         ovs_vport_del(p);
>  }
>
> +static void get_skb_stats(struct flow_stats *stat, const struct sw_flow *flow,
> +                         const struct sk_buff *skb)
> +{
> +       int node = numa_node_id();
> +
> +       /* Prepare for writing later. */
> +       if (unlikely(!flow->stats[node]))
> +               node = 0;
> +       spin_lock_prefetch(&flow->stats[node]->lock);
> +
> +       stat->used = jiffies;
> +       stat->packet_count = 1;
> +       stat->byte_count = skb->len;
> +       stat->tcp_flags = 0;
> +
> +       if ((flow->key.eth.type == htons(ETH_P_IP) ||
> +            flow->key.eth.type == htons(ETH_P_IPV6)) &&
> +           flow->key.ip.proto == IPPROTO_TCP &&
> +           likely(skb->len >= skb_transport_offset(skb) + sizeof(struct tcphdr))) {
> +               stat->tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
> +       }
Code indentation looks off here.

> +}
> +
>  /* Must be called with rcu_read_lock. */
>  void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
>  {
> @@ -221,6 +244,7 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
>         struct sw_flow *flow;
>         struct dp_stats_percpu *stats;
>         struct sw_flow_key key;
> +       struct flow_stats stat;
>         u64 *stats_counter;
>         u32 n_mask_hit;
>         int error;
> @@ -252,8 +276,9 @@ void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
>         OVS_CB(skb)->flow = flow;
>         OVS_CB(skb)->pkt_key = &key;
>
> -       ovs_flow_stats_update(OVS_CB(skb)->flow, skb);
> +       get_skb_stats(&stat, flow, skb);
>         ovs_execute_actions(dp, skb);
> +       ovs_flow_stats_update(flow, &stat);
>         stats_counter = &stats->n_hit;
>
>  out:
> @@ -522,7 +547,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
>                 packet->protocol = htons(ETH_P_802_2);
>
>         /* Build an sw_flow for sending this packet. */
> -       flow = ovs_flow_alloc(false);
> +       flow = ovs_flow_alloc();
>         err = PTR_ERR(flow);
>         if (IS_ERR(flow))
>                 goto err_kfree_skb;
> @@ -780,7 +805,6 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
>         struct datapath *dp;
>         struct sw_flow_actions *acts = NULL;
>         struct sw_flow_match match;
> -       bool exact_5tuple;
>         int error;
>
>         /* Extract key. */
> @@ -789,7 +813,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
>                 goto error;
>
>         ovs_match_init(&match, &key, &mask);
> -       error = ovs_nla_get_match(&match, &exact_5tuple,
> +       error = ovs_nla_get_match(&match,
>                                   a[OVS_FLOW_ATTR_KEY], a[OVS_FLOW_ATTR_MASK]);
>         if (error)
>                 goto error;
> @@ -828,7 +852,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
>                         goto err_unlock_ovs;
>
>                 /* Allocate flow. */
> -               flow = ovs_flow_alloc(!exact_5tuple);
> +               flow = ovs_flow_alloc();
>                 if (IS_ERR(flow)) {
>                         error = PTR_ERR(flow);
>                         goto err_unlock_ovs;
> @@ -912,7 +936,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
>         }
>
>         ovs_match_init(&match, &key, NULL);
> -       err = ovs_nla_get_match(&match, NULL, a[OVS_FLOW_ATTR_KEY], NULL);
> +       err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL);
>         if (err)
>                 return err;
>
> @@ -966,7 +990,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
>         }
>
>         ovs_match_init(&match, &key, NULL);
> -       err = ovs_nla_get_match(&match, NULL, a[OVS_FLOW_ATTR_KEY], NULL);
> +       err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL);
>         if (err)
>                 goto unlock;
>
> diff --git a/datapath/flow.c b/datapath/flow.c
> index 8be3801..7b9faab 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -47,6 +47,26 @@
>
>  #include "vlan.h"
>
> +static inline bool alloc_stats(struct sw_flow *flow,
> +                              const struct flow_stats *stat, int node)
> +{
> +       struct flow_stats *stats;
> +       /* Alloc stats for the 'node', but with minimal latency. */
> +       stats = kmem_cache_alloc_node(flow_stats_cache, GFP_THISNODE |
> +                                     __GFP_NOMEMALLOC, node);

Generally a blank line is kept after local variable definitions.

> +       if (likely(stats)) {
> +               stats->used = stat->used;
> +               stats->packet_count = stat->packet_count;
> +               stats->byte_count = stat->byte_count;
> +               stats->tcp_flags = stat->tcp_flags;
> +               spin_lock_init(&stats->lock);
> +               stats->last_writer = node;
> +
> +               flow->stats[node] = stats;
> +       }
> +       return stats != NULL;
> +}
> +
>  u64 ovs_flow_used_time(unsigned long flow_jiffies)
>  {
>         struct timespec cur_ts;
> @@ -60,114 +80,105 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
>         return cur_ms - idle_ms;
>  }
>
> -#define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF))
> -
> -void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
> +void ovs_flow_stats_update(struct sw_flow *flow, const struct flow_stats *stat)
>  {
> -       struct flow_stats *stats;
> -       __be16 tcp_flags = 0;
> -
> -       if (!flow->stats.is_percpu)
> -               stats = flow->stats.stat;
> -       else
> -               stats = this_cpu_ptr(flow->stats.cpu_stats);
> -
> -       if ((flow->key.eth.type == htons(ETH_P_IP) ||
> -            flow->key.eth.type == htons(ETH_P_IPV6)) &&
> -           flow->key.ip.proto == IPPROTO_TCP &&
> -           likely(skb->len >= skb_transport_offset(skb) + sizeof(struct tcphdr))) {
> -               tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
> -       }
> +       int node = numa_node_id();
> +       struct flow_stats *stats = flow->stats[node];
>
> -       spin_lock(&stats->lock);
> -       stats->used = jiffies;
> -       stats->packet_count++;
> -       stats->byte_count += skb->len;
> -       stats->tcp_flags |= tcp_flags;
> -       spin_unlock(&stats->lock);
> -}
> +       if (unlikely(!stats)) {
> +               bool done = false;
> +               stats = flow->stats[0]; /* Pre-allocated. */
>
> -static void stats_read(struct flow_stats *stats, bool lock_bh,
> -                      struct ovs_flow_stats *ovs_stats,
> -                      unsigned long *used, __be16 *tcp_flags)
> -{
> -       if (lock_bh)
> -               spin_lock_bh(&stats->lock);
> -       else
>                 spin_lock(&stats->lock);
> +               /* If the current NUMA-node is the only writer on the 'stats'
> +                * keep using it.
> +                * A previous locker may have already allocated the stats. */
> +               if (stats->last_writer != node && stats->last_writer >= 0
> +                   && likely(!flow->stats[node]))
> +                       done = alloc_stats(flow, stat, node);
> +               if (!done)
> +                       goto update;
> +               goto unlock;
> +       }
>
> -       if (time_after(stats->used, *used))
> -               *used = stats->used;
> -       *tcp_flags |= stats->tcp_flags;
> -       ovs_stats->n_packets += stats->packet_count;
> -       ovs_stats->n_bytes += stats->byte_count;
> -
> -       if (lock_bh)
> -               spin_unlock_bh(&stats->lock);
> -       else
> -               spin_unlock(&stats->lock);
> +       spin_lock(&stats->lock);
> +update:
> +       stats->used = stat->used;
> +       stats->packet_count += stat->packet_count;
> +       stats->byte_count += stat->byte_count;
> +       stats->tcp_flags |= stat->tcp_flags;
> +       stats->last_writer = node;
> +unlock:
> +       spin_unlock(&stats->lock);
>  }
>
>  void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats 
> *ovs_stats,
>                         unsigned long *used, __be16 *tcp_flags)
>  {
> -       int cpu, cur_cpu;
> -
> -       *used = 0;
> -       *tcp_flags = 0;
> -       memset(ovs_stats, 0, sizeof(*ovs_stats));
> +       int node, local_node = numa_node_id();
> +       struct flow_stats *stats;
>
> -       if (!flow->stats.is_percpu) {
> -               stats_read(flow->stats.stat, true, ovs_stats, used, tcp_flags);
> +       /* Start from the current node, if any, as it is more likely to be
> +        * in the cache by now.  Avoid locking when there are no stats. */
> +       stats = flow->stats[local_node];
> +       if (stats && likely(stats->packet_count)) {
> +               spin_lock_bh(&stats->lock);
> +               *used = stats->used;
> +               *tcp_flags = stats->tcp_flags;
> +               ovs_stats->n_packets = stats->packet_count;
> +               ovs_stats->n_bytes = stats->byte_count;
> +               spin_unlock_bh(&stats->lock);
>         } else {
> -               cur_cpu = get_cpu();
> -
> -               for_each_possible_cpu(cpu) {
> -                       struct flow_stats *stats;
> -                       bool lock_bh;
> +               *used = 0;
> +               *tcp_flags = 0;
> +               ovs_stats->n_packets = 0;
> +               ovs_stats->n_bytes = 0;
> +       }
>
> -                       stats = per_cpu_ptr(flow->stats.cpu_stats, cpu);
> -                       lock_bh = (cpu == cur_cpu);
> -                       stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
> +       /* Collect per NUMA-node stats from other nodes. */
> +       for_each_node(node) {
> +               if (node == local_node)
> +                       continue; /* Done already. */
> +               stats = flow->stats[node];
> +               if (stats && likely(stats->packet_count)) {
> +                       spin_lock(&stats->lock);
> +                       if (time_after(stats->used, *used))
> +                               *used = stats->used;
> +                       *tcp_flags |= stats->tcp_flags;
> +                       ovs_stats->n_packets += stats->packet_count;
> +                       ovs_stats->n_bytes += stats->byte_count;
> +                       spin_unlock(&stats->lock);
>                 }
> -               put_cpu();

This needs to disable preemption; otherwise the kernel can reschedule
this thread onto another NUMA node in the middle of the scan, and that
can cause a deadlock.
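
One way (a sketch, mirroring the get_cpu()/put_cpu() pairing that the
old per-CPU code used):

    int cur_cpu, local_node;

    cur_cpu = get_cpu();               /* Disables preemption. */
    local_node = cpu_to_node(cur_cpu);
    /* ... collect the per-node stats as above ... */
    put_cpu();                         /* Re-enables preemption. */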

>         }
>  }
>
> -static void stats_reset(struct flow_stats *stats, bool lock_bh)
> -{
> -       if (lock_bh)
> -               spin_lock_bh(&stats->lock);
> -       else
> -               spin_lock(&stats->lock);
> -
> -       stats->used = 0;
> -       stats->packet_count = 0;
> -       stats->byte_count = 0;
> -       stats->tcp_flags = 0;
> -
> -       if (lock_bh)
> -               spin_unlock_bh(&stats->lock);
> -       else
> -               spin_unlock(&stats->lock);
> -}
> -
>  void ovs_flow_stats_clear(struct sw_flow *flow)
>  {
> -       int cpu, cur_cpu;
> -
> -       if (!flow->stats.is_percpu) {
> -               stats_reset(flow->stats.stat, true);
> -       } else {
> -               cur_cpu = get_cpu();
> -
> -               for_each_possible_cpu(cpu) {
> -                       bool lock_bh;
> +       int node, local_node = numa_node_id();
> +       struct flow_stats *stats;
>
> -                       lock_bh = (cpu == cur_cpu);
> -                       stats_reset(per_cpu_ptr(flow->stats.cpu_stats, cpu), lock_bh);
> +       /* Start from the current node. */
> +       stats = flow->stats[local_node];
> +       if (stats) {
> +               spin_lock_bh(&stats->lock);
> +               stats->used = 0;
> +               stats->packet_count = 0;
> +               stats->byte_count = 0;
> +               stats->tcp_flags = 0;
> +               spin_unlock_bh(&stats->lock);
> +       }
> +       for_each_node(node) {
> +               if (node == local_node)
> +                       continue; /* Done already. */
> +               stats = flow->stats[node];
> +               if (stats) {
> +                       spin_lock(&stats->lock);
> +                       stats->used = 0;
> +                       stats->packet_count = 0;
> +                       stats->byte_count = 0;
> +                       stats->tcp_flags = 0;
> +                       spin_unlock(&stats->lock);
>                 }
> -               put_cpu();
>         }
Same as above: preemption needs to be disabled here as well.

>  }
>
> diff --git a/datapath/flow.h b/datapath/flow.h
> index eafcfd8..4a11914 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -67,6 +67,8 @@ static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
>                sizeof(*tun_key) - OVS_TUNNEL_KEY_SIZE);
>  }
>
> +#define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF))
> +
>  struct sw_flow_key {
>         struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
>         struct {
> @@ -155,14 +157,9 @@ struct flow_stats {
>         unsigned long used;             /* Last used time (in jiffies). */
>         spinlock_t lock;                /* Lock for atomic stats update. */
>         __be16 tcp_flags;               /* Union of seen TCP flags. */
> -};
> -
> -struct sw_flow_stats {
> -       bool is_percpu;
> -       union {
> -               struct flow_stats *stat;
> -               struct flow_stats __percpu *cpu_stats;
> -       };
> +       int last_writer;                /* NUMA-node id of the last writer or
> +                                        * -1. Meaningful for 'stats[0]' only.
> +                                        */
>  };
>
>  struct sw_flow {
> @@ -174,7 +171,11 @@ struct sw_flow {
>         struct sw_flow_key unmasked_key;
>         struct sw_flow_mask *mask;
>         struct sw_flow_actions __rcu *sf_acts;
> -       struct sw_flow_stats stats;
> +       struct flow_stats *stats[];     /* One for each NUMA node.  First one
> +                                        * is allocated at flow creation time,
> +                                        * the rest are allocated on demand
> +                                        * while holding the 'stats[0].lock'.
> +                                        */
>  };
>
>  struct arp_eth_header {
> @@ -191,7 +192,7 @@ struct arp_eth_header {
>         unsigned char       ar_tip[4];          /* target IP address        */
>  } __packed;
>
> -void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb);
> +void ovs_flow_stats_update(struct sw_flow *flow, const struct flow_stats *stat);
>  void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *stats,
>                         unsigned long *used, __be16 *tcp_flags);
>  void ovs_flow_stats_clear(struct sw_flow *flow);
> diff --git a/datapath/flow_netlink.c b/datapath/flow_netlink.c
> index 39fe4bf..3024a61 100644
> --- a/datapath/flow_netlink.c
> +++ b/datapath/flow_netlink.c
> @@ -268,20 +268,6 @@ static bool is_all_zero(const u8 *fp, size_t size)
>         return true;
>  }
>
> -static bool is_all_set(const u8 *fp, size_t size)
> -{
> -       int i;
> -
> -       if (!fp)
> -               return false;
> -
> -       for (i = 0; i < size; i++)
> -               if (fp[i] != 0xff)
> -                       return false;
> -
> -       return true;
> -}
> -
>  static int __parse_flow_nlattrs(const struct nlattr *attr,
>                                 const struct nlattr *a[],
>                                 u64 *attrsp, bool nz)
> @@ -503,9 +489,8 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
>         return 0;
>  }
>
> -static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple,
> -                               u64 attrs, const struct nlattr **a,
> -                               bool is_mask)
> +static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
> +                               const struct nlattr **a, bool is_mask)
>  {
>         int err;
>         u64 orig_attrs = attrs;
> @@ -562,11 +547,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple
>                 SW_FLOW_KEY_PUT(match, eth.type, htons(ETH_P_802_2), is_mask);
>         }
>
> -       if (is_mask && exact_5tuple) {
> -               if (match->mask->key.eth.type != htons(0xffff))
> -                       *exact_5tuple = false;
> -       }
> -
>         if (attrs & (1ULL << OVS_KEY_ATTR_IPV4)) {
>                 const struct ovs_key_ipv4 *ipv4_key;
>
> @@ -589,13 +569,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple
>                 SW_FLOW_KEY_PUT(match, ipv4.addr.dst,
>                                 ipv4_key->ipv4_dst, is_mask);
>                 attrs &= ~(1ULL << OVS_KEY_ATTR_IPV4);
> -
> -               if (is_mask && exact_5tuple && *exact_5tuple) {
> -                       if (ipv4_key->ipv4_proto != 0xff ||
> -                           ipv4_key->ipv4_src != htonl(0xffffffff) ||
> -                           ipv4_key->ipv4_dst != htonl(0xffffffff))
> -                               *exact_5tuple = false;
> -               }
>         }
>
>         if (attrs & (1ULL << OVS_KEY_ATTR_IPV6)) {
> @@ -627,15 +600,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple
>                                 is_mask);
>
>                 attrs &= ~(1ULL << OVS_KEY_ATTR_IPV6);
> -
> -               if (is_mask && exact_5tuple && *exact_5tuple) {
> -                       if (ipv6_key->ipv6_proto != 0xff ||
> -                           !is_all_set((const u8 *)ipv6_key->ipv6_src,
> -                                       sizeof(match->key->ipv6.addr.src)) ||
> -                           !is_all_set((const u8 *)ipv6_key->ipv6_dst,
> -                                       sizeof(match->key->ipv6.addr.dst)))
> -                               *exact_5tuple = false;
> -               }
>         }
>
>         if (attrs & (1ULL << OVS_KEY_ATTR_ARP)) {
> @@ -678,11 +642,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple
>                                         tcp_key->tcp_dst, is_mask);
>                 }
>                 attrs &= ~(1ULL << OVS_KEY_ATTR_TCP);
> -
> -               if (is_mask && exact_5tuple && *exact_5tuple &&
> -                   (tcp_key->tcp_src != htons(0xffff) ||
> -                    tcp_key->tcp_dst != htons(0xffff)))
> -                       *exact_5tuple = false;
>         }
>
>         if (attrs & (1ULL << OVS_KEY_ATTR_TCP_FLAGS)) {
> @@ -714,11 +673,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple
>                                         udp_key->udp_dst, is_mask);
>                 }
>                 attrs &= ~(1ULL << OVS_KEY_ATTR_UDP);
> -
> -               if (is_mask && exact_5tuple && *exact_5tuple &&
> -                   (udp_key->udp_src != htons(0xffff) ||
> -                    udp_key->udp_dst != htons(0xffff)))
> -                       *exact_5tuple = false;
>         }
>
>         if (attrs & (1ULL << OVS_KEY_ATTR_SCTP)) {
> @@ -804,7 +758,6 @@ static void sw_flow_mask_set(struct sw_flow_mask *mask,
>   * attribute specifies the mask field of the wildcarded flow.
>   */
>  int ovs_nla_get_match(struct sw_flow_match *match,
> -                     bool *exact_5tuple,
>                       const struct nlattr *key,
>                       const struct nlattr *mask)
>  {
> @@ -852,13 +805,10 @@ int ovs_nla_get_match(struct sw_flow_match *match,
>                 }
>         }
>
> -       err = ovs_key_from_nlattrs(match, NULL, key_attrs, a, false);
> +       err = ovs_key_from_nlattrs(match, key_attrs, a, false);
>         if (err)
>                 return err;
>
> -       if (exact_5tuple)
> -               *exact_5tuple = true;
> -
>         if (mask) {
>                 err = parse_flow_mask_nlattrs(mask, a, &mask_attrs);
>                 if (err)
> @@ -896,7 +846,7 @@ int ovs_nla_get_match(struct sw_flow_match *match,
>                         }
>                 }
>
> -               err = ovs_key_from_nlattrs(match, exact_5tuple, mask_attrs, a, true);
> +               err = ovs_key_from_nlattrs(match, mask_attrs, a, true);
>                 if (err)
>                         return err;
>         } else {
> diff --git a/datapath/flow_netlink.h b/datapath/flow_netlink.h
> index b31fbe2..4401510 100644
> --- a/datapath/flow_netlink.h
> +++ b/datapath/flow_netlink.h
> @@ -45,7 +45,6 @@ int ovs_nla_put_flow(const struct sw_flow_key *,
>  int ovs_nla_get_flow_metadata(struct sw_flow *flow,
>                               const struct nlattr *attr);
>  int ovs_nla_get_match(struct sw_flow_match *match,
> -                     bool *exact_5tuple,
>                       const struct nlattr *,
>                       const struct nlattr *);
>
> diff --git a/datapath/flow_table.c b/datapath/flow_table.c
> index 20854ac..bbc300b 100644
> --- a/datapath/flow_table.c
> +++ b/datapath/flow_table.c
> @@ -50,6 +50,7 @@
>  #define REHASH_INTERVAL                (10 * 60 * HZ)
>
>  static struct kmem_cache *flow_cache;
> +struct kmem_cache *flow_stats_cache;
>
Can you mark it __read_mostly?
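
That is, something like:

    struct kmem_cache *flow_stats_cache __read_mostly;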

>  static u16 range_n_bytes(const struct sw_flow_key_range *range)
>  {
> @@ -74,10 +75,10 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
>                 *d++ = *s++ & *m++;
>  }
>
> -struct sw_flow *ovs_flow_alloc(bool percpu_stats)
> +struct sw_flow *ovs_flow_alloc(void)
>  {
>         struct sw_flow *flow;
> -       int cpu;
> +       int node;
>
>         flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
>         if (!flow)
> @@ -86,26 +87,19 @@ struct sw_flow *ovs_flow_alloc(bool percpu_stats)
>         flow->sf_acts = NULL;
>         flow->mask = NULL;
>
> -       flow->stats.is_percpu = percpu_stats;
> +       /* Initialize the default stat node. */
> +       flow->stats[0] = kmem_cache_alloc_node(flow_stats_cache,
> +                                              GFP_KERNEL | __GFP_ZERO, 0);
> +       if (!flow->stats[0])
> +               goto err;
>
> -       if (!percpu_stats) {
> -               flow->stats.stat = kzalloc(sizeof(*flow->stats.stat), GFP_KERNEL);
> -               if (!flow->stats.stat)
> -                       goto err;
> +       spin_lock_init(&flow->stats[0]->lock);
> +       flow->stats[0]->last_writer = -1;
>
> -               spin_lock_init(&flow->stats.stat->lock);
> -       } else {
> -               flow->stats.cpu_stats = alloc_percpu(struct flow_stats);
> -               if (!flow->stats.cpu_stats)
> -                       goto err;
> -
> -               for_each_possible_cpu(cpu) {
> -                       struct flow_stats *cpu_stats;
> +       for_each_node(node)
> +               if (node > 0)
> +                       flow->stats[node] = NULL;
>
> -                       cpu_stats = per_cpu_ptr(flow->stats.cpu_stats, cpu);
> -                       spin_lock_init(&cpu_stats->lock);
> -               }
> -       }
>         return flow;
>  err:
>         kmem_cache_free(flow_cache, flow);
> @@ -142,11 +136,12 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
>
>  static void flow_free(struct sw_flow *flow)
>  {
> +       int node;
> +
>         kfree((struct sf_flow_acts __force *)flow->sf_acts);
> -       if (flow->stats.is_percpu)
> -               free_percpu(flow->stats.cpu_stats);
> -       else
> -               kfree(flow->stats.stat);
> +       for_each_node(node)
> +               if (flow->stats[node])
> +                       kmem_cache_free(flow_stats_cache, flow->stats[node]);
>         kmem_cache_free(flow_cache, flow);
>  }
>
> @@ -607,16 +602,28 @@ int ovs_flow_init(void)
>         BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
>         BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
> -       flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
> -                                       0, NULL);
> +       flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> +                                      + (num_possible_nodes()
> +                                         * sizeof(struct flow_stats *)),
> +                                      0, SLAB_HWCACHE_ALIGN, NULL);
>         if (flow_cache == NULL)
>                 return -ENOMEM;
>
> +       flow_stats_cache
> +               = kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
> +                                   0, SLAB_HWCACHE_ALIGN, NULL);
> +       if (flow_stats_cache == NULL) {
> +               kmem_cache_destroy(flow_cache);
> +               flow_cache = NULL;
> +               return -ENOMEM;
> +       }
> +
>         return 0;
>  }
>
>  /* Uninitializes the flow module. */
>  void ovs_flow_exit(void)
>  {
> +       kmem_cache_destroy(flow_stats_cache);
>         kmem_cache_destroy(flow_cache);
>  }
> diff --git a/datapath/flow_table.h b/datapath/flow_table.h
> index baaeb10..ca8a582 100644
> --- a/datapath/flow_table.h
> +++ b/datapath/flow_table.h
> @@ -52,10 +52,12 @@ struct flow_table {
>         unsigned int count;
>  };
>
> +extern struct kmem_cache *flow_stats_cache;
> +
>  int ovs_flow_init(void);
>  void ovs_flow_exit(void);
>
> -struct sw_flow *ovs_flow_alloc(bool percpu_stats);
> +struct sw_flow *ovs_flow_alloc(void);
>  void ovs_flow_free(struct sw_flow *, bool deferred);
>
>  int ovs_flow_tbl_init(struct flow_table *);
> --
> 1.7.10.4
>
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
