On Tue, Feb 4, 2014 at 11:03 AM, Jarno Rajahalme <jrajaha...@nicira.com> wrote: > Keep kernel flow stats for each NUMA node rather than each (logical) > CPU. This avoids using the per-CPU allocator and removes most of the > kernel-side OVS locking overhead otherwise on the top of perf report > and allows OVS to scale better with higher number of threads. > > With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup > rate doubles on a server with two hyper-threaded physical CPUs (16 > logical cores each) compared to the current OVS master. Tested with > non-trivial flow table with a TCP port match rule forcing all new > connections with unique port numbers to OVS userspace. The IP > addresses are still wildcarded, so the kernel flows are not considered > as exact match 5-tuple flows. This type of flows can be expected to > appear in large numbers as the result of more effective wildcarding > made possible by improvements in OVS userspace flow classifier. > > Perf results for this test (master): > > Events: 305K cycles > + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner > + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock > + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc > + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock > + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area > + 2.19% ovs-vswitchd ovs-vswitchd [.] > flow_hash_in_minimask_range > + 2.03% swapper [kernel.kallsyms] [k] intel_idle > + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock > + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup > + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6 > + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset > + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock > + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock > ... > > And after this patch: > > Events: 356K cycles > + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc > + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock > + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock > + 2.81% ovs-vswitchd ovs-vswitchd [.] > flow_hash_in_minimask_range > + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock > + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup > + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f > + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner > + 1.47% swapper [kernel.kallsyms] [k] intel_idle > + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask > + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref > + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash > + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions > + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref > + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock > ... > > There is a small increase in kernel spinlock overhead due to the same > spinlock being shared between multiple cores of the same physical CPU, > but that is barely visible in the netperf TCP_CRR test performance > (maybe ~1% performance drop, hard to tell exactly due to variance in the test > results), when testing for kernel module throughput (with no userspace > activity, handful of kernel flows). 
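Aside, to make the new layout easier to keep in mind while reading the hunks below: after this patch there is one stats instance, and hence one spinlock, per NUMA node instead of per CPU. Roughly, based on my reading of the flow.h hunk further down (the u64 counter types and field order are my guess from context, so treat this as an illustrative sketch rather than the authoritative definition):

struct flow_stats {
	u64 packet_count;	/* Number of packets matched. */
	u64 byte_count;		/* Number of bytes matched. */
	unsigned long used;	/* Last used time (in jiffies). */
	spinlock_t lock;	/* Lock for atomic stats update. */
	__be16 tcp_flags;	/* Union of seen TCP flags. */
	int last_writer;	/* NUMA-node id of the last writer, or -1.
				 * Meaningful for stats[0] only. */
};

struct sw_flow {
	...
	struct flow_stats *stats[];	/* One pointer per possible NUMA node.
					 * stats[0] is allocated at flow
					 * creation time, the others on demand
					 * while holding stats[0]->lock. */
};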
> > This overhead is visible in percentage of all CPU resources attributable to > ovs_flow_stats_update() from 'perf' results: > > per-CPU (master): 0.2 % > per-NUMA (this patch): 0.7 % > per-NUMA, no prefetch: 0.9 % > shared: 1.9 % > shared, no prefetch: 2 % > > I have included the figures for the case where there is simply only > one instance of stats for each flow ('shared') for comparison. In > summary, accessing memory for shared stats 10 times more costly than > per-CPU stats. Per-NUMA stats is in between the two, being about 3 > times more costly than per-CPU stats and one third of the cost of > shared stats. Prefetching makes a small but noticeable contribution > to the performance of per-NUMA stats. > > On flow setup, a single stats instance is allocated (for the NUMA node > 0). As CPUs from multiple NUMA nodes start updating stats, new > NUMA-node specific stats instances are allocated. This allocation on > the packet processing code path is made to never sleep or look for > emergency memory pools, minimizing the allocation latency. If the > allocation fails, the existing preallocated stats instance is used. > Also, if only CPUs from one NUMA-node are updating the preallocated > stats instance, no additional stats instances are allocated. This > eliminates the need to pre-allocate stats instances that will not be > used, also relieving the stats reader from the burden of reading stats > that are never used. Finally, this allocation strategy allows the > removal of the existing exact-5-tuple heuristics. > > Signed-off-by: Jarno Rajahalme <jrajaha...@nicira.com> > > Summary: > --- > v4: Addressed comments by Pravin: > - Do not use __GFP_NOTRACK > - Use spin_lock() instead of spin_lock_bh() when locking stats on remote > nodes. > - Be hotplug compatible by using num_possible_nodes() when allocating. > - Remove unnecessary struct alignment attributes. > > datapath/datapath.c | 38 ++++++++-- > datapath/flow.c | 185 > +++++++++++++++++++++++++---------------------- > datapath/flow.h | 21 +++--- > datapath/flow_netlink.c | 58 +-------------- > datapath/flow_netlink.h | 1 - > datapath/flow_table.c | 57 ++++++++------- > datapath/flow_table.h | 4 +- > 7 files changed, 179 insertions(+), 185 deletions(-) > > diff --git a/datapath/datapath.c b/datapath/datapath.c > index 5f1b34c..b7485c1 100644 > --- a/datapath/datapath.c > +++ b/datapath/datapath.c > @@ -214,6 +214,29 @@ void ovs_dp_detach_port(struct vport *p) > ovs_vport_del(p); > } > > +static void get_skb_stats(struct flow_stats *stat, const struct sw_flow > *flow, > + const struct sk_buff *skb) > +{ > + int node = numa_node_id(); > + > + /* Prepare for writing later. */ > + if (unlikely(!flow->stats[node])) > + node = 0; > + spin_lock_prefetch(&flow->stats[node]->lock); > + > + stat->used = jiffies; > + stat->packet_count = 1; > + stat->byte_count = skb->len; > + stat->tcp_flags = 0; > + > + if ((flow->key.eth.type == htons(ETH_P_IP) || > + flow->key.eth.type == htons(ETH_P_IPV6)) && > + flow->key.ip.proto == IPPROTO_TCP && > + likely(skb->len >= skb_transport_offset(skb) + sizeof(struct > tcphdr))) { > + stat->tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb)); > + } code indentation.
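To spell that out (the mailer has mangled the whitespace here anyway), I would expect the body of the new helper to be laid out along these lines; this is whitespace-only relative to the hunk above and untested:

static void get_skb_stats(struct flow_stats *stat, const struct sw_flow *flow,
			  const struct sk_buff *skb)
{
	int node = numa_node_id();

	/* Prepare for writing later. */
	if (unlikely(!flow->stats[node]))
		node = 0;
	spin_lock_prefetch(&flow->stats[node]->lock);

	stat->used = jiffies;
	stat->packet_count = 1;
	stat->byte_count = skb->len;
	stat->tcp_flags = 0;

	if ((flow->key.eth.type == htons(ETH_P_IP) ||
	     flow->key.eth.type == htons(ETH_P_IPV6)) &&
	    flow->key.ip.proto == IPPROTO_TCP &&
	    likely(skb->len >= skb_transport_offset(skb) + sizeof(struct tcphdr))) {
		stat->tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
	}
}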
> +} > + > /* Must be called with rcu_read_lock. */ > void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb) > { > @@ -221,6 +244,7 @@ void ovs_dp_process_received_packet(struct vport *p, > struct sk_buff *skb) > struct sw_flow *flow; > struct dp_stats_percpu *stats; > struct sw_flow_key key; > + struct flow_stats stat; > u64 *stats_counter; > u32 n_mask_hit; > int error; > @@ -252,8 +276,9 @@ void ovs_dp_process_received_packet(struct vport *p, > struct sk_buff *skb) > OVS_CB(skb)->flow = flow; > OVS_CB(skb)->pkt_key = &key; > > - ovs_flow_stats_update(OVS_CB(skb)->flow, skb); > + get_skb_stats(&stat, flow, skb); > ovs_execute_actions(dp, skb); > + ovs_flow_stats_update(flow, &stat); > stats_counter = &stats->n_hit; > > out: > @@ -522,7 +547,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, > struct genl_info *info) > packet->protocol = htons(ETH_P_802_2); > > /* Build an sw_flow for sending this packet. */ > - flow = ovs_flow_alloc(false); > + flow = ovs_flow_alloc(); > err = PTR_ERR(flow); > if (IS_ERR(flow)) > goto err_kfree_skb; > @@ -780,7 +805,6 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, > struct genl_info *info) > struct datapath *dp; > struct sw_flow_actions *acts = NULL; > struct sw_flow_match match; > - bool exact_5tuple; > int error; > > /* Extract key. */ > @@ -789,7 +813,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, > struct genl_info *info) > goto error; > > ovs_match_init(&match, &key, &mask); > - error = ovs_nla_get_match(&match, &exact_5tuple, > + error = ovs_nla_get_match(&match, > a[OVS_FLOW_ATTR_KEY], > a[OVS_FLOW_ATTR_MASK]); > if (error) > goto error; > @@ -828,7 +852,7 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, > struct genl_info *info) > goto err_unlock_ovs; > > /* Allocate flow. */ > - flow = ovs_flow_alloc(!exact_5tuple); > + flow = ovs_flow_alloc(); > if (IS_ERR(flow)) { > error = PTR_ERR(flow); > goto err_unlock_ovs; > @@ -912,7 +936,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct > genl_info *info) > } > > ovs_match_init(&match, &key, NULL); > - err = ovs_nla_get_match(&match, NULL, a[OVS_FLOW_ATTR_KEY], NULL); > + err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL); > if (err) > return err; > > @@ -966,7 +990,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct > genl_info *info) > } > > ovs_match_init(&match, &key, NULL); > - err = ovs_nla_get_match(&match, NULL, a[OVS_FLOW_ATTR_KEY], NULL); > + err = ovs_nla_get_match(&match, a[OVS_FLOW_ATTR_KEY], NULL); > if (err) > goto unlock; > > diff --git a/datapath/flow.c b/datapath/flow.c > index 8be3801..7b9faab 100644 > --- a/datapath/flow.c > +++ b/datapath/flow.c > @@ -47,6 +47,26 @@ > > #include "vlan.h" > > +static inline bool alloc_stats(struct sw_flow *flow, > + const struct flow_stats *stat, int node) > +{ > + struct flow_stats *stats; > + /* Alloc stats for the 'node', but with minimal latency. */ > + stats = kmem_cache_alloc_node(flow_stats_cache, GFP_THISNODE | > + __GFP_NOMEMALLOC, node); Generally a line is kept blank after local variable definition. 
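One more note on the ovs_dp_process_received_packet() change above: the per-packet stats contribution is now computed from the skb before the actions run and only folded into the flow afterwards. My reading (please correct me if the intent is different) is that this captures the skb-derived fields before ovs_execute_actions() can modify or consume the skb, and gives the spin_lock_prefetch() in get_skb_stats() time to pay off before the lock is actually taken. Condensed, as my annotation rather than part of the patch:

	struct flow_stats stat;

	get_skb_stats(&stat, flow, skb);	/* Snapshot skb-derived stats and
						 * prefetch the stats spinlock. */
	ovs_execute_actions(dp, skb);		/* Actions may modify or consume
						 * the skb. */
	ovs_flow_stats_update(flow, &stat);	/* Fold the snapshot into the
						 * per-NUMA-node stats. */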
> + if (likely(stats)) { > + stats->used = stat->used; > + stats->packet_count = stat->packet_count; > + stats->byte_count = stat->byte_count; > + stats->tcp_flags = stat->tcp_flags; > + spin_lock_init(&stats->lock); > + stats->last_writer = node; > + > + flow->stats[node] = stats; > + } > + return stats != NULL; > +} > + > u64 ovs_flow_used_time(unsigned long flow_jiffies) > { > struct timespec cur_ts; > @@ -60,114 +80,105 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies) > return cur_ms - idle_ms; > } > > -#define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF)) > - > -void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb) > +void ovs_flow_stats_update(struct sw_flow *flow, const struct flow_stats > *stat) > { > - struct flow_stats *stats; > - __be16 tcp_flags = 0; > - > - if (!flow->stats.is_percpu) > - stats = flow->stats.stat; > - else > - stats = this_cpu_ptr(flow->stats.cpu_stats); > - > - if ((flow->key.eth.type == htons(ETH_P_IP) || > - flow->key.eth.type == htons(ETH_P_IPV6)) && > - flow->key.ip.proto == IPPROTO_TCP && > - likely(skb->len >= skb_transport_offset(skb) + sizeof(struct > tcphdr))) { > - tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb)); > - } > + int node = numa_node_id(); > + struct flow_stats *stats = flow->stats[node]; > > - spin_lock(&stats->lock); > - stats->used = jiffies; > - stats->packet_count++; > - stats->byte_count += skb->len; > - stats->tcp_flags |= tcp_flags; > - spin_unlock(&stats->lock); > -} > + if (unlikely(!stats)) { > + bool done = false; > + stats = flow->stats[0]; /* Pre-allocated. */ > > -static void stats_read(struct flow_stats *stats, bool lock_bh, > - struct ovs_flow_stats *ovs_stats, > - unsigned long *used, __be16 *tcp_flags) > -{ > - if (lock_bh) > - spin_lock_bh(&stats->lock); > - else > spin_lock(&stats->lock); > + /* If the current NUMA-node is the only writer on the 'stats' > + * keep using it. > + * A previous locker may have already allocated the stats. */ > + if (stats->last_writer != node && stats->last_writer >= 0 > + && likely(!flow->stats[node])) > + done = alloc_stats(flow, stat, node); > + if (!done) > + goto update; > + goto unlock; > + } > > - if (time_after(stats->used, *used)) > - *used = stats->used; > - *tcp_flags |= stats->tcp_flags; > - ovs_stats->n_packets += stats->packet_count; > - ovs_stats->n_bytes += stats->byte_count; > - > - if (lock_bh) > - spin_unlock_bh(&stats->lock); > - else > - spin_unlock(&stats->lock); > + spin_lock(&stats->lock); > +update: > + stats->used = stat->used; > + stats->packet_count += stat->packet_count; > + stats->byte_count += stat->byte_count; > + stats->tcp_flags |= stat->tcp_flags; > + stats->last_writer = node; > +unlock: > + spin_unlock(&stats->lock); > } > > void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats > *ovs_stats, > unsigned long *used, __be16 *tcp_flags) > { > - int cpu, cur_cpu; > - > - *used = 0; > - *tcp_flags = 0; > - memset(ovs_stats, 0, sizeof(*ovs_stats)); > + int node, local_node = numa_node_id(); > + struct flow_stats *stats; > > - if (!flow->stats.is_percpu) { > - stats_read(flow->stats.stat, true, ovs_stats, used, > tcp_flags); > + /* Start from the current node, if any, as it is more likely to be > + * in the cache by now. Avoid locking when there are no stats. 
*/ > + stats = flow->stats[local_node]; > + if (stats && likely(stats->packet_count)) { > + spin_lock_bh(&stats->lock); > + *used = stats->used; > + *tcp_flags = stats->tcp_flags; > + ovs_stats->n_packets = stats->packet_count; > + ovs_stats->n_bytes = stats->byte_count; > + spin_unlock_bh(&stats->lock); > } else { > - cur_cpu = get_cpu(); > - > - for_each_possible_cpu(cpu) { > - struct flow_stats *stats; > - bool lock_bh; > + *used = 0; > + *tcp_flags = 0; > + ovs_stats->n_packets = 0; > + ovs_stats->n_bytes = 0; > + } > > - stats = per_cpu_ptr(flow->stats.cpu_stats, cpu); > - lock_bh = (cpu == cur_cpu); > - stats_read(stats, lock_bh, ovs_stats, used, > tcp_flags); > + /* Collect per NUMA-node stats from other nodes. */ > + for_each_node(node) { > + if (node == local_node) > + continue; /* Done already. */ > + stats = flow->stats[node]; > + if (stats && likely(stats->packet_count)) { > + spin_lock(&stats->lock); > + if (time_after(stats->used, *used)) > + *used = stats->used; > + *tcp_flags |= stats->tcp_flags; > + ovs_stats->n_packets += stats->packet_count; > + ovs_stats->n_bytes += stats->byte_count; > + spin_unlock(&stats->lock); > } > - put_cpu(); This needs to disable preemption; otherwise the kernel can reschedule this thread onto another NUMA node, and that can cause a deadlock. > } > } > > -static void stats_reset(struct flow_stats *stats, bool lock_bh) > -{ > - if (lock_bh) > - spin_lock_bh(&stats->lock); > - else > - spin_lock(&stats->lock); > - > - stats->used = 0; > - stats->packet_count = 0; > - stats->byte_count = 0; > - stats->tcp_flags = 0; > - > - if (lock_bh) > - spin_unlock_bh(&stats->lock); > - else > - spin_unlock(&stats->lock); > -} > - > void ovs_flow_stats_clear(struct sw_flow *flow) > { > - int cpu, cur_cpu; > - > - if (!flow->stats.is_percpu) { > - stats_reset(flow->stats.stat, true); > - } else { > - cur_cpu = get_cpu(); > - > - for_each_possible_cpu(cpu) { > - bool lock_bh; > + int node, local_node = numa_node_id(); > + struct flow_stats *stats; > > - lock_bh = (cpu == cur_cpu); > - stats_reset(per_cpu_ptr(flow->stats.cpu_stats, cpu), > lock_bh); > + /* Start from the current node. */ > + stats = flow->stats[local_node]; > + if (stats) { > + spin_lock_bh(&stats->lock); > + stats->used = 0; > + stats->packet_count = 0; > + stats->byte_count = 0; > + stats->tcp_flags = 0; > + spin_unlock_bh(&stats->lock); > + } > + for_each_node(node) { > + if (node == local_node) > + continue; /* Done already. */ > + stats = flow->stats[node]; > + if (stats) { > + spin_lock(&stats->lock); > + stats->used = 0; > + stats->packet_count = 0; > + stats->byte_count = 0; > + stats->tcp_flags = 0; > + spin_unlock(&stats->lock); > } > - put_cpu(); > } Same as above. > } > > diff --git a/datapath/flow.h b/datapath/flow.h > index eafcfd8..4a11914 100644 > --- a/datapath/flow.h > +++ b/datapath/flow.h > @@ -67,6 +67,8 @@ static inline void ovs_flow_tun_key_init(struct > ovs_key_ipv4_tunnel *tun_key, > sizeof(*tun_key) - OVS_TUNNEL_KEY_SIZE); > } > > +#define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF)) > + > struct sw_flow_key { > struct ovs_key_ipv4_tunnel tun_key; /* Encapsulating tunnel key. */ > struct { > @@ -155,14 +157,9 @@ struct flow_stats { > unsigned long used; /* Last used time (in jiffies). */ > spinlock_t lock; /* Lock for atomic stats update. */ > __be16 tcp_flags; /* Union of seen TCP flags.
*/ > -}; > - > -struct sw_flow_stats { > - bool is_percpu; > - union { > - struct flow_stats *stat; > - struct flow_stats __percpu *cpu_stats; > - }; > + int last_writer; /* NUMA-node id of the last writer or > + * -1. Meaningful for 'stats[0]' only. > + */ > }; > > struct sw_flow { > @@ -174,7 +171,11 @@ struct sw_flow { > struct sw_flow_key unmasked_key; > struct sw_flow_mask *mask; > struct sw_flow_actions __rcu *sf_acts; > - struct sw_flow_stats stats; > + struct flow_stats *stats[]; /* One for each NUMA node. First one > + * is allocated at flow creation time, > + * the rest are allocated on demand > + * while holding the 'stats[0].lock'. > + */ > }; > > struct arp_eth_header { > @@ -191,7 +192,7 @@ struct arp_eth_header { > unsigned char ar_tip[4]; /* target IP address */ > } __packed; > > -void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb); > +void ovs_flow_stats_update(struct sw_flow *flow, const struct flow_stats > *stat); > void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *stats, > unsigned long *used, __be16 *tcp_flags); > void ovs_flow_stats_clear(struct sw_flow *flow); > diff --git a/datapath/flow_netlink.c b/datapath/flow_netlink.c > index 39fe4bf..3024a61 100644 > --- a/datapath/flow_netlink.c > +++ b/datapath/flow_netlink.c > @@ -268,20 +268,6 @@ static bool is_all_zero(const u8 *fp, size_t size) > return true; > } > > -static bool is_all_set(const u8 *fp, size_t size) > -{ > - int i; > - > - if (!fp) > - return false; > - > - for (i = 0; i < size; i++) > - if (fp[i] != 0xff) > - return false; > - > - return true; > -} > - > static int __parse_flow_nlattrs(const struct nlattr *attr, > const struct nlattr *a[], > u64 *attrsp, bool nz) > @@ -503,9 +489,8 @@ static int metadata_from_nlattrs(struct sw_flow_match > *match, u64 *attrs, > return 0; > } > > -static int ovs_key_from_nlattrs(struct sw_flow_match *match, bool > *exact_5tuple, > - u64 attrs, const struct nlattr **a, > - bool is_mask) > +static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs, > + const struct nlattr **a, bool is_mask) > { > int err; > u64 orig_attrs = attrs; > @@ -562,11 +547,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match > *match, bool *exact_5tuple > SW_FLOW_KEY_PUT(match, eth.type, htons(ETH_P_802_2), is_mask); > } > > - if (is_mask && exact_5tuple) { > - if (match->mask->key.eth.type != htons(0xffff)) > - *exact_5tuple = false; > - } > - > if (attrs & (1ULL << OVS_KEY_ATTR_IPV4)) { > const struct ovs_key_ipv4 *ipv4_key; > > @@ -589,13 +569,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match > *match, bool *exact_5tuple > SW_FLOW_KEY_PUT(match, ipv4.addr.dst, > ipv4_key->ipv4_dst, is_mask); > attrs &= ~(1ULL << OVS_KEY_ATTR_IPV4); > - > - if (is_mask && exact_5tuple && *exact_5tuple) { > - if (ipv4_key->ipv4_proto != 0xff || > - ipv4_key->ipv4_src != htonl(0xffffffff) || > - ipv4_key->ipv4_dst != htonl(0xffffffff)) > - *exact_5tuple = false; > - } > } > > if (attrs & (1ULL << OVS_KEY_ATTR_IPV6)) { > @@ -627,15 +600,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match > *match, bool *exact_5tuple > is_mask); > > attrs &= ~(1ULL << OVS_KEY_ATTR_IPV6); > - > - if (is_mask && exact_5tuple && *exact_5tuple) { > - if (ipv6_key->ipv6_proto != 0xff || > - !is_all_set((const u8 *)ipv6_key->ipv6_src, > - sizeof(match->key->ipv6.addr.src)) || > - !is_all_set((const u8 *)ipv6_key->ipv6_dst, > - sizeof(match->key->ipv6.addr.dst))) > - *exact_5tuple = false; > - } > } > > if (attrs & (1ULL << OVS_KEY_ATTR_ARP)) { > @@ -678,11 +642,6 @@ 
static int ovs_key_from_nlattrs(struct sw_flow_match > *match, bool *exact_5tuple > tcp_key->tcp_dst, is_mask); > } > attrs &= ~(1ULL << OVS_KEY_ATTR_TCP); > - > - if (is_mask && exact_5tuple && *exact_5tuple && > - (tcp_key->tcp_src != htons(0xffff) || > - tcp_key->tcp_dst != htons(0xffff))) > - *exact_5tuple = false; > } > > if (attrs & (1ULL << OVS_KEY_ATTR_TCP_FLAGS)) { > @@ -714,11 +673,6 @@ static int ovs_key_from_nlattrs(struct sw_flow_match > *match, bool *exact_5tuple > udp_key->udp_dst, is_mask); > } > attrs &= ~(1ULL << OVS_KEY_ATTR_UDP); > - > - if (is_mask && exact_5tuple && *exact_5tuple && > - (udp_key->udp_src != htons(0xffff) || > - udp_key->udp_dst != htons(0xffff))) > - *exact_5tuple = false; > } > > if (attrs & (1ULL << OVS_KEY_ATTR_SCTP)) { > @@ -804,7 +758,6 @@ static void sw_flow_mask_set(struct sw_flow_mask *mask, > * attribute specifies the mask field of the wildcarded flow. > */ > int ovs_nla_get_match(struct sw_flow_match *match, > - bool *exact_5tuple, > const struct nlattr *key, > const struct nlattr *mask) > { > @@ -852,13 +805,10 @@ int ovs_nla_get_match(struct sw_flow_match *match, > } > } > > - err = ovs_key_from_nlattrs(match, NULL, key_attrs, a, false); > + err = ovs_key_from_nlattrs(match, key_attrs, a, false); > if (err) > return err; > > - if (exact_5tuple) > - *exact_5tuple = true; > - > if (mask) { > err = parse_flow_mask_nlattrs(mask, a, &mask_attrs); > if (err) > @@ -896,7 +846,7 @@ int ovs_nla_get_match(struct sw_flow_match *match, > } > } > > - err = ovs_key_from_nlattrs(match, exact_5tuple, mask_attrs, > a, true); > + err = ovs_key_from_nlattrs(match, mask_attrs, a, true); > if (err) > return err; > } else { > diff --git a/datapath/flow_netlink.h b/datapath/flow_netlink.h > index b31fbe2..4401510 100644 > --- a/datapath/flow_netlink.h > +++ b/datapath/flow_netlink.h > @@ -45,7 +45,6 @@ int ovs_nla_put_flow(const struct sw_flow_key *, > int ovs_nla_get_flow_metadata(struct sw_flow *flow, > const struct nlattr *attr); > int ovs_nla_get_match(struct sw_flow_match *match, > - bool *exact_5tuple, > const struct nlattr *, > const struct nlattr *); > > diff --git a/datapath/flow_table.c b/datapath/flow_table.c > index 20854ac..bbc300b 100644 > --- a/datapath/flow_table.c > +++ b/datapath/flow_table.c > @@ -50,6 +50,7 @@ > #define REHASH_INTERVAL (10 * 60 * HZ) > > static struct kmem_cache *flow_cache; > +struct kmem_cache *flow_stats_cache; > can you mark it __read_mostly. > static u16 range_n_bytes(const struct sw_flow_key_range *range) > { > @@ -74,10 +75,10 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const > struct sw_flow_key *src, > *d++ = *s++ & *m++; > } > > -struct sw_flow *ovs_flow_alloc(bool percpu_stats) > +struct sw_flow *ovs_flow_alloc(void) > { > struct sw_flow *flow; > - int cpu; > + int node; > > flow = kmem_cache_alloc(flow_cache, GFP_KERNEL); > if (!flow) > @@ -86,26 +87,19 @@ struct sw_flow *ovs_flow_alloc(bool percpu_stats) > flow->sf_acts = NULL; > flow->mask = NULL; > > - flow->stats.is_percpu = percpu_stats; > + /* Initialize the default stat node. 
*/ > + flow->stats[0] = kmem_cache_alloc_node(flow_stats_cache, > + GFP_KERNEL | __GFP_ZERO, 0); > + if (!flow->stats[0]) > + goto err; > > - if (!percpu_stats) { > - flow->stats.stat = kzalloc(sizeof(*flow->stats.stat), > GFP_KERNEL); > - if (!flow->stats.stat) > - goto err; > + spin_lock_init(&flow->stats[0]->lock); > + flow->stats[0]->last_writer = -1; > > - spin_lock_init(&flow->stats.stat->lock); > - } else { > - flow->stats.cpu_stats = alloc_percpu(struct flow_stats); > - if (!flow->stats.cpu_stats) > - goto err; > - > - for_each_possible_cpu(cpu) { > - struct flow_stats *cpu_stats; > + for_each_node(node) > + if (node > 0) > + flow->stats[node] = NULL; > > - cpu_stats = per_cpu_ptr(flow->stats.cpu_stats, cpu); > - spin_lock_init(&cpu_stats->lock); > - } > - } > return flow; > err: > kmem_cache_free(flow_cache, flow); > @@ -142,11 +136,12 @@ static struct flex_array *alloc_buckets(unsigned int > n_buckets) > > static void flow_free(struct sw_flow *flow) > { > + int node; > + > kfree((struct sf_flow_acts __force *)flow->sf_acts); > - if (flow->stats.is_percpu) > - free_percpu(flow->stats.cpu_stats); > - else > - kfree(flow->stats.stat); > + for_each_node(node) > + if (flow->stats[node]) > + kmem_cache_free(flow_stats_cache, flow->stats[node]); > kmem_cache_free(flow_cache, flow); > } > > @@ -607,16 +602,28 @@ int ovs_flow_init(void) > BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long)); > BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long)); > > - flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0, > - 0, NULL); > + flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow) > + + (num_possible_nodes() > + * sizeof(struct flow_stats *)), > + 0, SLAB_HWCACHE_ALIGN, NULL); > if (flow_cache == NULL) > return -ENOMEM; > > + flow_stats_cache > + = kmem_cache_create("sw_flow_stats", sizeof(struct > flow_stats), > + 0, SLAB_HWCACHE_ALIGN, NULL); > + if (flow_stats_cache == NULL) { > + kmem_cache_destroy(flow_cache); > + flow_cache = NULL; > + return -ENOMEM; > + } > + > return 0; > } > > /* Uninitializes the flow module. */ > void ovs_flow_exit(void) > { > + kmem_cache_destroy(flow_stats_cache); > kmem_cache_destroy(flow_cache); > } > diff --git a/datapath/flow_table.h b/datapath/flow_table.h > index baaeb10..ca8a582 100644 > --- a/datapath/flow_table.h > +++ b/datapath/flow_table.h > @@ -52,10 +52,12 @@ struct flow_table { > unsigned int count; > }; > > +extern struct kmem_cache *flow_stats_cache; > + > int ovs_flow_init(void); > void ovs_flow_exit(void); > > -struct sw_flow *ovs_flow_alloc(bool percpu_stats); > +struct sw_flow *ovs_flow_alloc(void); > void ovs_flow_free(struct sw_flow *, bool deferred); > > int ovs_flow_tbl_init(struct flow_table *); > -- > 1.7.10.4 > > _______________________________________________ > dev mailing list > dev@openvswitch.org > http://openvswitch.org/mailman/listinfo/dev _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev