On Thu, Jul 12, 2018 at 12:46:46PM +0200, Adrien Mazarguil wrote: > On Wed, Jul 11, 2018 at 05:59:18PM -0700, Yongseok Koh wrote: > > On Wed, Jun 27, 2018 at 08:08:12PM +0200, Adrien Mazarguil wrote: > > > Because mlx5 switch flow rules are configured through Netlink (TC > > > interface) and have little in common with Verbs, this patch adds a > > > separate > > > parser function to handle them. > > > > > > - mlx5_nl_flow_transpose() converts a rte_flow rule to its TC equivalent > > > and stores the result in a buffer. > > > > > > - mlx5_nl_flow_brand() gives a unique handle to a flow rule buffer. > > > > > > - mlx5_nl_flow_create() instantiates a flow rule on the device based on > > > such a buffer. > > > > > > - mlx5_nl_flow_destroy() performs the reverse operation. > > > > > > These functions are called by the existing implementation when > > > encountering > > > flow rules which must be offloaded to the switch (currently relying on the > > > transfer attribute). > > > > > > Signed-off-by: Adrien Mazarguil <adrien.mazarg...@6wind.com> > > > Signed-off-by: Nelio Laranjeiro <nelio.laranje...@6wind.com> > <snip> > > > diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c > > > index 9241855be..93b245991 100644 > > > --- a/drivers/net/mlx5/mlx5_flow.c > > > +++ b/drivers/net/mlx5/mlx5_flow.c > > > @@ -4,6 +4,7 @@ > > > */ > > > > > > #include <sys/queue.h> > > > +#include <stdalign.h> > > > #include <stdint.h> > > > #include <string.h> > > > > > > @@ -271,6 +272,7 @@ struct rte_flow { > > > /**< Store tunnel packet type data to store in Rx queue. */ > > > uint8_t key[40]; /**< RSS hash key. */ > > > uint16_t (*queue)[]; /**< Destination queues to redirect traffic to. */ > > > + void *nl_flow; /**< Netlink flow buffer if relevant. */ > > > }; > > > > > > static const struct rte_flow_ops mlx5_flow_ops = { > > > @@ -2403,6 +2405,106 @@ mlx5_flow_actions(struct rte_eth_dev *dev, > > > } > > > > > > /** > > > + * Validate flow rule and fill flow structure accordingly. > > > + * > > > + * @param dev > > > + * Pointer to Ethernet device. > > > + * @param[out] flow > > > + * Pointer to flow structure. > > > + * @param flow_size > > > + * Size of allocated space for @p flow. > > > + * @param[in] attr > > > + * Flow rule attributes. > > > + * @param[in] pattern > > > + * Pattern specification (list terminated by the END pattern item). > > > + * @param[in] actions > > > + * Associated actions (list terminated by the END action). > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * > > > + * @return > > > + * A positive value representing the size of the flow object in bytes > > > + * regardless of @p flow_size on success, a negative errno value > > > otherwise > > > + * and rte_errno is set. > > > + */ > > > +static int > > > +mlx5_flow_merge_switch(struct rte_eth_dev *dev, > > > + struct rte_flow *flow, > > > + size_t flow_size, > > > + const struct rte_flow_attr *attr, > > > + const struct rte_flow_item pattern[], > > > + const struct rte_flow_action actions[], > > > + struct rte_flow_error *error) > > > +{ > > > + struct priv *priv = dev->data->dev_private; > > > + unsigned int n = mlx5_domain_to_port_id(priv->domain_id, NULL, 0); > > > + uint16_t port_list[!n + n]; > > > + struct mlx5_nl_flow_ptoi ptoi[!n + n + 1]; > > > + size_t off = RTE_ALIGN_CEIL(sizeof(*flow), alignof(max_align_t)); > > > + unsigned int i; > > > + unsigned int own = 0; > > > + int ret; > > > + > > > + /* At least one port is needed when no switch domain is present. */ > > > + if (!n) { > > > + n = 1; > > > + port_list[0] = dev->data->port_id; > > > + } else { > > > + n = mlx5_domain_to_port_id(priv->domain_id, port_list, n); > > > + if (n > RTE_DIM(port_list)) > > > + n = RTE_DIM(port_list); > > > + } > > > + for (i = 0; i != n; ++i) { > > > + struct rte_eth_dev_info dev_info; > > > + > > > + rte_eth_dev_info_get(port_list[i], &dev_info); > > > + if (port_list[i] == dev->data->port_id) > > > + own = i; > > > + ptoi[i].port_id = port_list[i]; > > > + ptoi[i].ifindex = dev_info.if_index; > > > + } > > > + /* Ensure first entry of ptoi[] is the current device. */ > > > + if (own) { > > > + ptoi[n] = ptoi[0]; > > > + ptoi[0] = ptoi[own]; > > > + ptoi[own] = ptoi[n]; > > > + } > > > + /* An entry with zero ifindex terminates ptoi[]. */ > > > + ptoi[n].port_id = 0; > > > + ptoi[n].ifindex = 0; > > > + if (flow_size < off) > > > + flow_size = 0; > > > + ret = mlx5_nl_flow_transpose((uint8_t *)flow + off, > > > + flow_size ? flow_size - off : 0, > > > + ptoi, attr, pattern, actions, error); > > > + if (ret < 0) > > > + return ret; > > > > So, there's an assumption that the buffer allocated outside of this API is > > enough to include all the messages in mlx5_nl_flow_transpose(), right? If > > flow_size isn't enough, buf_tmp will be used and _transpose() doesn't return > > error but required size. Sounds confusing, may need to make a change or to > > have > > clearer documentation. > > Well, isn't it already documented? Besides these are the usual snprintf() > semantics used everywhere in these files, I think this was a major topic of > discussion with Nelio on the flow rework series :) > > buf_tmp[] is internal to mlx5_nl_flow_transpose() and used as a fallback to > complete a pass when input buffer is not large enough (including > zero-sized). Having a valid buffer is a constraint imposed by libmnl, > because we badly want to know how much space will be needed assuming the > flow rule was successfully processed. > > Without libmnl, the helpers it provides would have been written in a way > that doesn't require buf_tmp[]. However libmnl is just too convenient to > pass up, hence this compromise. > > (just to remind onlookers, we want to allocate the minimum amount of memory > we possibly can for resources needed by each flow rule, and do so through a > single allocation, end goal being to support millions of flow rules while > wasting as little memory as possible.) > > <snip> > > > diff --git a/drivers/net/mlx5/mlx5_nl_flow.c > > > b/drivers/net/mlx5/mlx5_nl_flow.c > > > index 7a8683b03..1fc62fb0a 100644 > > > --- a/drivers/net/mlx5/mlx5_nl_flow.c > > > +++ b/drivers/net/mlx5/mlx5_nl_flow.c > > > @@ -5,7 +5,9 @@ > > > > > > #include <errno.h> > > > #include <libmnl/libmnl.h> > > > +#include <linux/if_ether.h> > > > #include <linux/netlink.h> > > > +#include <linux/pkt_cls.h> > > > #include <linux/pkt_sched.h> > > > #include <linux/rtnetlink.h> > > > #include <stdalign.h> > > > @@ -14,11 +16,248 @@ > > > #include <stdlib.h> > > > #include <sys/socket.h> > > > > > > +#include <rte_byteorder.h> > > > #include <rte_errno.h> > > > #include <rte_flow.h> > > > > > > #include "mlx5.h" > > > > > > +/** Parser state definitions for mlx5_nl_flow_trans[]. */ > > > +enum mlx5_nl_flow_trans { > > > + INVALID, > > > + BACK, > > > + ATTR, > > > + PATTERN, > > > + ITEM_VOID, > > > + ACTIONS, > > > + ACTION_VOID, > > > + END, > > > +}; > > > + > > > +#define TRANS(...) (const enum mlx5_nl_flow_trans []){ __VA_ARGS__, > > > INVALID, } > > > + > > > +#define PATTERN_COMMON \ > > > + ITEM_VOID, ACTIONS > > > +#define ACTIONS_COMMON \ > > > + ACTION_VOID, END > > > + > > > +/** Parser state transitions used by mlx5_nl_flow_transpose(). */ > > > +static const enum mlx5_nl_flow_trans *const mlx5_nl_flow_trans[] = { > > > + [INVALID] = NULL, > > > + [BACK] = NULL, > > > + [ATTR] = TRANS(PATTERN), > > > + [PATTERN] = TRANS(PATTERN_COMMON), > > > + [ITEM_VOID] = TRANS(BACK), > > > + [ACTIONS] = TRANS(ACTIONS_COMMON), > > > + [ACTION_VOID] = TRANS(BACK), > > > + [END] = NULL, > > > +}; > > > + > > > +/** > > > + * Transpose flow rule description to rtnetlink message. > > > + * > > > + * This function transposes a flow rule description to a traffic control > > > + * (TC) filter creation message ready to be sent over Netlink. > > > + * > > > + * Target interface is specified as the first entry of the @p ptoi table. > > > + * Subsequent entries enable this function to resolve other DPDK port IDs > > > + * found in the flow rule. > > > + * > > > + * @param[out] buf > > > + * Output message buffer. May be NULL when @p size is 0. > > > + * @param size > > > + * Size of @p buf. Message may be truncated if not large enough. > > > + * @param[in] ptoi > > > + * DPDK port ID to network interface index translation table. This > > > table > > > + * is terminated by an entry with a zero ifindex value. > > > + * @param[in] attr > > > + * Flow rule attributes. > > > + * @param[in] pattern > > > + * Pattern specification. > > > + * @param[in] actions > > > + * Associated actions. > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * > > > + * @return > > > + * A positive value representing the exact size of the message in bytes > > > + * regardless of the @p size parameter on success, a negative errno > > > value > > > + * otherwise and rte_errno is set. > > > + */ > > > +int > > > +mlx5_nl_flow_transpose(void *buf, > > > + size_t size, > > > + const struct mlx5_nl_flow_ptoi *ptoi, > > > + const struct rte_flow_attr *attr, > > > + const struct rte_flow_item *pattern, > > > + const struct rte_flow_action *actions, > > > + struct rte_flow_error *error) > > > +{ > > > + alignas(struct nlmsghdr) > > > + uint8_t buf_tmp[MNL_SOCKET_BUFFER_SIZE]; > > > + const struct rte_flow_item *item; > > > + const struct rte_flow_action *action; > > > + unsigned int n; > > > + struct nlattr *na_flower; > > > + struct nlattr *na_flower_act; > > > + const enum mlx5_nl_flow_trans *trans; > > > + const enum mlx5_nl_flow_trans *back; > > > + > > > + if (!size) > > > + goto error_nobufs; > > > +init: > > > + item = pattern; > > > + action = actions; > > > + n = 0; > > > + na_flower = NULL; > > > + na_flower_act = NULL; > > > + trans = TRANS(ATTR); > > > + back = trans; > > > +trans: > > > + switch (trans[n++]) { > > > + struct nlmsghdr *nlh; > > > + struct tcmsg *tcm; > > > + > > > + case INVALID: > > > + if (item->type) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ITEM, > > > + item, "unsupported pattern item combination"); > > > + else if (action->type) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_ACTION, > > > + action, "unsupported action combination"); > > > + return rte_flow_error_set > > > + (error, ENOTSUP, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL, > > > + "flow rule lacks some kind of fate action"); > > > + case BACK: > > > + trans = back; > > > + n = 0; > > > + goto trans; > > > + case ATTR: > > > + /* > > > + * Supported attributes: no groups, some priorities and > > > + * ingress only. Don't care about transfer as it is the > > > + * caller's problem. > > > + */ > > > + if (attr->group) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_ATTR_GROUP, > > > + attr, "groups are not supported"); > > > + if (attr->priority > 0xfffe) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_ATTR_PRIORITY, > > > + attr, "lowest priority level is 0xfffe"); > > > + if (!attr->ingress) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_ATTR_INGRESS, > > > + attr, "only ingress is supported"); > > > + if (attr->egress) > > > + return rte_flow_error_set > > > + (error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_ATTR_INGRESS, > > > + attr, "egress is not supported"); > > > + if (size < mnl_nlmsg_size(sizeof(*tcm))) > > > + goto error_nobufs; > > > + nlh = mnl_nlmsg_put_header(buf); > > > + nlh->nlmsg_type = 0; > > > + nlh->nlmsg_flags = 0; > > > + nlh->nlmsg_seq = 0; > > > + tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm)); > > > + tcm->tcm_family = AF_UNSPEC; > > > + tcm->tcm_ifindex = ptoi[0].ifindex; > > > + /* > > > + * Let kernel pick a handle by default. A predictable handle > > > + * can be set by the caller on the resulting buffer through > > > + * mlx5_nl_flow_brand(). > > > + */ > > > + tcm->tcm_handle = 0; > > > + tcm->tcm_parent = TC_H_MAKE(TC_H_INGRESS, TC_H_MIN_INGRESS); > > > + /* > > > + * Priority cannot be zero to prevent the kernel from > > > + * picking one automatically. > > > + */ > > > + tcm->tcm_info = TC_H_MAKE((attr->priority + 1) << 16, > > > + RTE_BE16(ETH_P_ALL)); > > > + break; > > > + case PATTERN: > > > + if (!mnl_attr_put_strz_check(buf, size, TCA_KIND, "flower")) > > > + goto error_nobufs; > > > + na_flower = mnl_attr_nest_start_check(buf, size, TCA_OPTIONS); > > > + if (!na_flower) > > > + goto error_nobufs; > > > + if (!mnl_attr_put_u32_check(buf, size, TCA_FLOWER_FLAGS, > > > + TCA_CLS_FLAGS_SKIP_SW)) > > > + goto error_nobufs; > > > + break; > > > + case ITEM_VOID: > > > + if (item->type != RTE_FLOW_ITEM_TYPE_VOID) > > > + goto trans; > > > + ++item; > > > + break; > > > + case ACTIONS: > > > + if (item->type != RTE_FLOW_ITEM_TYPE_END) > > > + goto trans; > > > + assert(na_flower); > > > + assert(!na_flower_act); > > > + na_flower_act = > > > + mnl_attr_nest_start_check(buf, size, TCA_FLOWER_ACT); > > > + if (!na_flower_act) > > > + goto error_nobufs; > > > + break; > > > + case ACTION_VOID: > > > + if (action->type != RTE_FLOW_ACTION_TYPE_VOID) > > > + goto trans; > > > + ++action; > > > + break; > > > + case END: > > > + if (item->type != RTE_FLOW_ITEM_TYPE_END || > > > + action->type != RTE_FLOW_ACTION_TYPE_END) > > > + goto trans; > > > + if (na_flower_act) > > > + mnl_attr_nest_end(buf, na_flower_act); > > > + if (na_flower) > > > + mnl_attr_nest_end(buf, na_flower); > > > + nlh = buf; > > > + return nlh->nlmsg_len; > > > + } > > > + back = trans; > > > + trans = mlx5_nl_flow_trans[trans[n - 1]]; > > > + n = 0; > > > + goto trans; > > > +error_nobufs: > > > + if (buf != buf_tmp) { > > > + buf = buf_tmp; > > > + size = sizeof(buf_tmp); > > > + goto init; > > > + } > > > > Continuing my comment above. > > This part is unclear. It looks to me that this func does: > > > > 1) if size is zero, consider it as a testing call to know the amount of > > memory > > required. > > Yeah, in fact this one is a shortcut to speed up this specific scenario as > it happens all the time in the two-pass use case. You can lump it with 2). > > > 2) if size isn't zero but not enough, it stops writing to buf and start > > over to > > return the amount of memory required instead of returning error. > > 3) if size isn't zero and enough, it fills in buf. > > > > Do I correctly understand? > > Yes. Another minor note for 2), the returned buffer is also filled up to the > point of failure (mimics snprintf()). > > Perhaps the following snippet can better summarize the envisioned approach: > > int ret = snprintf(NULL, 0, "something", ...); > > if (ret < 0) { > goto court; > } else { > char buf[ret]; > > snprintf(buf, sizeof(buf), "something", ...); /* Guaranteed. */ > [...] > }
I know you and Nelio mimicked snprintf() but as _merge() isn't a public API for users but an internal API. I didn't think it should necessarily be like that. I hoped to have it used either for testing (knowing size) or real translation - 1) and 3). And no possibility for 2), then 2) would've been handled by assert(). I beleive this could've made the code simpler. However, as I already acked Nelio's patchset, agreed on the idea and Nelio already documented the behavior, Acked-by: Yongseok Koh <ys...@mellanox.com> Thanks