On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote: > > -----Original Message----- > > From: Yongseok Koh > > Sent: Thursday, October 25, 2018 3:28 > > To: Slava Ovsiienko <viachesl...@mellanox.com> > > Cc: Shahaf Shuler <shah...@mellanox.com>; dev@dpdk.org > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices > > management > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote: > > > VXLAN interfaces are dynamically created for each local UDP port of > > > outer networks and then used as targets for TC "flower" filters in > > > order to perform encapsulation. These VXLAN interfaces are > > > system-wide, the only one device with given UDP port can exist in the > > > system (the attempt of creating another device with the same UDP local > > > port returns EEXIST), so PMD should support the shared device > > > instances database for PMD instances. These VXLAN implicitly created > > > devices are called VTEPs (Virtual Tunnel End Points). > > > > > > Creation of the VTEP occurs at the moment of rule applying. The link > > > is set up, root ingress qdisc is also initialized. > > > > > > Encapsulation VTEPs are created on per port basis, the single VTEP is > > > attached to the outer interface and is shared for all encapsulation > > > rules on this interface. The source UDP port is automatically selected > > > in range 30000-60000. > > > > > > For decapsulaton one VTEP is created per every unique UDP local port > > > to accept tunnel traffic. The name of created VTEP consists of prefix > > > "vmlx_" and the number of UDP port in decimal digits without leading > > > zeros (vmlx_4789). The VTEP can be preliminary created in the system > > > before the launching > > > application, it allows to share UDP ports between primary > > > and secondary processes. > > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarg...@6wind.com> > > > Signed-off-by: Viacheslav Ovsiienko <viachesl...@mellanox.com> > > > --- > > > drivers/net/mlx5/mlx5_flow_tcf.c | 503 > > > ++++++++++++++++++++++++++++++++++++++- > > > 1 file changed, 499 insertions(+), 4 deletions(-) > > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c > > > b/drivers/net/mlx5/mlx5_flow_tcf.c > > > index d6840d5..efa9c3b 100644 > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c > > > @@ -3443,6 +3443,432 @@ struct pedit_parser { > > > return -err; > > > } > > > > > > +/* VTEP device list is shared between PMD port instances. */ static > > > +LIST_HEAD(, mlx5_flow_tcf_vtep) > > > + vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static > > pthread_mutex_t > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER; > > > > What's the reason for choosing pthread_mutex instead of rte_*_lock? > > The sharing this database for secondary processes?
The static variable isn't shared with sec proc. But you can leave it as is. > > > + > > > +/** > > > + * Deletes VTEP network device. > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] vtep > > > + * Object represinting the network device to delete. Memory > > > + * allocated for this object is freed by routine. > > > + */ > > > +static void > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf, > > > + struct mlx5_flow_tcf_vtep *vtep) { > > > + struct nlmsghdr *nlh; > > > + struct ifinfomsg *ifm; > > > + alignas(struct nlmsghdr) > > > + uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8]; > > > + int ret; > > > + > > > + assert(!vtep->refcnt); > > > + if (vtep->created && vtep->ifindex) { > > > > First of all vtep->created seems of no use. It is introduced to select the > > error > > message in flow_tcf_create_iface(). I don't see any necessity to distinguish > > between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'. > > created flag indicates the iface is created by our code. > The VXLAN decap devices must have the specified UDP port, we can not create > multiple VXLAN devices with the same UDP port - EEXIST is returned. So, we > have > to share device. One option is create device before DPDK application launch > and use > these pre-created devices. Inthis case created flag is not set and VXLAN > device > is not reinitialized, and not deleted. I can't see any code to use pre-created device (created even before dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment in [7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on initialization. Your comment sounds conflicting and confusing. > > And why do you need to check vtep->ifindex as well? If vtep is created in > > kernel and its ifindex isn't set, that should be an error which had to be > > hanled > > in flow_tcf_create_iface(). Such a vtep shouldn't exist. > Yes, if we did not get ifindex of device - vtep is not created, error > returned. > We just can not operate w/o ifindex. I know ifindex is needed but my question was checking vtep->ifindex here looked redundant/unnecessary. But as you agreed on having create/get/release_iface(), it doesn't matter much. > > Also, the refcnt management is a bit strange. Please put an abstraction by > > adding create_iface(), get_iface() and release_iface(). In the get_ifce(), > > vtep->refcnt should be incremented. And in the release_iface(), it > > vtep->decrease the > OK. Good proposal. I'll refactor the code. > > > refcnt and if it reaches to zero, the iface can be removed. create_iface() > > will > > set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does > > searching the list not by repeating the same lookup code here and there. > > That will make your code much simpler. > > > > > + DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex); > > > + nlh = mnl_nlmsg_put_header(buf); > > > + nlh->nlmsg_type = RTM_DELLINK; > > > + nlh->nlmsg_flags = NLM_F_REQUEST; > > > + ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); > > > + ifm->ifi_family = AF_UNSPEC; > > > + ifm->ifi_index = vtep->ifindex; > > > + ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL); > > > + if (ret) > > > + DRV_LOG(WARNING, "netlink: error deleting VXLAN > > " > > > + "encap/decap ifindex %u", > > > + ifm->ifi_index); > > > + } > > > + rte_free(vtep); > > > +} > > > + > > > +/** > > > + * Creates VTEP network device. > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] ifouter > > > + * Outer interface to attach new-created VXLAN device > > > + * If zero the VXLAN device will not be attached to any device. > > > + * @param[in] port > > > + * UDP port of created VTEP device. > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * > > > + * @return > > > + * Pointer to created device structure on success, NULL otherwise > > > + * and rte_errno is set. > > > + */ > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA > > > > Why negative(ifndef) first intead of positive(ifdef)? > Hm. Did I miss the rule. Positive #ifdef first? OK. No concrete rule but if there's no specific reason, it would be better to start from ifdef. > > > +static struct mlx5_flow_tcf_vtep* > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused, > > > + unsigned int ifouter __rte_unused, > > > + uint16_t port __rte_unused, > > > + struct rte_flow_error *error) { > > > + rte_flow_error_set(error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL, > > > + "netlink: failed to create VTEP, " > > > + "VXLAN metadat is not supported by kernel"); > > > > Typo. > > OK. "metadata are not supported". > > > > > + return NULL; > > > +} > > > +#else > > > +static struct mlx5_flow_tcf_vtep* > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf, > > > > How about adding 'vtep'? It sounds vague - creating a general interface. > > E.g., flow_tcf_create_vtep_iface()? > > OK. > > > > > > + unsigned int ifouter, > > > + uint16_t port, struct rte_flow_error *error) { > > > + struct mlx5_flow_tcf_vtep *vtep; > > > + struct nlmsghdr *nlh; > > > + struct ifinfomsg *ifm; > > > + char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24]; > > > + alignas(struct nlmsghdr) > > > + uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 + > > > > Use a macro for '128'. Can't know the meaning. > OK. I think we should calculate the buffer size explicitly. > > > > > > + SZ_NLATTR_DATA_OF(sizeof(name)) + > > > + SZ_NLATTR_NEST * 2 + > > > + SZ_NLATTR_STRZ_OF("vxlan") + > > > + SZ_NLATTR_DATA_OF(sizeof(uint32_t)) + > > > + SZ_NLATTR_DATA_OF(sizeof(uint32_t)) + > > > + SZ_NLATTR_DATA_OF(sizeof(uint16_t)) + > > > + SZ_NLATTR_DATA_OF(sizeof(uint8_t))]; > > > + struct nlattr *na_info; > > > + struct nlattr *na_vxlan; > > > + rte_be16_t vxlan_port = RTE_BE16(port); > > > > Use rte_cpu_to_be_*() instead. > > Yes, I'll recheck the whole code for this issue. > > > > > > + int ret; > > > + > > > + vtep = rte_zmalloc(__func__, sizeof(*vtep), > > > + alignof(struct mlx5_flow_tcf_vtep)); > > > + if (!vtep) { > > > + rte_flow_error_set > > > + (error, ENOMEM, > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED, > > > + NULL, "unadble to allocate memory for VTEP desc"); > > > + return NULL; > > > + } > > > + *vtep = (struct mlx5_flow_tcf_vtep){ > > > + .refcnt = 0, > > > + .port = port, > > > + .created = 0, > > > + .ifouter = 0, > > > + .ifindex = 0, > > > + .local = LIST_HEAD_INITIALIZER(), > > > + .neigh = LIST_HEAD_INITIALIZER(), > > > + }; > > > + memset(buf, 0, sizeof(buf)); > > > + nlh = mnl_nlmsg_put_header(buf); > > > + nlh->nlmsg_type = RTM_NEWLINK; > > > + nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | > > NLM_F_EXCL; > > > + ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); > > > + ifm->ifi_family = AF_UNSPEC; > > > + ifm->ifi_type = 0; > > > + ifm->ifi_index = 0; > > > + ifm->ifi_flags = IFF_UP; > > > + ifm->ifi_change = 0xffffffff; > > > + snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, > > port); > > > + mnl_attr_put_strz(nlh, IFLA_IFNAME, name); > > > + na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO); > > > + assert(na_info); > > > + mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan"); > > > + na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA); > > > + if (ifouter) > > > + mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter); > > > + assert(na_vxlan); > > > + mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1); > > > + mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1); > > > + mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0); > > > + mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port); > > > + mnl_attr_nest_end(nlh, na_vxlan); > > > + mnl_attr_nest_end(nlh, na_info); > > > + assert(sizeof(buf) >= nlh->nlmsg_len); > > > + ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL); > > > + if (ret) > > > + DRV_LOG(WARNING, > > > + "netlink: VTEP %s create failure (%d)", > > > + name, rte_errno); > > > + else > > > + vtep->created = 1; > > > > Flow of code here isn't smooth, thus could be error-prone. Most of all, I > > don't like ret has multiple meanings. ret should be return value but you are > > using it to store ifindex. > > > > > + if (ret && ifouter) > > > + ret = 0; > > > + else > > > + ret = if_nametoindex(name); > > > > If vtep isn't created and ifouter is set, then skip init below, which > > means, if > > ifouter is set for VXLAN encap devices. They should be attached to ifouter > and can not be shared. So, if ifouter I set - we do not use the > precreated/existing > VXLAN devices. We have to create our own not shared device. In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows. Do you mean it isn't shared between different outer ifaces? If so, that's for sure. > > vtep is created or ifouter is set, it tries to get ifindex of vtep. > > But why do you want to try to call this API even if it failed to create > > vtep? > > Let's not make code flow convoluted even though it logically works. Let's > > make it straightforward. > > > > > + if (ret) { > > > + vtep->ifindex = ret; > > > + vtep->ifouter = ifouter; > > > + memset(buf, 0, sizeof(buf)); > > > + nlh = mnl_nlmsg_put_header(buf); > > > + nlh->nlmsg_type = RTM_NEWLINK; > > > + nlh->nlmsg_flags = NLM_F_REQUEST; > > > + ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); > > > + ifm->ifi_family = AF_UNSPEC; > > > + ifm->ifi_type = 0; > > > + ifm->ifi_index = vtep->ifindex; > > > + ifm->ifi_flags = IFF_UP; > > > + ifm->ifi_change = IFF_UP; > > > + ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL); > > > + if (ret) { > > > + DRV_LOG(WARNING, > > > + "netlink: VTEP %s set link up failure (%d)", > > > + name, rte_errno); > > > + rte_free(vtep); > > > + rte_flow_error_set > > > + (error, -errno, > > > + RTE_FLOW_ERROR_TYPE_UNSPECIFIED, > > NULL, > > > + "netlink: failed to set VTEP link up"); > > > + vtep = NULL; > > > + } else { > > > + ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error); > > > + if (ret) > > > + DRV_LOG(WARNING, > > > + "VTEP %s init failure (%d)", name, rte_errno); > > > + } > > > + } else { > > > + DRV_LOG(WARNING, > > > + "VTEP %s failed to get index (%d)", name, errno); > > > + rte_flow_error_set > > > + (error, -errno, > > > + RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL, > > > + !vtep->created ? "netlink: failed to create VTEP" : > > > + "netlink: failed to retrieve VTEP ifindex"); > > > + ret = 1; > > > > If it fails to create a vtep above, it will print out two warning messages > > and > > one rte_flow_error message. And it even selects message to print between > > two? > > And there's another info msg at the end even in case of failure. Do you > > really > > want to do this even with manipulating ret to change code path? Not a good > > practice. > > > > Usually, code path should be straightforward for sucessful path and for > > errors/failures, return immediately or use 'goto' if there's need for > > cleanup. > > > > Please refactor entire function. > > I think I'll split it in two ones - for attached and potentially shared > ifaces. > > > > > + } > > > + if (ret) { > > > + flow_tcf_delete_iface(tcf, vtep); > > > + vtep = NULL; > > > + } > > > + DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" : > > "error"); > > > + return vtep; > > > +} > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */ > > > + > > > +/** > > > + * Create target interface index for VXLAN tunneling decapsulation. > > > + * In order to share the UDP port within the other interfaces the > > > + * VXLAN device created as not attached to any interface (if created). > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] dev_flow > > > + * Flow tcf object with tunnel structure pointer set. > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * @return > > > + * Interface index on success, zero otherwise and rte_errno is set. > > > > Return negative errno in case of failure like others. > > Anyway, we have to return an index. If we do not return it as function result > we will need to provide some extra pointing parameter, it complicates the > code. You misunderstood it. See what I wrote below. The function still returns the index but in case of error, make it return negative errno instead of zero. > > > > * Interface index on success, a negative errno value otherwise and > > rte_errno is set. > > > > > + */ > > > +static unsigned int > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf, > > > + struct mlx5_flow *dev_flow, > > > + struct rte_flow_error *error) > > > +{ > > > + struct mlx5_flow_tcf_vtep *vtep, *vlst; > > > + uint16_t port = dev_flow->tcf.vxlan_decap->udp_port; > > > + > > > + vtep = NULL; > > > + LIST_FOREACH(vlst, &vtep_list_vxlan, next) { > > > + if (vlst->port == port) { > > > + vtep = vlst; > > > + break; > > > + } > > > + } > > > > You just need one variable. > > Yes. There is a long story, I forgot to revert code to one variable after > debugging. > > > > struct mlx5_flow_tcf_vtep *vtep; > > > > LIST_FOREACH(vtep, &vtep_list_vxlan, next) { > > if (vtep->port == port) > > break; > > } > > > > > + if (!vtep) { > > > + vtep = flow_tcf_create_iface(tcf, 0, port, error); > > > + if (vtep) > > > + LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next); > > > + } else { > > > + if (vtep->ifouter) { > > > + rte_flow_error_set(error, -errno, > > > + RTE_FLOW_ERROR_TYPE_UNSPECIFIED, > > NULL, > > > + "Failed to create decap VTEP, attached " > > > + "device with the same UDP port exists"); > > > + vtep = NULL; > > > > Making vtep null to skip the following code? > > Yes. To avoid multiple return operators in code. It's okay to have multiple returns. Why not? > > Please merge the two same > > if/else and make the code path strightforward. And which errno do you > > expect here? > > Should it be set EEXIST instead? > Not always. Netlink returns the code. No, that's not my point. Your code above sets errno instead of rte_errno or EEXIST. } else { if (vtep->ifouter) { rte_flow_error_set(error, -errno, Which one sets this errno? Here, it sets rte_errno because matched vtep can't be used as it already has outer iface attached (error message isn't clear, please reword it too). I thought this should be EEXIST but you set errno to rte_errno but errno isn't valid at this point. > > > > > > + } > > > + } > > > + if (vtep) { > > > + vtep->refcnt++; > > > + assert(vtep->ifindex); > > > + return vtep->ifindex; > > > + } else { > > > + return 0; > > > + } > > > > Why repeating same if/else? > > > > > > This is my suggestion but if you take my suggestion to have > > flow_tcf_[create|get|release]_iface(), this will get much simpler. > Agree. > > > > > { > > struct mlx5_flow_tcf_vtep *vtep; > > uint16_t port = dev_flow->tcf.vxlan_decap->udp_port; > > > > LIST_FOREACH(vtep, &vtep_list_vxlan, next) { > > if (vtep->port == port) > > break; > > } > > if (vtep && vtep->ifouter) > > return rte_flow_error_set(... EEXIST ...); > > else if (vtep) { > > ++vtep->refcnt; > > } else { > > vtep = flow_tcf_create_iface(tcf, 0, port, error); > > if (!vtep) > > return rte_flow_error_set(...); > > LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next); > > } > > assert(vtep->ifindex); > > return vtep->ifindex; > > } > > > > > > > +} > > > + > > > +/** > > > + * Creates target interface index for VXLAN tunneling encapsulation. > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] ifouter > > > + * Network interface index to attach VXLAN encap device to. > > > + * @param[in] dev_flow > > > + * Flow tcf object with tunnel structure pointer set. > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * @return > > > + * Interface index on success, zero otherwise and rte_errno is set. > > > + */ > > > +static unsigned int > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf, > > > + unsigned int ifouter, > > > + struct mlx5_flow *dev_flow __rte_unused, > > > + struct rte_flow_error *error) > > > +{ > > > + static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1; > > > + struct mlx5_flow_tcf_vtep *vtep, *vlst; > > > + > > > + assert(ifouter); > > > + /* Look whether the attached VTEP for encap is created. */ > > > + vtep = NULL; > > > + LIST_FOREACH(vlst, &vtep_list_vxlan, next) { > > > + if (vlst->ifouter == ifouter) { > > > + vtep = vlst; > > > + break; > > > + } > > > + } > > > > Same here. > > > > > + if (!vtep) { > > > + uint16_t pcnt; > > > + > > > + /* Not found, we should create the new attached VTEP. */ > > > +/* > > > + * TODO: not implemented yet > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter); > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter); > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter); */ > > > > Personal note is not appropriate even though it is removed in the following > > patch. > > > > > + for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX > > > + - MLX5_VXLAN_PORT_RANGE_MIN); > > pcnt++) { > > > + encap_port++; > > > + /* Wraparound the UDP port index. */ > > > + if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN > > || > > > + encap_port > MLX5_VXLAN_PORT_RANGE_MAX) > > > + encap_port = > > MLX5_VXLAN_PORT_RANGE_MIN; > > > + /* Check whether UDP port is in already in use. */ > > > + vtep = NULL; > > > + LIST_FOREACH(vlst, &vtep_list_vxlan, next) { > > > + if (vlst->port == encap_port) { > > > + vtep = vlst; > > > + break; > > > + } > > > + } > > > > If you want to find out an empty port number, you can use rte_bitmap > > instead of repeating searching the entire list for all possible port > > numbers. > > We do not expect too many VXLAN devices have been created. bitmap. +1, valid point. > > > + if (vtep) { > > > + vtep = NULL; > > > + continue; > > > + } > > > + vtep = flow_tcf_create_iface(tcf, ifouter, > > > + encap_port, error); > > > + if (vtep) { > > > + LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, > > next); > > > + break; > > > + } > > > + if (rte_errno != EEXIST) > > > + break; > > > + } > > > + } > > > + if (!vtep) > > > + return 0; > > > + vtep->refcnt++; > > > + assert(vtep->ifindex); > > > + return vtep->ifindex; > > > > Please refactor this func according to what I suggested for > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface(). > > > > > +} > > > + > > > +/** > > > + * Creates target interface index for tunneling of any type. > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] ifouter > > > + * Network interface index to attach VXLAN encap device to. > > > + * @param[in] dev_flow > > > + * Flow tcf object with tunnel structure pointer set. > > > + * @param[out] error > > > + * Perform verbose error reporting if not NULL. > > > + * @return > > > + * Interface index on success, zero otherwise and rte_errno is set. > > > > * Interface index on success, a negative errno value otherwise and > > * rte_errno is set. > > > > > + */ > > > +static unsigned int > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf, > > > + unsigned int ifouter, > > > + struct mlx5_flow *dev_flow, > > > + struct rte_flow_error *error) > > > +{ > > > + unsigned int ret; > > > + > > > + assert(dev_flow->tcf.tunnel); > > > + pthread_mutex_lock(&vtep_list_mutex); > > > + switch (dev_flow->tcf.tunnel->type) { > > > + case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP: > > > + ret = flow_tcf_encap_vtep_create(tcf, ifouter, > > > + dev_flow, error); > > > + break; > > > + case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP: > > > + ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error); > > > + break; > > > + default: > > > + rte_flow_error_set(error, ENOTSUP, > > > + RTE_FLOW_ERROR_TYPE_UNSPECIFIED, > > NULL, > > > + "unsupported tunnel type"); > > > + ret = 0; > > > + break; > > > + } > > > + pthread_mutex_unlock(&vtep_list_mutex); > > > + return ret; > > > +} > > > + > > > +/** > > > + * Deletes tunneling interface by UDP port. > > > + * > > > + * @param[in] tcf > > > + * Context object initialized by mlx5_flow_tcf_context_create(). > > > + * @param[in] ifindex > > > + * Network interface index of VXLAN device. > > > + * @param[in] dev_flow > > > + * Flow tcf object with tunnel structure pointer set. > > > + */ > > > +static void > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf, > > > + unsigned int ifindex, > > > + struct mlx5_flow *dev_flow) > > > +{ > > > + struct mlx5_flow_tcf_vtep *vtep, *vlst; > > > + > > > + assert(dev_flow->tcf.tunnel); > > > + pthread_mutex_lock(&vtep_list_mutex); > > > + vtep = NULL; > > > + LIST_FOREACH(vlst, &vtep_list_vxlan, next) { > > > + if (vlst->ifindex == ifindex) { > > > + vtep = vlst; > > > + break; > > > + } > > > + } > > > > It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel > > instead > > of ifindex_tun which is same as vtep->ifindex like the assertion below. > > Then, > > this lookup can be skipped. > > OK. Good optimization. > > > > > > + if (!vtep) { > > > + DRV_LOG(WARNING, "No VTEP device found in the list"); > > > + goto exit; > > > + } > > > + switch (dev_flow->tcf.tunnel->type) { > > > + case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP: > > > + break; > > > + case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP: > > > +/* > > > + * TODO: Remove the encap ancillary rules first. > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL); > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL); */ > > > > Is it a personal note? Please remove. > OK. > > > > > > + break; > > > + default: > > > + assert(false); > > > + DRV_LOG(WARNING, "Unsupported tunnel type"); > > > + break; > > > + } > > > + assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex); > > > + assert(vtep->refcnt); > > > + if (!vtep->refcnt || !--vtep->refcnt) { > > > + LIST_REMOVE(vtep, next); > > > + flow_tcf_delete_iface(tcf, vtep); > > > + } > > > +exit: > > > + pthread_mutex_unlock(&vtep_list_mutex); > > > +} > > > + > > > /** > > > * Apply flow to E-Switch by sending Netlink message. > > > * > > > @@ -3461,18 +3887,61 @@ struct pedit_parser { > > > struct rte_flow_error *error) { > > > struct priv *priv = dev->data->dev_private; > > > - struct mlx5_flow_tcf_context *nl = priv->tcf_context; > > > + struct mlx5_flow_tcf_context *tcf = priv->tcf_context; > > > struct mlx5_flow *dev_flow; > > > struct nlmsghdr *nlh; > > > + int ret; > > > > > > dev_flow = LIST_FIRST(&flow->dev_flows); > > > /* E-Switch flow can't be expanded. */ > > > assert(!LIST_NEXT(dev_flow, next)); > > > + if (dev_flow->tcf.applied) > > > + return 0; > > > nlh = dev_flow->tcf.nlh; > > > nlh->nlmsg_type = RTM_NEWTFILTER; > > > nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | > > NLM_F_EXCL; > > > - if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL)) > > > + if (dev_flow->tcf.tunnel) { > > > + /* > > > + * Replace the interface index, target for > > > + * encapsulation, source for decapsulation. > > > + */ > > > + assert(!dev_flow->tcf.tunnel->ifindex_tun); > > > + assert(dev_flow->tcf.tunnel->ifindex_ptr); > > > + /* Create actual VTEP device when rule is being applied. */ > > > + dev_flow->tcf.tunnel->ifindex_tun > > > + = flow_tcf_tunnel_vtep_create(tcf, > > > + *dev_flow->tcf.tunnel->ifindex_ptr, > > > + dev_flow, error); > > > + DRV_LOG(INFO, "Replace ifindex: %d->%d", > > > + dev_flow->tcf.tunnel->ifindex_tun, > > > + *dev_flow->tcf.tunnel->ifindex_ptr); > > > + if (!dev_flow->tcf.tunnel->ifindex_tun) > > > + return -rte_errno; > > > + dev_flow->tcf.tunnel->ifindex_org > > > + = *dev_flow->tcf.tunnel->ifindex_ptr; > > > + *dev_flow->tcf.tunnel->ifindex_ptr > > > + = dev_flow->tcf.tunnel->ifindex_tun; > > > + } > > > + ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL); > > > + if (dev_flow->tcf.tunnel) { > > > + DRV_LOG(INFO, "Restore ifindex: %d->%d", > > > + dev_flow->tcf.tunnel->ifindex_org, > > > + *dev_flow->tcf.tunnel->ifindex_ptr); > > > + *dev_flow->tcf.tunnel->ifindex_ptr > > > + = dev_flow->tcf.tunnel->ifindex_org; > > > + dev_flow->tcf.tunnel->ifindex_org = 0; > > > > ifindex_org looks a temporary storage in this code. And this kind of hassle > > (replace/restore) is there because you took the ifindex from the netlink > > message. Why don't you have just > > > > struct mlx5_flow_tcf_tunnel_hdr { > > uint32_t type; /**< Tunnel action type. */ > > unsigned int ifindex; /**< Original dst/src interface */ > > struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */ > > unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. > > */ }; > > > > and don't change ifindex? > > I propose to use the local variable for ifindex_org and do not keep it > in structure. *ifindex_ptr will keep. Well, you still have to restore the ifindex whenever sending the nl msg. Most of all, ifindex_ptr in nl msg isn't a right place to store the ifindex. It should have vtep ifindex but it just temporarily keeps the device ifindex until vtep is created/found. Thanks, Yongseok