mlx5: e-switch VXLAN flow translation routine

Yongseok Koh Thu, 25 Oct 2018 21:22:21 -0700

On Thu, Oct 25, 2018 at 07:37:56AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Tuesday, October 23, 2018 13:06
> > To: Slava Ovsiienko <[email protected]>
> > Cc: Shahaf Shuler <[email protected]>; [email protected]
> > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> > routine
> > 
> > On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
[...]
> > > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> > >   *   Pointer to the list of actions.
> > >   * @param[out] action_flags
> > >   *   Pointer to the detected actions.
> > > + * @param[out] tunnel
> > > + *   Pointer to tunnel encapsulation parameters structure to fill.
> > >   *
> > >   * @return
> > >   *   Maximum size of memory for actions.
> > >   */
> > >  static int
> > >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > > -                       uint64_t *action_flags)
> > > +                       uint64_t *action_flags,
> > > +                       void *tunnel)
> > 
> > This func is to get actions and size but you are parsing and filling tunnel 
> > info
> > here. It would be better to move parsing to translate() because it anyway 
> > has
> > multiple if conditions (same as switch/case) to set TCA_TUNNEL_KEY_ENC_*
> > there.
> Do you mean call of flow_tcf_vxlan_encap_parse(actions, tunnel)?


Yes.

> OK, let's move it to translate stage. Anyway, we need to keep encap structure
> for local/neigh rules.
> 
> > 
> > >  {
> > >   int size = 0;
> > >   uint64_t flags = 0;
> > > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> > >                           SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID.
> > */
> > >                           SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio.
> > */
> > >                   break;
> > > +         case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +                 size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +                         SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +                         SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > */
> > > +                         SZ_NLATTR_TYPE_OF(uint8_t);
> > > +                 size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +                 size += flow_tcf_vxlan_encap_parse(actions, tunnel)
> > +
> > > +                         RTE_ALIGN_CEIL /* preceding encap params.
> > */
> > > +                         (sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +                         MNL_ALIGNTO);
> > 
> > Is it different from SZ_NLATTR_TYPE_OF(struct
> > mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO) instead.
> 
> It is written intentionally in this form. It means that there is struct 
> mlx5_flow_tcf_vxlan_encap 
> at the beginning of buffer. This is not the NL attribute, usage of 
> SZ_NLATTR_TYPE_OF is
> not relevant here. Alignment is needed for the following Netlink message.

Good point. Understood.

> > 
> > > +                 flags |= MLX5_ACTION_VXLAN_ENCAP;
> > > +                 break;
> > > +         case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +                 size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +                         SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +                         SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > */
> > > +                         SZ_NLATTR_TYPE_OF(uint8_t);
> > > +                 size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +                 size += RTE_ALIGN_CEIL /* preceding decap params.
> > */
> > > +                         (sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +                         MNL_ALIGNTO);
> > 
> > Same here.
> > 
> > > +                 flags |= MLX5_ACTION_VXLAN_DECAP;
> > > +                 break;
> > >           case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> > >           case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > >           case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > > @@ -2289,6 +2578,26 @@ struct pedit_parser {  }
> > >
> > >  /**
> > > + * Convert VXLAN VNI to 32-bit integer.
> > > + *
> > > + * @param[in] vni
> > > + *   VXLAN VNI in 24-bit wire format.
> > > + *
> > > + * @return
> > > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > > + */
> > > +static rte_be32_t
> > 
> > make it inline.
> OK. Missed point.
> 
> > 
> > > +vxlan_vni_as_be32(const uint8_t vni[3]) {
> > > + rte_be32_t ret;
> > 
> > Defining ret as rte_be32_t? The return value of this func which is 
> > bswap(ret)
> > is also rte_be32_t??
> Yes. And it is directly stored in the net-endian NL attribute. 
> I've compiled and checked the listing of the function you proposed. It seems 
> to be best, I'll take it.
> 
> > 
> > > +
> > > + ret = vni[0];
> > > + ret = (ret << 8) | vni[1];
> > > + ret = (ret << 8) | vni[2];
> > > + return RTE_BE32(ret);
> > 
> > Use rte_cpu_to_be_*() instead. But I still don't understand why you shuffle
> > bytes twice. One with shift and or and other by bswap().
> And it works. There are three bytes in very bizarre order (in NL attribute) - 
> 0, vni[0], vni[1], vni[2].
> 
> > 
> > {
> >     union {
> >             uint8_t vni[4];
> >             rte_be32_t dword;
> >     } ret = {
> >             .vni = { 0, vni[0], vni[1], vni[2] },
> >     };
> >     return ret.dword;
> > }
> > 
> > This will have the same result without extra cost.
> 
> OK. Investigated, it is the best for x86-64. Also I'm going to test it on the 
> ARM 32,
> with various compilers, just curious.
> 
> > 
> > > +}
> > > +
> > > +/**
> > >   * Prepare a flow object for Linux TC flower. It calculates the maximum
> > size of
> > >   * memory required, allocates the memory, initializes Netlink message
> > headers
> > >   * and set unique TC message handle.
> > > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> > >   struct mlx5_flow *dev_flow;
> > >   struct nlmsghdr *nlh;
> > >   struct tcmsg *tcm;
> > > + struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > > + uint8_t *sp, *tun = NULL;
> > >
> > >   size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > > - size += flow_tcf_get_actions_and_size(actions, action_flags);
> > > - dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > > + size += flow_tcf_get_actions_and_size(actions, action_flags,
> > &encap);
> > > + dev_flow = rte_zmalloc(__func__, size,
> > > +                 RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > > +                         (size_t)MNL_ALIGNTO));
> > 
> > Why RTE_MAX between the two? Note that it is alignment for start address
> > of the memory and the minimum alignment is cacheline size. On x86, non-
> > zero value less than 64 will have same result as 64.
> 
> OK. Thanks for note.
> It is not expected the structure alignments exceed the cache line size.
> So? Just specify zero?
> > 
> > >   if (!dev_flow) {
> > >           rte_flow_error_set(error, ENOMEM,
> > >                              RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > >                              "not enough memory to create E-Switch
> > flow");
> > >           return NULL;
> > >   }
> > > - nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > > + sp = (uint8_t *)(dev_flow + 1);
> > > + if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > > +         tun = sp;
> > > +         sp += RTE_ALIGN_CEIL
> > > +                 (sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +                 MNL_ALIGNTO);
> > 
> > And why should it be aligned? 
> 
> Netlink message should be aligned. It follows the mlx5_flow_tcf_vxlan_encap,
> that's why the pointer is aligned.

Not true. There's no requirement for nl msg buffer alignment. MNL_ALIGNTO is for
mainly size alignment. For example, checkout the source code of
mnl_nlmsg_put_header(void *buf). There's no requirement of aligning the start
address of buf. But, size of any entries (hdr, attr ...) should be aligned to
MNL_ALIGNTO(4).

> 
> As the size of dev_flow might not be aligned, it
> > is meaningless, isn't it? If you think it must be aligned for better 
> > performance
> > (not much anyway), you can use __rte_aligned(MNL_ALIGNTO) on the struct
> Hm. Where we can use __rte_aligned? Could you clarify, please.

For example,

struct mlx5_flow_tcf_tunnel_hdr {
        uint32_t type; /**< Tunnel action type. */
        unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
        unsigned int ifindex_org; /**< Original dst/src interface */
        unsigned int *ifindex_ptr; /**< Interface ptr in message. */
} __rte_aligned(MNL_ALIGNTO);

A good example is the struct rte_mbuf. If this attribute is used, the size of
the struct will be aligned to the value.

If you still want to make the nl msg aligned,

        dev_flow = rte_zmalloc(..., MNL_ALIGNTO); /* anyway cacheline aligned. 
*/
        tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);
        nlh = mnl_nlmsg_put_header(tun);

with adding '__rte_aligned(MNL_ALIGNTO)' to struct 
mlx5_flow_tcf_vxlan_encap/decap.

Then, nlh will be aligned. You should make sure size is correctly calculated.

> 
> > definition but not for mlx5_flow (it's not only for tcf, have to do it 
> > manually).
> > 
> > > +         size -= RTE_ALIGN_CEIL
> > > +                 (sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +                 MNL_ALIGNTO);
> > 
> > Don't you have to subtract sizeof(struct mlx5_flow) as well? But like I
> > mentioned, if '.nlsize' below isn't needed, you don't need to have this
> > calculation either.
> Yes, it is a bug. Should be fixed. Thank you.
> Let's discuss whether we can keep the nlsize under NDEBUG switch.

I agreed on using NDEBUG for it.

> 
> > 
> > > +         encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > > +         memcpy(tun, &encap,
> > > +                sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > > + } else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > +         tun = sp;
> > > +         sp += RTE_ALIGN_CEIL
> > > +                 (sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +                 MNL_ALIGNTO);
> > > +         size -= RTE_ALIGN_CEIL
> > > +                 (sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +                 MNL_ALIGNTO);
> > > +         encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > > +         memcpy(tun, &encap,
> > > +                sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > > + }
> > > + nlh = mnl_nlmsg_put_header(sp);
> > >   tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > >   *dev_flow = (struct mlx5_flow){
> > >           .tcf = (struct mlx5_flow_tcf){
> > > +                 .nlsize = size,
> > >                   .nlh = nlh,
> > >                   .tcm = tcm,
> > > +                 .tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > > +                 .item_flags = *item_flags,
> > > +                 .action_flags = *action_flags,
> > >           },
> > >   };
> > >   /*
[...]
> > > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> > >                                   (na_vlan_priority) =
> > >                                   conf.of_set_vlan_pcp->vlan_pcp;
> > >                   }
> > > +                 assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +                 break;
> > > +         case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +                 assert(decap.vxlan);
> > > +                 assert(dev_flow->tcf.tunnel);
> > > +                 dev_flow->tcf.tunnel->ifindex_ptr
> > > +                         = (unsigned int *)&tcm->tcm_ifindex;
> > > +                 na_act_index =
> > > +                         mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +                 assert(na_act_index);
> > > +                 mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +                 na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +                 assert(na_act);
> > > +                 mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +                         sizeof(struct tc_tunnel_key),
> > > +                         &(struct tc_tunnel_key){
> > > +                                 .action = TC_ACT_PIPE,
> > > +                                 .t_action =
> > TCA_TUNNEL_KEY_ACT_RELEASE,
> > > +                                 });
> > > +                 mnl_attr_nest_end(nlh, na_act);
> > > +                 mnl_attr_nest_end(nlh, na_act_index);
> > > +                 assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +                 break;
> > > +         case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +                 assert(encap.vxlan);
> > > +                 na_act_index =
> > > +                         mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +                 assert(na_act_index);
> > > +                 mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +                 na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +                 assert(na_act);
> > > +                 mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +                         sizeof(struct tc_tunnel_key),
> > > +                         &(struct tc_tunnel_key){
> > > +                                 .action = TC_ACT_PIPE,
> > > +                                 .t_action =
> > TCA_TUNNEL_KEY_ACT_SET,
> > > +                                 });
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > > +                         mnl_attr_put_u16(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_DST_PORT,
> > > +                                  encap.vxlan->udp.dst);
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > > +                         mnl_attr_put_u32(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > > +                                  encap.vxlan->ipv4.src);
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > > +                         mnl_attr_put_u32(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > > +                                  encap.vxlan->ipv4.dst);
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > > +                         mnl_attr_put(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > > +                                  sizeof(encap.vxlan->ipv6.src),
> > > +                                  &encap.vxlan->ipv6.src);
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > > +                         mnl_attr_put(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > > +                                  sizeof(encap.vxlan->ipv6.dst),
> > > +                                  &encap.vxlan->ipv6.dst);
> > > +                 if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > > +                         mnl_attr_put_u32(nlh,
> > > +                                  TCA_TUNNEL_KEY_ENC_KEY_ID,
> > > +                                  vxlan_vni_as_be32
> > > +                                         (encap.vxlan->vxlan.vni));
> > > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > > +                 mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM,
> > 0); #endif
> > 
> > TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> > you treat it differently with #ifdef/#endif?
> 
> As it was found it is not defined on old kernels, on some our CI machines
> compilation errors occurred.

In your first patch, TCA_TUNNEL_KEY_NO_CSUM is defined if there isn't
HAVE_TC_ACT_TUNNEL_KEY. Actually I'm wondering why it is different from
HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT. It looks like the following is needed -
HAVE_TCA_TUNNEL_KEY_NO_CSUM ??


        #ifdef HAVE_TC_ACT_TUNNEL_KEY

        #include <linux/tc_act/tc_tunnel_key.h>

        #ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
        #define TCA_TUNNEL_KEY_ENC_DST_PORT 9
        #endif

        #ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
        #define TCA_TUNNEL_KEY_NO_CSUM 10
        #endif

        #else /* HAVE_TC_ACT_TUNNEL_KEY */


Thanks,
Yongseok

Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine

Reply via email to