On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> Hi,
Hi
>
> > -----Original Message-----
> > From: Guduri Prathyusha [mailto:gprathyu...@caviumnetworks.com]
> > Sent: Thursday, November 2, 2017 2:31 PM
> > To: Kantecki, Tomasz <tomasz.kante...@intel.com>
> > Cc: jianbo....@arm.com; guduriprathyu...@gmail.com; Ananyev, Konstantin 
> > <konstantin.anan...@intel.com>; dev@dpdk.org; Guduri
> > Prathyusha <gprathyu...@caviumnetworks.com>
> > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port grouping
> >
> > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
> > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > pointers pointing to same chunk of memory and with -f-strict-aliasing the
> > pointers are assumed to be pointing to different memory and compiler
> > reorders instructions that depend on pnum and pn. This breaks port
> > grouping algorithm.
> >
> > This patch eliminates the usage of union and uses memcpy for copying
> > gptbl[v].pnum to pn. memcpy when applied on built_in constant size does
> > not call its library implementation but uses appropriate LD and ST
> > instructions directly and hence no performance overhead.
> >
> > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > Signed-off-by: Guduri Prathyusha <gprathyu...@caviumnetworks.com>
> > ---
> >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> >  2 files changed, 6 insertions(+), 16 deletions(-)
> >
> > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > index 4bc161394..10a602a04 100644
> > --- a/examples/l3fwd/l3fwd_neon.h
> > +++ b/examples/l3fwd/l3fwd_neon.h
> > @@ -100,11 +100,6 @@ static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> >          uint16x8_t dp2)
> >  {
> > -   union {
> > -           uint16_t u16[FWDSTEP + 1];
> > -           uint64_t u64;
> > -   } *pnum = (void *)pn;
> > -
> >     int32_t v;
> >     uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> >
> > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, 
> > uint16x8_t dp1,
> >
> >     /* if dest port value has changed. */
> >     if (v != GRPMSK) {
> > -           pnum->u64 = gptbl[v].pnum;
> > -           pnum->u16[FWDSTEP] = 1;
> > -           lp = pnum->u16 + gptbl[v].idx;
> > +           rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +           pn[FWDSTEP] = 1;
> > +           lp = pn + gptbl[v].idx;
> >     }
> >
> >     return lp;
> > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > index 831760f02..79a71d77e 100644
> > --- a/examples/l3fwd/l3fwd_sse.h
> > +++ b/examples/l3fwd/l3fwd_sse.h
> > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], uint16_t 
> > dst_port[FWDSTEP])
> >  static inline uint16_t *
> >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, __m128i 
> > dp2)
> >  {
> > -   union {
> > -           uint16_t u16[FWDSTEP + 1];
> > -           uint64_t u64;
> > -   } *pnum = (void *)pn;
> > -
> >     int32_t v;
> >
> >     dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, 
> > __m128i dp1, __m128i dp2)
> >
> >     /* if dest port value has changed. */
> >     if (v != GRPMSK) {
> > -           pnum->u64 = gptbl[v].pnum;
> > -           pnum->u16[FWDSTEP] = 1;
> > -           lp = pnum->u16 + gptbl[v].idx;
> > +           rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > +           pn[FWDSTEP] = 1;
> > +           lp = pn + gptbl[v].idx;
>
> Could you explain a bit more here - which exactly instructions were reordered
> and what kind of problems did it cause?
> Specially on IA?

This issue is observed on ARM, since gcc for ARM reorders more
aggressively than gcc for x86. On ARM, when v != GRPMSK, the ordering
of the following instructions is not guaranteed because of strict aliasing:

lp[0] += gptbl[v].lpv;
pnum->u64 = gptbl[v].pnum;
pnum->u16[FWDSTEP] = 1;
lp = pnum->u16 + gptbl[v].idx;

That results in an incorrect update of lp[0].
Using memcpy here avoids the problem.
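
To make the intended ordering concrete, here is a reduced sketch of the
sequence with memcpy in place of the union. struct grp_entry and
group_update() are hypothetical stand-ins for a gptbl[] entry and the
tail of port_groupx4(); the field types are assumed, not copied from
the l3fwd sources.

```c
#include <stdint.h>
#include <string.h>

#define FWDSTEP 4

/* Hypothetical stand-in for one gptbl[] entry (field types assumed). */
struct grp_entry {
	uint64_t pnum; /* four packed uint16_t group counters */
	int32_t idx;   /* offset of the last group start within pn[] */
	uint16_t lpv;  /* increment for the previous group's counter */
};

/* Hypothetical reduction of the tail of port_groupx4(). */
static inline uint16_t *
group_update(uint16_t pn[FWDSTEP + 1], uint16_t *lp,
	     const struct grp_entry *e)
{
	/* Update the counter of the group that is still open. */
	lp[0] += e->lpv;

	/*
	 * memcpy may legally alias any object, so the compiler has to
	 * keep this store ordered against the uint16_t accesses around it.
	 */
	memcpy(pn, &e->pnum, sizeof(e->pnum));
	pn[FWDSTEP] = 1;

	return pn + e->idx;
}
```

Since lp and pn are both uint16_t pointers and the 64-bit store goes
through memcpy, no access here relies on type punning.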

> In any case I don't think using rte_memcpy is a good thing to use here:
> it is a huge inline function - way too much to copy just 64 bit variable.

I agree that rte_memcpy is overhead in this case, but how about using
plain memcpy? With a constant size, the compiler does not call the
library implementation; it uses __builtin_memcpy, which adds no
performance overhead.
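
For illustration, the copy could be isolated in a small helper like the
one below. store_u64_as_u16x4() is a hypothetical name, not something
in the patch. With the size fixed at compile time, gcc at -O2 would
typically be expected to expand the call inline, though that should be
confirmed by checking the generated code on the target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a packed 64-bit value into four uint16_t slots. */
static inline void
store_u64_as_u16x4(uint16_t dst[4], uint64_t src)
{
	/* The copy size is a compile-time constant (8 bytes). */
	_Static_assert(sizeof(uint64_t) == 4 * sizeof(uint16_t),
		       "size mismatch");
	memcpy(dst, &src, sizeof(src));
}
```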

Thoughts?

> Konstantin
>
> >     }
> >
> >     return lp;
> > --
> > 2.14.1
>
