On Fri, Nov 03, 2017 at 11:21:43AM +0800, jianbo....@arm.com wrote:
> The 11/02/2017 15:52, Ananyev, Konstantin wrote:
> >
> >
> > > -----Original Message-----
> > > From: Guduri Prathyusha [mailto:gprathyu...@caviumnetworks.com]
> > > Sent: Thursday, November 2, 2017 3:34 PM
> > > To: Ananyev, Konstantin <konstantin.anan...@intel.com>
> > > Cc: dev@dpdk.org; jianbo....@arm.com; guduriprathyu...@gmail.com; 
> > > Kantecki, Tomasz <tomasz.kante...@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port 
> > > grouping
> > >
> > > On Thu, Nov 02, 2017 at 02:46:43PM +0000, Ananyev, Konstantin wrote:
> > > > Hi,
> > > Hi
> > > >
> > > > > -----Original Message-----
> > > > > From: Guduri Prathyusha [mailto:gprathyu...@caviumnetworks.com]
> > > > > Sent: Thursday, November 2, 2017 2:31 PM
> > > > > To: Kantecki, Tomasz <tomasz.kante...@intel.com>
> > > > > Cc: jianbo....@arm.com; guduriprathyu...@gmail.com; Ananyev, 
> > > > > Konstantin <konstantin.anan...@intel.com>; dev@dpdk.org; Guduri
> > > > > Prathyusha <gprathyu...@caviumnetworks.com>
> > > > > Subject: [dpdk-dev] [PATCH ] examples/l3fwd: fix aliasing in port 
> > > > > grouping
> > > > >
> > > > > With -f-strict-aliasing enabled by default from -O2, gcc > 5.x gives
>
> May I ask the detail version about the gcc you are using?
>
Am using gcc provided by ubuntu rootfs

root@localhost:~/dpdk# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-linux-gnu/5/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro
5.4.0-6ubuntu1~16.04.4'
--with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs
--enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++
--prefix=/usr --program-suffix=-5 --enable-shared
--enable-linker-build-id --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --libdir=/usr/lib
--enable-nls --with-sysroot=/ --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-libquadmath --enable-plugin --with-system-zlib
--disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo
--with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-arm64/jre
--enable-java-home
--with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-arm64
--with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-arm64
--with-arch-directory=aarch64
--with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-multiarch
--enable-fix-cortex-a53-843419 --disable-werror
--enable-checking=release --build=aarch64-linux-gnu
--host=aarch64-linux-gnu --target=aarch64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)

> > > > > undefined behavior in port_groupx4. 'pn' and 'pnum' are two different
> > > > > pointers pointing to same chunk of memory and with -f-strict-aliasing 
> > > > > the
> > > > > pointers are assumed to be pointing to different memory and compiler
> > > > > reorders instructions that depend on pnum and pn. This breaks port
> > > > > grouping algorithm.
> > > > >
> > > > > This patch eliminates the usage of union and uses memcpy for copying
> > > > > gptbl[v].pnum to pn. memcpy when applied on built_in constant size 
> > > > > does
> > > > > not call its library implementation but uses appropriate LD and ST
> > > > > instructions directly and hence no performance overhead.
> > > > >
> > > > > Fixes: 569b290cdb36 ("examples/l3fwd: add NEON implementation")
> > > > > Fixes: af1694d94bf1 ("examples/l3fwd: fix crash with gcc 5")
> > > > > Signed-off-by: Guduri Prathyusha <gprathyu...@caviumnetworks.com>
> > > > > ---
> > > > >  examples/l3fwd/l3fwd_neon.h | 11 +++--------
> > > > >  examples/l3fwd/l3fwd_sse.h  | 11 +++--------
> > > > >  2 files changed, 6 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/examples/l3fwd/l3fwd_neon.h b/examples/l3fwd/l3fwd_neon.h
> > > > > index 4bc161394..10a602a04 100644
> > > > > --- a/examples/l3fwd/l3fwd_neon.h
> > > > > +++ b/examples/l3fwd/l3fwd_neon.h
> > > > > @@ -100,11 +100,6 @@ static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, uint16x8_t dp1,
> > > > >              uint16x8_t dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >         uint16x8_t mask = {1, 2, 4, 8, 0, 0, 0, 0};
> > > > >
> > > > > @@ -117,9 +112,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t 
> > > > > *lp, uint16x8_t dp1,
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > > >         }
> > > > >
> > > > >         return lp;
> > > > > diff --git a/examples/l3fwd/l3fwd_sse.h b/examples/l3fwd/l3fwd_sse.h
> > > > > index 831760f02..79a71d77e 100644
> > > > > --- a/examples/l3fwd/l3fwd_sse.h
> > > > > +++ b/examples/l3fwd/l3fwd_sse.h
> > > > > @@ -98,11 +98,6 @@ processx4_step3(struct rte_mbuf *pkt[FWDSTEP], 
> > > > > uint16_t dst_port[FWDSTEP])
> > > > >  static inline uint16_t *
> > > > >  port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t *lp, __m128i dp1, 
> > > > > __m128i dp2)
> > > > >  {
> > > > > -       union {
> > > > > -               uint16_t u16[FWDSTEP + 1];
> > > > > -               uint64_t u64;
> > > > > -       } *pnum = (void *)pn;
> > > > > -
> > > > >         int32_t v;
> > > > >
> > > > >         dp1 = _mm_cmpeq_epi16(dp1, dp2);
> > > > > @@ -114,9 +109,9 @@ port_groupx4(uint16_t pn[FWDSTEP + 1], uint16_t 
> > > > > *lp, __m128i dp1, __m128i dp2)
> > > > >
> > > > >         /* if dest port value has changed. */
> > > > >         if (v != GRPMSK) {
> > > > > -               pnum->u64 = gptbl[v].pnum;
> > > > > -               pnum->u16[FWDSTEP] = 1;
> > > > > -               lp = pnum->u16 + gptbl[v].idx;
> > > > > +               rte_memcpy(pn, &gptbl[v].pnum, sizeof(gptbl[v].pnum));
> > > > > +               pn[FWDSTEP] = 1;
> > > > > +               lp = pn + gptbl[v].idx;
> > > >
> > > > Could you explain a bit more here - which exactly instructions were 
> > > > reordered
> > > > and what kind of problems did it cause?
> > > > Specially on IA?
> > >
> > > This issue is observed on ARM since ARM gcc is more aggressive in
> > > reordering than x86 gcc.
> >
> > Ok, then if x86 is not affected why to modify l3fwd_sse.h at all?
> > Unless there is a reproducible problem with x86 -
> > my preference would be to keep that file intact.
> >
> > > In ARM when v != GRPMSK, the following
> > > instructions ordering is not guarenteed because of strict aliasing.
> > >
> > > lp[0] += gptbl[v].lpv;
> > > pnum->u64 = gptbl[v].pnum;
> > > pnum->u16[FWDSTEP] = 1;
> > > lp = pnum->u16 + gptbl[v].idx;
> >
> > Ok, so what in particular is reordered by the compiler:
> >
> >  lp[0] += gptbl[v].lpv; (1)
> >  pnum->u64 = gptbl[v].pnum; (2)
> >  pnum->u16[FWDSTEP] = 1;   (3)
> >  lp = pnum->u16 + gptbl[v].idx; (4)
> >
> > (2) and (3)?
> > If so I am not sure how it could be a problem:
> > they do stores to the different locations.
> > (1) and (4) as I can see shouldn't be reordered.
> > Anyway - if you think this a compiler reordering issue,
> > then adding rte_compiler_barrier() should fix the issue, right?
>
> Agree.
>
Yes, rte_compiler_barrier() solves the issue. I will spin a V2 if you
are ok with it.
> >
> > >
> > > That results in wrong lp[0] updation.
> > > memcpy in this case will avoid this problem.
> > >
> > > > In any case I don't think using rte_memcpy is a good thing to use here:
> > > > it is a huge inline function - way too much to copy just 64 bit 
> > > > variable.
> > >
> > > I agree that rte_memcpy is overhead in this case but how about using
> > > memcpy that will not use library implementation if the size is constant.
> > > memcpy with constant size uses built_in_memcpy that does not add
> > > performance overhead.
> >
> > On x86 rte_memcpy() doesn't call libc memcpy() at all - it is a separate 
> > function:
> > ib/librte_eal/common/include/arch/x86/rte_memcpy.h
> >
> > >
> > > Thoughts?
> >
> > As I said - if x86 is  not affected - please keep l3fwd_sse.h intact.
> > If it does (still not sure how) - check would compiler barrier help here.
> > Konstantin
> >
>
> --
> IMPORTANT NOTICE: The contents of this email and any attachments are 
> confidential and may also be privileged. If you are not the intended 
> recipient, please notify the sender immediately and do not disclose the 
> contents to any other person, use it for any purpose, or store or copy the 
> information in any medium. Thank you.

Reply via email to