Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler

Honnappa Nagarahalli Sun, 16 Jun 2019 17:49:18 -0700

> >
> > Reduced the CC list (changing the topic slightly)
> >
> > > >
> > > > My understanding is that the generated code for both your patch
> > > > and my changes above is the same. Above suggested changes will
> > > > conform to ACLE recommendation.
> > >
> > > Though instructions are different. Effective cycles are same even
> > > though First dup updates the four positions.
> > Can you elaborate on how the instructions are different?
> > I wrote the following code with both the methods:
> >
> > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2,
> > uint32_t *p3) {
> >      uint32x4_t r = {*p0, *p1, *p2, *p3};
> >
> >      return r;
> > }
> >
> > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t
> > *p2, uint32_t *p3) {
> >      uint32x4_t r;
> >
> >      r = vdupq_n_u32 (* p0);
> >      r = vsetq_lane_u32 (*p1, r, 1);
> >      r = vsetq_lane_u32 (*p2, r, 2);
> >      r = vsetq_lane_u32 (*p3, r, 3);
> >
> >      return r;
> > }
> >
> > The generated code has the same instructions for both (omitted the
> > unwanted
> > parts):
> >
> > u32x4_gather_gcc:
> >         ld1r    {v0.4s}, [x0]
> >         ld1     {v0.s}[1], [x1]
> >         ld1     {v0.s}[2], [x2]
> >         ld1     {v0.s}[3], [x3]
> >         ret
> >
> > u32x4_gather_acle:
> >         ld1r    {v0.4s}, [x0]
> >         ld1     {v0.s}[1], [x1]
> >         ld1     {v0.s}[2], [x2]
> >         ld1     {v0.s}[3], [x3]
> >         ret
> >
> > The first 'ld1r' updates all the lanes in both the cases.
> 
> 
> Please check actual generated code for ACL case. We can see difference
I think there is something wrong with the way you are looking at the generated 
code. Please see comments below.


>  0x00000000005cc1dc <+1884>:  80 6a 65 bc     ldr     s0, [x20, x5]
> vs
>   0x00000000005cc1dc <+1884>:  9e 6a 65 b8     ldr     w30, [x20, x5]
The register W30 is a scalar register.

> 
> With patch:
> 
> 244                     /* Gather 4 bytes of input data for each stream. */
> 245                     input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0));
>    0x00000000005cc1c8 <+1864>:  b4 4f 46 a9     ldp     x20, x19, [x29, #96]
>    0x00000000005cc1d8 <+1880>:  65 02 40 b9     ldr     w5, [x19]
>    0x00000000005cc1dc <+1884>:  80 6a 65 bc     ldr     s0, [x20, x5]
>    0x00000000005cc26c <+2028>:  73 12 00 91     add     x19, x19, #0x4
>    0x00000000005cc2ac <+2092>:  b3 37 00 f9     str     x19, [x29, #104]
> 
> 246                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), 
> input,
This one and below ones are not containing any vector instructions.

> 1);
>    0x00000000005cc1d0 <+1872>:  a6 9f 47 a9     ldp     x6, x7, [x29, #120]
>    0x00000000005cc1ec <+1900>:  e5 00 40 b9     ldr     w5, [x7]
>    0x00000000005cc1f0 <+1904>:  d6 68 65 b8     ldr     w22, [x6, x5]
>    0x00000000005cc21c <+1948>:  e7 10 00 91     add     x7, x7, #0x4
>    0x00000000005cc260 <+2016>:  a7 43 00 f9     str     x7, [x29, #128]
> 
> 247                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), 
> input,
> 2);
>    0x00000000005cc1d4 <+1876>:  b5 4b 40 f9     ldr     x21, [x29, #144]
>    0x00000000005cc1f4 <+1908>:  a6 4f 40 f9     ldr     x6, [x29, #152]
>    0x00000000005cc1f8 <+1912>:  d4 00 40 b9     ldr     w20, [x6]
>    0x00000000005cc1fc <+1916>:  b5 6a 74 b8     ldr     w21, [x21, x20]
>    0x00000000005cc224 <+1956>:  c6 10 00 91     add     x6, x6, #0x4
>    0x00000000005cc264 <+2020>:  a6 4f 00 f9     str     x6, [x29, #152]
> 
> 248                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), 
> input,
> 3);
>    0x00000000005cc200 <+1920>:  a5 5b 40 f9     ldr     x5, [x29, #176]
>    0x00000000005cc204 <+1924>:  b4 00 40 b9     ldr     w20, [x5]
>    0x00000000005cc208 <+1928>:  a5 10 00 91     add     x5, x5, #0x4
>    0x00000000005cc218 <+1944>:  b7 57 40 f9     ldr     x23, [x29, #168]
>    0x00000000005cc220 <+1952>:  f4 6a 74 b8     ldr     w20, [x23, x20]
>    0x00000000005cc228 <+1960>:  a5 5b 00 f9     str     x5, [x29, #176]
> 
> With out patch:
This generated code does not contain any vector instructions. Can you please 
check?
I changed the code to be similar to ACL code, please look at [1], the generated 
code is the same.

[1] https://gcc.godbolt.org/z/p1sQNA

> 
>    245                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), 
> input,
> 0);
>    0x00000000005cc1c8 <+1864>:  b4 4f 46 a9     ldp     x20, x19, [x29, #96]
>    0x00000000005cc1d8 <+1880>:  65 02 40 b9     ldr     w5, [x19]
>    0x00000000005cc1dc <+1884>:  9e 6a 65 b8     ldr     w30, [x20, x5]
>    0x00000000005cc248 <+1992>:  73 12 00 91     add     x19, x19, #0x4
>    0x00000000005cc24c <+1996>:  b3 37 00 f9     str     x19, [x29, #104]
> 
> 246                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), 
> input,
> 1);
>    0x00000000005cc1d0 <+1872>:  a6 9f 47 a9     ldp     x6, x7, [x29, #120]
>    0x00000000005cc1ec <+1900>:  e5 00 40 b9     ldr     w5, [x7]
>    0x00000000005cc1f0 <+1904>:  d6 68 65 b8     ldr     w22, [x6, x5]
>    0x00000000005cc228 <+1960>:  e7 10 00 91     add     x7, x7, #0x4
>    0x00000000005cc240 <+1984>:  a7 43 00 f9     str     x7, [x29, #128]
> 
> 247                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), 
> input,
> 2);
>    0x00000000005cc1d4 <+1876>:  b5 4b 40 f9     ldr     x21, [x29, #144]
>    0x00000000005cc1f4 <+1908>:  a6 4f 40 f9     ldr     x6, [x29, #152]
>    0x00000000005cc1f8 <+1912>:  d4 00 40 b9     ldr     w20, [x6]
>    0x00000000005cc1fc <+1916>:  b5 6a 74 b8     ldr     w21, [x21, x20]
>    0x00000000005cc22c <+1964>:  c6 10 00 91     add     x6, x6, #0x4
>    0x00000000005cc244 <+1988>:  a6 4f 00 f9     str     x6, [x29, #152]
> 
> 248                     input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), 
> input,
> 3);
>    0x00000000005cc200 <+1920>:  a5 5b 40 f9     ldr     x5, [x29, #176]
>    0x00000000005cc204 <+1924>:  b4 00 40 b9     ldr     w20, [x5]
>    0x00000000005cc208 <+1928>:  a5 10 00 91     add     x5, x5, #0x4
>    0x00000000005cc21c <+1948>:  b7 57 40 f9     ldr     x23, [x29, #168]
>    0x00000000005cc224 <+1956>:  f4 6a 74 b8     ldr     w20, [x23, x20]
>    0x00000000005cc230 <+1968>:  a5 5b 00 f9     str     x5, [x29, #176]
> 
> 
> >
> > > To make forward progress send the v2 based on the updated logic
> > > just to make ACLE  Spec happy, I don’t see any real reason to do it
> > > though
> > > 😊
> > Thanks for the patch, it was important to make forward progress.
> > But, I think we should carry forward the discussion as I plan to
> > change other parts of DPDK on similar lines. I want to understand why
> > you think there is no real reason. The ACLE recommendation mentions the
> reasoning.
> 
> # I see following in the ACLE spec. What is the actual reasoning?
> "
> ACLE does not define static construction of vector types. E.g.
>  int32x4_t x = { 1, 2, 3, 4 };
> Is not portable. Use the vcreate or vdup intrinsics to construct values from
> scalars.
> "
Here is the complete text from ACLE 2.1

12.2.6 Compatibility with other vector programming models
Programmers should take particular care when combining the Neon Intrinsics API 
with alternative vector programming models; ACLE does not specify how the NEON 
Intrinsics API interoperates with them.
For instance, the GCC vector extension permits
include “arm_neon.h”
...
uint32x2_t x = {0, 1}; // GCC extension.
uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic.
But with this code the value stored in ‘y’ will depend on both the target 
architecture (AArch32 or AArch64) and whether the program is running in big- or 
little-endian mode.
It is recommended that NEON Intrinsics be used consistently:
include “arm_neon.h”
...
const int temp[2] = {0, 1};
uint32x2_t x = vld1_s32 (temp);
uint32_t y = vget_lane_s32 (x, 0);

> 
> # Why does compiler(gcc) allows if it not indented to use?
I do not have an answer. This is a recommendation and all that I am trying to 
say is, following the recommendation does not cost us anything in performance.

> 
> # I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan)
> Gcc feature to DPDK to detect undefined behavior checks to detect such case
I am not sure if it helps here.

> 
> >
> 
> > >
> > > http://patches.dpdk.org/patch/54656/
> > >

Re: [dpdk-dev] [PATCH] acl: fix build issue with some arm64 compiler

Reply via email to