> > > > Reduced the CC list (changing the topic slightly) > > > > > > > > > > My understanding is that the generated code for both your patch > > > > and my changes above is the same. Above suggested changes will > > > > conform to ACLE recommendation. > > > > > > Though instructions are different. Effective cycles are same even > > > though First dup updates the four positions. > > Can you elaborate on how the instructions are different? > > I wrote the following code with both the methods: > > > > uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2, > > uint32_t *p3) { > > uint32x4_t r = {*p0, *p1, *p2, *p3}; > > > > return r; > > } > > > > uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t > > *p2, uint32_t *p3) { > > uint32x4_t r; > > > > r = vdupq_n_u32 (* p0); > > r = vsetq_lane_u32 (*p1, r, 1); > > r = vsetq_lane_u32 (*p2, r, 2); > > r = vsetq_lane_u32 (*p3, r, 3); > > > > return r; > > } > > > > The generated code has the same instructions for both (omitted the > > unwanted > > parts): > > > > u32x4_gather_gcc: > > ld1r {v0.4s}, [x0] > > ld1 {v0.s}[1], [x1] > > ld1 {v0.s}[2], [x2] > > ld1 {v0.s}[3], [x3] > > ret > > > > u32x4_gather_acle: > > ld1r {v0.4s}, [x0] > > ld1 {v0.s}[1], [x1] > > ld1 {v0.s}[2], [x2] > > ld1 {v0.s}[3], [x3] > > ret > > > > The first 'ld1r' updates all the lanes in both the cases. > > > Please check actual generated code for ACL case. We can see difference I think there is something wrong with the way you are looking at the generated code. Please see comments below.
> 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] > vs > 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] The register W30 is a scalar register. > > With patch: > > 244 /* Gather 4 bytes of input data for each stream. */ > 245 input = vdupq_n_s32(GET_NEXT_4BYTES(parms, 0)); > 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] > 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] > 0x00000000005cc1dc <+1884>: 80 6a 65 bc ldr s0, [x20, x5] > 0x00000000005cc26c <+2028>: 73 12 00 91 add x19, x19, #0x4 > 0x00000000005cc2ac <+2092>: b3 37 00 f9 str x19, [x29, #104] > > 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > input, This one and below ones are not containing any vector instructions. > 1); > 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] > 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] > 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] > 0x00000000005cc21c <+1948>: e7 10 00 91 add x7, x7, #0x4 > 0x00000000005cc260 <+2016>: a7 43 00 f9 str x7, [x29, #128] > > 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > input, > 2); > 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] > 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] > 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] > 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] > 0x00000000005cc224 <+1956>: c6 10 00 91 add x6, x6, #0x4 > 0x00000000005cc264 <+2020>: a6 4f 00 f9 str x6, [x29, #152] > > 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > input, > 3); > 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] > 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] > 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 > 0x00000000005cc218 <+1944>: b7 57 40 f9 ldr x23, [x29, #168] > 0x00000000005cc220 <+1952>: f4 6a 74 b8 ldr w20, [x23, x20] > 0x00000000005cc228 <+1960>: a5 5b 00 f9 str x5, [x29, #176] > > With out patch: This generated code does not contain any vector instructions. Can you please check? I changed the code to be similar to ACL code, please look at [1], the generated code is the same. [1] https://gcc.godbolt.org/z/p1sQNA > > 245 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 0), > input, > 0); > 0x00000000005cc1c8 <+1864>: b4 4f 46 a9 ldp x20, x19, [x29, #96] > 0x00000000005cc1d8 <+1880>: 65 02 40 b9 ldr w5, [x19] > 0x00000000005cc1dc <+1884>: 9e 6a 65 b8 ldr w30, [x20, x5] > 0x00000000005cc248 <+1992>: 73 12 00 91 add x19, x19, #0x4 > 0x00000000005cc24c <+1996>: b3 37 00 f9 str x19, [x29, #104] > > 246 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 1), > input, > 1); > 0x00000000005cc1d0 <+1872>: a6 9f 47 a9 ldp x6, x7, [x29, #120] > 0x00000000005cc1ec <+1900>: e5 00 40 b9 ldr w5, [x7] > 0x00000000005cc1f0 <+1904>: d6 68 65 b8 ldr w22, [x6, x5] > 0x00000000005cc228 <+1960>: e7 10 00 91 add x7, x7, #0x4 > 0x00000000005cc240 <+1984>: a7 43 00 f9 str x7, [x29, #128] > > 247 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 2), > input, > 2); > 0x00000000005cc1d4 <+1876>: b5 4b 40 f9 ldr x21, [x29, #144] > 0x00000000005cc1f4 <+1908>: a6 4f 40 f9 ldr x6, [x29, #152] > 0x00000000005cc1f8 <+1912>: d4 00 40 b9 ldr w20, [x6] > 0x00000000005cc1fc <+1916>: b5 6a 74 b8 ldr w21, [x21, x20] > 0x00000000005cc22c <+1964>: c6 10 00 91 add x6, x6, #0x4 > 0x00000000005cc244 <+1988>: a6 4f 00 f9 str x6, [x29, #152] > > 248 input = vsetq_lane_s32(GET_NEXT_4BYTES(parms, 3), > input, > 3); > 0x00000000005cc200 <+1920>: a5 5b 40 f9 ldr x5, [x29, #176] > 0x00000000005cc204 <+1924>: b4 00 40 b9 ldr w20, [x5] > 0x00000000005cc208 <+1928>: a5 10 00 91 add x5, x5, #0x4 > 0x00000000005cc21c <+1948>: b7 57 40 f9 ldr x23, [x29, #168] > 0x00000000005cc224 <+1956>: f4 6a 74 b8 ldr w20, [x23, x20] > 0x00000000005cc230 <+1968>: a5 5b 00 f9 str x5, [x29, #176] > > > > > > > To make forward progress send the v2 based on the updated logic > > > just to make ACLE Spec happy, I don’t see any real reason to do it > > > though > > > 😊 > > Thanks for the patch, it was important to make forward progress. > > But, I think we should carry forward the discussion as I plan to > > change other parts of DPDK on similar lines. I want to understand why > > you think there is no real reason. The ACLE recommendation mentions the > reasoning. > > # I see following in the ACLE spec. What is the actual reasoning? > " > ACLE does not define static construction of vector types. E.g. > int32x4_t x = { 1, 2, 3, 4 }; > Is not portable. Use the vcreate or vdup intrinsics to construct values from > scalars. > " Here is the complete text from ACLE 2.1 12.2.6 Compatibility with other vector programming models Programmers should take particular care when combining the Neon Intrinsics API with alternative vector programming models; ACLE does not specify how the NEON Intrinsics API interoperates with them. For instance, the GCC vector extension permits include “arm_neon.h” ... uint32x2_t x = {0, 1}; // GCC extension. uint32_t y = vget_lane_s32 (x, 0); // ACLE NEON Intrinsic. But with this code the value stored in ‘y’ will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode. It is recommended that NEON Intrinsics be used consistently: include “arm_neon.h” ... const int temp[2] = {0, 1}; uint32x2_t x = vld1_s32 (temp); uint32_t y = vget_lane_s32 (x, 0); > > # Why does compiler(gcc) allows if it not indented to use? I do not have an answer. This is a recommendation and all that I am trying to say is, following the recommendation does not cost us anything in performance. > > # I think, it may be time to introduce UndefinedBehaviorSanitizer (UBSan) > Gcc feature to DPDK to detect undefined behavior checks to detect such case I am not sure if it helps here. > > > > > > > > > > http://patches.dpdk.org/patch/54656/ > > >