Reduced the CC list (changing the topic slightly)

> >
> > My understanding is that the generated code for both your patch and my
> > changes above is the same. Above suggested changes will conform to
> > ACLE recommendation.
>
> Though instructions are different. Effective cycles are same even though First
> dup updates the four positions.

Can you elaborate on how the instructions are different? I wrote the following code with both the methods:

uint32x4_t u32x4_gather_gcc (uint32_t *p0, uint32_t *p1, uint32_t *p2, uint32_t *p3)
{
    uint32x4_t r = {*p0, *p1, *p2, *p3};
    return r;
}

uint32x4_t u32x4_gather_acle (uint32_t *p0, uint32_t *p1, uint32_t *p2, uint32_t *p3)
{
    uint32x4_t r;

    r = vdupq_n_u32 (*p0);
    r = vsetq_lane_u32 (*p1, r, 1);
    r = vsetq_lane_u32 (*p2, r, 2);
    r = vsetq_lane_u32 (*p3, r, 3);
    return r;
}

The generated code has the same instructions for both (omitted the unwanted parts):

u32x4_gather_gcc:
        ld1r    {v0.4s}, [x0]
        ld1     {v0.s}[1], [x1]
        ld1     {v0.s}[2], [x2]
        ld1     {v0.s}[3], [x3]
        ret

u32x4_gather_acle:
        ld1r    {v0.4s}, [x0]
        ld1     {v0.s}[1], [x1]
        ld1     {v0.s}[2], [x2]
        ld1     {v0.s}[3], [x3]
        ret

The first 'ld1r' updates all the lanes in both the cases.

> To make forward progress send the v2 based on the updated logic just to
> make ACLE Spec happy, I don’t see any real reason to do it though 😊

Thanks for the patch, it was important to make forward progress. But, I think we should carry forward the discussion as I plan to change other parts of DPDK on similar lines. I want to understand why you think there is no real reason. The ACLE recommendation mentions the reasoning.

> > http://patches.dpdk.org/patch/54656/
>
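
As a functional cross-check that does not depend on the generated assembly, here is a minimal sketch of the same two gather styles. It assumes GCC/Clang generic vector extensions in place of arm_neon.h, so it also builds on non-Arm hosts. The names u32x4, gather_init, gather_lanes and check_same are made up for illustration. It only shows that both forms yield the same lane contents; it says nothing about instruction selection or cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: GCC/Clang vector extension used as a portable stand-in
 * for the NEON uint32x4_t type. */
typedef uint32_t u32x4 __attribute__((vector_size(16)));

/* Style 1: build the vector with an initializer list. */
static u32x4 gather_init(const uint32_t *p0, const uint32_t *p1,
                         const uint32_t *p2, const uint32_t *p3)
{
    u32x4 r = {*p0, *p1, *p2, *p3};
    return r;
}

/* Style 2: broadcast *p0 to all lanes, then overwrite lanes 1..3,
 * mirroring the vdupq_n_u32 + vsetq_lane_u32 sequence. */
static u32x4 gather_lanes(const uint32_t *p0, const uint32_t *p1,
                          const uint32_t *p2, const uint32_t *p3)
{
    u32x4 r = {*p0, *p0, *p0, *p0};
    r[1] = *p1;
    r[2] = *p2;
    r[3] = *p3;
    return r;
}

/* Asserts that both styles produce identical lanes for the given inputs. */
static void check_same(const uint32_t *p0, const uint32_t *p1,
                       const uint32_t *p2, const uint32_t *p3)
{
    u32x4 a = gather_init(p0, p1, p2, p3);
    u32x4 b = gather_lanes(p0, p1, p2, p3);
    for (int i = 0; i < 4; i++)
        assert(a[i] == b[i]);
}
```

Calling check_same with pointers to four distinct values exercises every lane, including the lanes that the initial broadcast fills and the later stores overwrite.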