https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92822

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |wilco at gcc dot gnu.org
          Component|target                      |tree-optimization

--- Comment #4 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to nsz from comment #2)
> e.g.
> 
> #include <arm_neon.h>
> 
> float32x2_t
> foo (float32x2_t v0, float32x4_t v1)
> {
>   return vmulx_laneq_f32 (v0, v1, 0);
> }
> 
> used to get translated to
> 
> foo:
>         fmulx   v0.2s, v0.2s, v1.s[0]
>         ret
> 
> now it is
> 
> foo:
>       adrp    x0, .LC0
>       ldr     q2, [x0, #:lo12:.LC0]
>       tbl     v1.16b, {v1.16b}, v2.16b
>       fmulx   v0.2s, v0.2s, v1.2s
>       ret
>       .size   foo, .-foo
>       .section        .rodata.cst16,"aM",@progbits,16

Yes, the change inserts a VEC_PERM_EXPR with arbitrary (don't-care) values for
the upper lanes, which then becomes a TBL instruction. It happens when you
extract a lane from a 128-bit vector and then dup it to a 64-bit vector.
Optimized tree before:

foo (float32x2_t v0, float32x4_t v1)
{
  float _4;
  __Float32x2_t _5;
  __Float32x2_t _6;

  <bb 2> [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _4 = BIT_FIELD_REF <v1_3(D), 32, 0>;
  _5 = {_4, _4};
  _6 = __builtin_aarch64_fmulxv2sf (v0_2(D), _5); [tail call]
  return _6;
}

And after r278938:

foo (float32x2_t v0, float32x4_t v1)
{
  __Float32x2_t _4;
  __Float32x2_t _7;
  __Float32x4_t _8;

  <bb 2> [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _8 = VEC_PERM_EXPR <v1_3(D), v1_3(D), { 0, 0, 0, 1 }>;
  _7 = BIT_FIELD_REF <_8, 64, 0>;
  _4 = __builtin_aarch64_fmulxv2sf (v0_2(D), _7); [tail call]
  return _4;
}
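
To make the equivalence of the two dumps concrete, here is a minimal sketch in
plain C that models both forms on ordinary arrays. It is only an illustration:
the f32x2/f32x4 structs and the function names are hypothetical stand-ins, not
GCC internals, and it assumes little-endian lane numbering so that
BIT_FIELD_REF <_8, 64, 0> selects lanes 0 and 1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for __Float32x2_t / __Float32x4_t. */
typedef struct { float lane[2]; } f32x2;
typedef struct { float lane[4]; } f32x4;

/* Before: _4 = BIT_FIELD_REF <v1, 32, 0>;  _5 = {_4, _4}; */
static f32x2 dup_lane0_before(f32x4 v1)
{
  float s = v1.lane[0];            /* extract lane 0 as a scalar */
  f32x2 r = { { s, s } };          /* dup it into a 64-bit vector */
  return r;
}

/* After: _8 = VEC_PERM_EXPR <v1, v1, { 0, 0, 0, 1 }>;
          _7 = BIT_FIELD_REF <_8, 64, 0>; */
static f32x2 dup_lane0_after(f32x4 v1)
{
  static const unsigned sel[4] = { 0, 0, 0, 1 };
  f32x4 perm;
  for (int i = 0; i < 4; i++)
    /* Both permute inputs are v1, so the index can be reduced modulo 4. */
    perm.lane[i] = v1.lane[sel[i] % 4];

  /* Only the low 64 bits are used; lanes 2 and 3 are don't-care. */
  f32x2 r = { { perm.lane[0], perm.lane[1] } };
  return r;
}

/* Asserts that both forms produce bit-identical results for v1. */
static void check_forms_agree(f32x4 v1)
{
  f32x2 a = dup_lane0_before(v1);
  f32x2 b = dup_lane0_after(v1);
  assert(memcmp(&a, &b, sizeof a) == 0);
}
```

Both functions return the same value, so the difference is purely in code
generation: the first form maps onto the by-element fmulx, while the second
goes through a full 128-bit permute whose upper lanes are never read.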
