https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117850

            Bug ID: 117850
           Summary: GCC emits DUP, UMULL instead of UMULL2
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following example:

#include <arm_neon.h>

uint16x8_t foo(const uint8x16_t s) {
    const uint8x16_t f0 = vdupq_n_u8(4);
    return vmull_u8(vget_high_u8(s), vget_high_u8(f0));
}

compiled with -O3 generates:

foo(__Uint8x16_t):
        movi    v31.8b, 0x4
        dup     d0, v0.d[1]
        umull   v0.8h, v0.8b, v31.8b
        ret

instead of

foo(__Uint8x16_t):
        movi    v1.16b, #4
        umull2  v0.8h, v0.16b, v1.16b
        ret

I think we can fix this and other cases by lowering them in GIMPLE.

Concretely, the above could be lowered to VEC_WIDEN_MUL and, based on the
BIT_FIELD_REFs generated by the vget_high calls, folded into the proper _lo or
_hi variant.
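
As a rough illustration of why that fold is valid, here is a scalar C model
of the two forms. This is not GCC code: the helper names (get_high_u8,
widen_mul_u8, widen_mul_hi_u8, forms_agree) are hypothetical and only mimic
the lane semantics of vget_high_u8, UMULL and UMULL2 on plain arrays.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vget_high_u8: copy lanes 8..15 of a 16-lane vector. */
static void get_high_u8(const uint8_t in[16], uint8_t out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = in[i + 8];
}

/* Scalar model of vmull_u8 (UMULL): widening multiply of 8 lanes. */
static void widen_mul_u8(const uint8_t a[8], const uint8_t b[8],
                         uint16_t out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint16_t)((uint16_t)a[i] * (uint16_t)b[i]);
}

/* Scalar model of UMULL2 (the _hi variant): widening multiply that reads
   lanes 8..15 directly from the full 16-lane inputs. */
static void widen_mul_hi_u8(const uint8_t a[16], const uint8_t b[16],
                            uint16_t out[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = (uint16_t)((uint16_t)a[i + 8] * (uint16_t)b[i + 8]);
}

/* Returns 1 when "extract high halves, then UMULL" and "UMULL2 on the
   full vectors" give the same result for s and a splat of `splat`. */
static int forms_agree(const uint8_t s[16], uint8_t splat)
{
    uint8_t f0[16], s_hi[8], f0_hi[8];
    uint16_t via_extract[8], via_hi[8];

    memset(f0, splat, sizeof f0);          /* model of vdupq_n_u8(splat) */

    get_high_u8(s, s_hi);
    get_high_u8(f0, f0_hi);
    widen_mul_u8(s_hi, f0_hi, via_extract); /* DUP + UMULL form */

    widen_mul_hi_u8(s, f0, via_hi);         /* UMULL2 form */

    return memcmp(via_extract, via_hi, sizeof via_hi) == 0;
}
```

Both paths read the same eight source lanes, so the explicit high-half
extraction is redundant once the multiply itself selects the high half.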

To do this, though, we might need to expose valueize to the API so we can look
at the operands rather than having to chase up the SSA_NAME_DEF_STMT.

Are you OK with this, Richi?
