https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117850
Bug ID: 117850
Summary: GCC emits DUP, UMULL instead of UMULL2
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*

The following example:

    #include <arm_neon.h>

    uint16x8_t foo(const uint8x16_t s)
    {
        const uint8x16_t f0 = vdupq_n_u8(4);
        return vmull_u8(vget_high_u8(s), vget_high_u8(f0));
    }

compiled with -O3 generates:

    foo(__Uint8x16_t):
            movi    v31.8b, 0x4
            dup     d0, v0.d[1]
            umull   v0.8h, v0.8b, v31.8b
            ret

instead of

    foo(__Uint8x16_t):
            movi    v1.16b, #4
            umull2  v0.8h, v0.16b, v1.16b
            ret

I think we can fix this and other cases by lowering them in GIMPLE.
Concretely, the above could be lowered to VEC_WIDEN_MUL and, based on the
BIT_FIELD_REFs generated by the vget_high's, folded into the proper _lo or
_hi variant.

To do this, though, we might need to expose valueize to the API so that we
can look at the operands rather than having to chase up the
SSA_NAME_DEF_STMT.

Are you ok with this, Richi?
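
To illustrate why the two instruction sequences are interchangeable, here is a minimal scalar sketch in portable C. It is not GCC code and does not use the NEON intrinsics; the helper names (umull_model, umull2_model, get_high_model, high_lowering_matches) are hypothetical and exist only for this illustration. It models vget_high_u8 as "take lanes 8..15", UMULL as a widening multiply of two 8-lane vectors, and UMULL2 as a widening multiply of the upper 8 lanes of two 16-lane vectors.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of UMULL: widening multiply of two 8-lane u8 vectors. */
static void umull_model(uint16_t out[8], const uint8_t a[8], const uint8_t b[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] * b[i]);   /* 255 * 255 fits in 16 bits */
}

/* Scalar model of UMULL2: widening multiply of the upper 8 lanes
   of two 16-lane u8 vectors. */
static void umull2_model(uint16_t out[8], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i + 8] * b[i + 8]);
}

/* Scalar model of vget_high_u8: extract lanes 8..15. */
static void get_high_model(uint8_t out[8], const uint8_t v[16])
{
    memcpy(out, v + 8, 8);
}

/* Returns 1 when "extract high halves, then UMULL" and "UMULL2 on the
   full vectors" agree for input s and splat constant c. */
static int high_lowering_matches(const uint8_t s[16], uint8_t c)
{
    uint8_t f0[16], s_hi[8], f0_hi[8];
    uint16_t via_umull[8], via_umull2[8];

    memset(f0, c, sizeof f0);               /* models vdupq_n_u8(c) */

    get_high_model(s_hi, s);                /* models vget_high_u8(s)  */
    get_high_model(f0_hi, f0);              /* models vget_high_u8(f0) */
    umull_model(via_umull, s_hi, f0_hi);    /* the DUP + UMULL form    */

    umull2_model(via_umull2, s, f0);        /* the UMULL2 form         */

    return memcmp(via_umull, via_umull2, sizeof via_umull) == 0;
}
```

Calling high_lowering_matches(s, 4) with any 16-byte input should return 1, which is the property a _hi fold would rely on. On AArch64, the ACLE intrinsic vmull_high_u8 expresses the high-half form directly at the source level.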