https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92822
Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |wilco at gcc dot gnu.org
          Component|target                      |tree-optimization

--- Comment #4 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to nsz from comment #2)
> e.g.
>
> #include <arm_neon.h>
>
> float32x2_t
> foo (float32x2_t v0, float32x4_t v1)
> {
>   return vmulx_laneq_f32 (v0, v1, 0);
> }
>
> used to get translated to
>
> foo:
>         fmulx   v0.2s, v0.2s, v1.s[0]
>         ret
>
> now it is
>
> foo:
>         adrp    x0, .LC0
>         ldr     q2, [x0, #:lo12:.LC0]
>         tbl     v1.16b, {v1.16b}, v2.16b
>         fmulx   v0.2s, v0.2s, v1.2s
>         ret
>         .size   foo, .-foo
>         .section        .rodata.cst16,"aM",@progbits,16

Yes the change inserts a VEC_PERM_EXPR with random values for the upper lanes
which becomes a TBL instruction. It happens when you extract a lane from a
128-bit vector and then dup it to a 64-bit vector.

Optimized tree before:

foo (float32x2_t v0, float32x4_t v1)
{
  float _4;
  __Float32x2_t _5;
  __Float32x2_t _6;

  <bb 2> [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _4 = BIT_FIELD_REF <v1_3(D), 32, 0>;
  _5 = {_4, _4};
  _6 = __builtin_aarch64_fmulxv2sf (v0_2(D), _5); [tail call]
  return _6;
}

And after r278938:

foo (float32x2_t v0, float32x4_t v1)
{
  __Float32x2_t _4;
  __Float32x2_t _7;
  __Float32x4_t _8;

  <bb 2> [local count: 1073741824]:
  __builtin_aarch64_im_lane_boundsi (16, 4, 0);
  _8 = VEC_PERM_EXPR <v1_3(D), v1_3(D), { 0, 0, 0, 1 }>;
  _7 = BIT_FIELD_REF <_8, 64, 0>;
  _4 = __builtin_aarch64_fmulxv2sf (v0_2(D), _7); [tail call]
  return _4;
}
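
For anyone who wants to compare the two tree forms without an AArch64 toolchain, here is a rough scalar model in portable C. It is only a sketch: the types v2f/v4f and the helpers extract_then_dup, vec_perm, perm_then_low_half and check_forms_agree are hypothetical names made up for illustration, not GCC or ACLE APIs. It assumes little-endian lane numbering (so BIT_FIELD_REF <_8, 64, 0> reads lanes 0 and 1) and assumes VEC_PERM_EXPR selector indices wrap modulo twice the lane count across the two concatenated inputs.

```c
#include <assert.h>

/* Hypothetical array stand-ins for __Float32x2_t and __Float32x4_t. */
typedef struct { float lane[2]; } v2f;
typedef struct { float lane[4]; } v4f;

/* "Before" form:  _4 = BIT_FIELD_REF <v1, 32, 0>;  _5 = {_4, _4};
   Extract one 32-bit lane, then duplicate it into a 64-bit vector. */
static v2f extract_then_dup(v4f v1, int idx)
{
    float s = v1.lane[idx];
    v2f r = { { s, s } };
    return r;
}

/* Model of a two-input permute on 4-lane vectors: selector values
   0..3 pick from a, 4..7 pick from b (assumed to wrap modulo 8). */
static v4f vec_perm(v4f a, v4f b, const int sel[4])
{
    v4f r;
    for (int i = 0; i < 4; i++) {
        int k = sel[i] & 7;
        r.lane[i] = (k < 4) ? a.lane[k] : b.lane[k - 4];
    }
    return r;
}

/* "After" form:  _8 = VEC_PERM_EXPR <v1, v1, {0, 0, 0, 1}>;
                  _7 = BIT_FIELD_REF <_8, 64, 0>;
   Only lanes 0 and 1 of the permute result are ever read, so the
   last two selector entries do not affect the value returned. */
static v2f perm_then_low_half(v4f v1)
{
    static const int sel[4] = { 0, 0, 0, 1 };
    v4f p = vec_perm(v1, v1, sel);
    v2f r = { { p.lane[0], p.lane[1] } };
    return r;
}

/* Sanity check: both forms agree on the two lanes that are used. */
static void check_forms_agree(v4f v1)
{
    v2f before = extract_then_dup(v1, 0);
    v2f after = perm_then_low_half(v1);
    assert(before.lane[0] == after.lane[0]);
    assert(before.lane[1] == after.lane[1]);
}
```

Calling check_forms_agree on any v4f value from a main function should pass, which is consistent with the two dumps computing the same operand for fmulx. The difference is only in cost: the second form needs a full 128-bit permute (the TBL plus a constant-pool load) where a by-element fmulx used to be enough.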