https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113552
Bug ID: 113552
Summary: [11/12/13/14 Regression] vectorizer generates calls to vector math routines with 1 simd lane.
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: link-failure
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64-*

In GCC 7 the Arm vector PCS was implemented to support libmvec, but the libmvec component never made it into glibc until now. glibc 2.39, which will be paired with GCC 14, now implements the vector math routines.

However, consider this function:

> cat cosmo.fppized3.f
      SUBROUTINE a(b)
      DIMENSION b(3,0)
      COMMON c
      DO 4 m=1,c
      DO 4 d=1,3
      b(d,m)=b(d,m)+COS(5.0D00*m)
    4 CONTINUE
      END
      DIMENSION e(53)
      DIMENSION f(6,91),g(6,91),h(6,91),
     *  i(6,91),j(6,91),k(6,86)
      DIMENSION l(107)
      END

Compiling it with headers from glibc 2.39:

> aarch64-unknown-linux-gnu-gfortran -S -o - -Ofast -L/data/repro/glibc/usr/lib64 -I/data/repro/glibc/include --sysroot=/data/repro/glibc -w cosmo.fppized3.f

produces:

        fmul    v13.2d, v13.2d, v19.2d
        fmov    d0, d13
        bl      _ZGVnN1v_cos
        fmov    d12, d0
        dup     d0, v13.d[1]
        bl      _ZGVnN1v_cos
        fmov    d31, d0
        stp     d12, d31, [sp, 96]

which has deconstructed the vector into scalars and performs a vector call with 1 element. This is not just inefficient: _ZGVnN1v_cos does not exist in glibc, so the generated code cannot be linked.

It looks like the vectorizer starts with 4 floats and widens to 2x 2 doubles. But then, during vectorizable simd, this is again split into multiple vectors, even though the operation already fits in a vector:

cosmo.fppized3.f:4:13: note: ------>vectorizing SLP node starting from: _49 = __builtin_cos (_48);
cosmo.fppized3.f:4:13: note: vect_is_simple_use: operand _47 * 5.0e+0, type of def: internal
cosmo.fppized3.f:4:13: note: transform call.
cosmo.fppized3.f:4:13: note: add new stmt: _132 = BIT_FIELD_REF <vect__48.26_126, 64, 0>;
cosmo.fppized3.f:4:13: note: add new stmt: _133 = cos.simdclone.0 (_132);
cosmo.fppized3.f:4:13: note: add new stmt: _134 = BIT_FIELD_REF <vect__48.26_126, 64, 64>;
cosmo.fppized3.f:4:13: note: add new stmt: _135 = cos.simdclone.0 (_134);
cosmo.fppized3.f:4:13: note: add new stmt: vect__49.27_136 = {_133, _135};
cosmo.fppized3.f:4:13: note: add new stmt: _137 = BIT_FIELD_REF <vect__48.26_127, 64, 0>;
cosmo.fppized3.f:4:13: note: add new stmt: _138 = cos.simdclone.0 (_137);
cosmo.fppized3.f:4:13: note: add new stmt: _139 = BIT_FIELD_REF <vect__48.26_127, 64, 64>;
cosmo.fppized3.f:4:13: note: add new stmt: _140 = cos.simdclone.0 (_139);
...

Because we happen to have a V1DF mode, which is meant to be used only by some intrinsics, the operation succeeds.

So there are several issues here:

1. We should stop the new libmvec headers in glibc from applying to GCC 10, 9, 8 and 7, since we can't fix those anymore. That needs a GCC version check in the headers; however, glibc is now frozen for release.

2. The vectorizer should not decompose a simd call if the input and result don't require it.

3. We shouldn't generate a call with simdlen 1. That said, in theory this could still be beneficial, because it would allow the rest of the code to vectorize and the vector PCS is cheaper to call.
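
For readers unfamiliar with the symbol names in the assembly above, the lane count is encoded in the name itself. The following is a minimal sketch of a decoder, assuming the AArch64 vector function ABI mangling `_ZGV<isa><mask><simdlen><params>_<name>` (with `n` for Advanced SIMD, `s` for SVE, `N` for unmasked, `M` for masked). The helper names `decode_vector_name` and `is_single_lane` are hypothetical and are not part of GCC or glibc, and the pattern only covers the simple `v`/`u` parameter kinds.

```python
import re

# Assumed mangling: _ZGV <isa> <mask> <simdlen> <params> _ <scalar name>.
# Only plain vector ('v') and uniform ('u') parameters are handled here;
# linear and aligned parameter encodings are deliberately left out.
_VECTOR_NAME_RE = re.compile(
    r"^_ZGV(?P<isa>[ns])(?P<mask>[NM])(?P<simdlen>\d+|x)"
    r"(?P<params>[vu]+)_(?P<name>\w+)$"
)


def decode_vector_name(symbol):
    """Split a vector function symbol into its parts, or return None."""
    match = _VECTOR_NAME_RE.match(symbol)
    if match is None:
        return None
    simdlen = match.group("simdlen")
    return {
        "isa": {"n": "Advanced SIMD", "s": "SVE"}[match.group("isa")],
        "masked": match.group("mask") == "M",
        # 'x' marks a scalable (length-agnostic) variant; report it as None.
        "simdlen": None if simdlen == "x" else int(simdlen),
        "params": list(match.group("params")),
        "scalar_name": match.group("name"),
    }


def is_single_lane(symbol):
    """True when the symbol names a vector variant with exactly one lane."""
    info = decode_vector_name(symbol)
    return info is not None and info["simdlen"] == 1


if __name__ == "__main__":
    for sym in ("_ZGVnN1v_cos", "_ZGVnN2v_cos"):
        print(sym, decode_vector_name(sym), is_single_lane(sym))
```

Under that assumption, `_ZGVnN1v_cos` decodes to an unmasked Advanced SIMD variant of `cos` with simdlen 1, which is the variant the report says glibc does not provide. A check like `is_single_lane` could be run over the `bl` targets in the generated assembly to spot this pattern.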