https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114987
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
            Summary|[14/15 regression] floating |[14/15 Regression] floating
                   |point vector regression,    |point vector regression,
                   |x86, between gcc 14 and     |x86, between gcc 14 and
                   |gcc-13 using -O3 and target |gcc-13 using -O3 and target
                   |clones on skylake platforms |clones on skylake platforms
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2024-05-10
             Target|x86_64                      |x86_64-*-*
   Target Milestone|---                         |14.2

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
I can't reproduce a slowdown on a Zen2 CPU.  The difference seems to be
merely instruction scheduling.

I do note we're not doing a good job in handling

  for (i = 0; i < LOOPS_PER_CALL; i++) {
    r.v = r.v + add.v;
  }

where r.v and add.v are AVX512-sized vectors when emulating them with AVX
vectors.  We end up with

  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:

after lowering from 512-bit to 256-bit vectors, and there's no pass that would
demote the 512-bit reduction value to two 256-bit ones.  There are also weird
things going on in the target/on RTL.

A smaller testcase illustrating the code generation issue is

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}

So confirmed for non-optimal code, but I don't see how it's a regression.
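[Editorial illustration, not part of the original comment.]  A hand-split
variant of the small testcase sketches the demotion the comment says no pass
performs: keeping two 256-bit accumulators live across the loop instead of
extracting and rebuilding a 512-bit value every iteration.  The function name
foo_split and the union-based splitting are assumptions made purely for the
example.

typedef float v8sf  __attribute__((vector_size(sizeof(float)*8)));
typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo_split (v16sf * __restrict r, v16sf *a, int n)
{
  /* Split the 512-bit values into two 256-bit halves once, outside
     the loop (hypothetical hand transformation).  */
  union { v16sf w; v8sf h[2]; } acc = { .w = *r }, add = { .w = *a };
  for (int i = 0; i < n; ++i)
    {
      /* Two independent 256-bit adds; no per-iteration extract/insert.  */
      acc.h[0] = acc.h[0] + add.h[0];
      acc.h[1] = acc.h[1] + add.h[1];
    }
  /* Recombine the halves once, after the loop.  */
  *r = acc.w;
}

Compiling both versions with something like gcc -O3 -march=znver2 (a
256-bit-vector target, matching the Zen2 machine mentioned above) should
show the difference in the generated loop body; the exact flags are an
assumption, not taken from the report.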