Richard Biener <rguent...@suse.de> writes: > On Thu, 13 Mar 2025, Richard Sandiford wrote: >> vect_slp_prefer_store_lanes_p allows an SLP tree to be split even >> if the tree could use store-lanes, provided that one of the new >> groups would operate on full vectors for each scalar iteration. >> That heuristic is no longer firing for gcc.target/aarch64/pr99873_2.c. >> >> The test contains: >> >> void __attribute ((noipa)) >> foo (uint64_t *__restrict x, uint64_t *__restrict y, int n) >> { >> for (int i = 0; i < n; i += 4) >> { >> x[i] += y[i]; >> x[i + 1] += y[i + 1]; >> x[i + 2] |= y[i + 2]; >> x[i + 3] |= y[i + 3]; >> } >> } >> >> and wants us to use V2DI for the first two elements and V2DI for >> the second two elements, rather than LD4s and ST4s. This gives: >> >> .L3: >> ldp q31, q0, [x0] >> add w3, w3, 1 >> ldp q29, q30, [x1], 32 >> orr v30.16b, v0.16b, v30.16b >> add v31.2d, v29.2d, v31.2d >> stp q31, q30, [x0], 32 >> cmp w2, w3 >> bhi .L3 >> >> instead of: >> >> .L4: >> ld4 {v28.2d - v31.2d}, [x2] >> ld4 {v24.2d - v27.2d}, [x3], 64 >> add v24.2d, v28.2d, v24.2d >> add v25.2d, v29.2d, v25.2d >> orr v26.16b, v30.16b, v26.16b >> orr v27.16b, v31.16b, v27.16b >> st4 {v24.2d - v27.2d}, [x2], 64 >> cmp x2, x5 >> bne .L4 >> >> The first loop only handles half the amount of data per iteration, >> but it requires far fewer internal permutations. >> >> One reason the heuristic no longer fired looks like a typo: the call >> to vect_slp_prefer_store_lanes_p was passing "1" as the new group size, >> instead of "i". >> >> However, even with that fixed, vect_analyze_slp later falls back on >> single-lane SLP with load/store lanes. I think that heuristic too >> should use vect_slp_prefer_store_lanes_p (but it otherwise looks good). >> >> The question is whether every load should pass >> vect_slp_prefer_store_lanes_p or whether just one is enough. >> I don't have an example that would make the call either way, >> so I went for the latter, given that it's the smaller change >> from the status quo. >> >> This also appears to improve fotonik3d and roms from SPEC2017 >> (cross-checked against two different systems). >> >> Bootstrapped & regression-tested on aarch64-linux-gnu, where it >> fixes the pr99873_2.c regression. OK to install? > > risc-v is also heavily depending on load/store-lanes so please run > this throug the riscv CI as well. > > Otherwise this is all heuristics, so I'm fine with this change.
Thanks! https://github.com/ewlu/gcc-precommit-ci/issues/3160 showed no new regressions, so I went ahead and pushed the patch. Richard