st4 failure

Richard Sandiford Fri, 14 Mar 2025 03:31:41 -0700

Richard Biener <rguent...@suse.de> writes:
> On Thu, 13 Mar 2025, Richard Sandiford wrote:
>> vect_slp_prefer_store_lanes_p allows an SLP tree to be split even
>> if the tree could use store-lanes, provided that one of the new
>> groups would operate on full vectors for each scalar iteration.
>> That heuristic is no longer firing for gcc.target/aarch64/pr99873_2.c.
>> 
>> The test contains:
>> 
>> void __attribute ((noipa))
>> foo (uint64_t *__restrict x, uint64_t *__restrict y, int n)
>> {
>>   for (int i = 0; i < n; i += 4)
>>     {
>>       x[i] += y[i];
>>       x[i + 1] += y[i + 1];
>>       x[i + 2] |= y[i + 2];
>>       x[i + 3] |= y[i + 3];
>>     }
>> }
>> 
>> and wants us to use V2DI for the first two elements and V2DI for
>> the second two elements, rather than LD4s and ST4s.  This gives:
>> 
>> .L3:
>>         ldp     q31, q0, [x0]
>>         add     w3, w3, 1
>>         ldp     q29, q30, [x1], 32
>>         orr     v30.16b, v0.16b, v30.16b
>>         add     v31.2d, v29.2d, v31.2d
>>         stp     q31, q30, [x0], 32
>>         cmp     w2, w3
>>         bhi     .L3
>> 
>> instead of:
>> 
>> .L4:
>>         ld4     {v28.2d - v31.2d}, [x2]
>>         ld4     {v24.2d - v27.2d}, [x3], 64
>>         add     v24.2d, v28.2d, v24.2d
>>         add     v25.2d, v29.2d, v25.2d
>>         orr     v26.16b, v30.16b, v26.16b
>>         orr     v27.16b, v31.16b, v27.16b
>>         st4     {v24.2d - v27.2d}, [x2], 64
>>         cmp     x2, x5
>>         bne     .L4
>> 
>> The first loop only handles half the amount of data per iteration,
>> but it requires far fewer internal permutations.
>> 
>> One reason the heuristic no longer fired looks like a typo: the call
>> to vect_slp_prefer_store_lanes_p was passing "1" as the new group size,
>> instead of "i".
>> 
>> However, even with that fixed, vect_analyze_slp later falls back on
>> single-lane SLP with load/store lanes.  I think that heuristic too
>> should use vect_slp_prefer_store_lanes_p (but it otherwise looks good).
>> 
>> The question is whether every load should pass
>> vect_slp_prefer_store_lanes_p or whether just one is enough.
>> I don't have an example that would make the call either way,
>> so I went for the latter, given that it's the smaller change
>> from the status quo.
>> 
>> This also appears to improve fotonik3d and roms from SPEC2017
>> (cross-checked against two different systems).
>> 
>> Bootstrapped & regression-tested on aarch64-linux-gnu, where it
>> fixes the pr99873_2.c regression.  OK to install?
>
> risc-v is also heavily depending on load/store-lanes so please run
> this throug the riscv CI as well.
>
> Otherwise this is all heuristics, so I'm fine with this change.


Thanks!  https://github.com/ewlu/gcc-precommit-ci/issues/3160
showed no new regressions, so I went ahead and pushed the patch.

Richard

Re: [PATCH] vect: Fix aarch64/pr99873_2.c ld4/st4 failure

Reply via email to