Re: [PATCH][tree-optimization/61338] - Optimize redundant reverse permutations in vectorized stores

Kugan Vivekanandarajah Wed, 22 Oct 2025 03:14:35 -0700


> On 22 Oct 2025, at 8:32 pm, Tamar Christina <[email protected]> wrote:
>
>
>> -----Original Message-----
>> From: Richard Biener <[email protected]>
>> Sent: 22 October 2025 09:53
>> To: Kugan Vivekanandarajah <[email protected]>
>> Cc: [email protected]; Tamar Christina <[email protected]>
>> Subject: Re: [PATCH][tree-optimization/61338] - Optimize redundant reverse
>> permutations in vectorized stores
>>
>> On Tue, Oct 21, 2025 at 11:57 PM Kugan Vivekanandarajah
>> <[email protected]> wrote:
>>>
>>> Hi Richard,
>>>
>>> Thanks for the review.
>>>
>>>> On 15 Oct 2025, at 10:39 pm, Richard Biener <[email protected]> wrote:
>>>>
>>>>
>>>> On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This patch eliminates redundant reverse permutations in vectorized reverse
>>>>> loops by detecting and optimizing patterns during store vectorization.
>>>>>
>>>>> The reverse load (b[i]) generates a PERM, the operations are applied, and
>>>>> then the reverse store adds another PERM. This creates redundant permute
>>>>> pairs that we now detect and eliminate.
>>>>>
>>>>> With the patch, for the example loop
>>>>> for (int i = N - 1; i >= 0; i--)
>>>>>   {
>>>>>     a[i] = b[i] + 1.0f;
>>>>>   }
>>>>> the generated code changes as follows:
>>>>> -       ldr     q29, [x0, x2]
>>>>> -       tbl     v29.16b, {v29.16b}, v31.16b
>>>>> -       fadd    v29.4s, v29.4s, v30.4s
>>>>> -       tbl     v29.16b, {v29.16b}, v31.16b
>>>>> -       str     q29, [x3, x2]
>>>>> +       ldr     q30, [x0, x2]
>>>>> +       fadd    v30.4s, v30.4s, v31.4s
>>>>> +       str     q30, [x3, x2]
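To illustrate why the two tbl permutes in the quoted diff cancel out, here is a minimal scalar sketch. It assumes four float lanes (as in v29.4s); the names LANES, reverse_lanes, add_one_with_permutes, add_one_direct and permutes_are_redundant are hypothetical and not part of the patch. It only models the lane order, not the actual vectorizer output.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* Assumption: one 128-bit vector of four floats.  */

/* Reverse the lane order in place; stands in for the tbl permute.  */
static void
reverse_lanes (float v[LANES])
{
  for (size_t i = 0; i < LANES / 2; i++)
    {
      float tmp = v[i];
      v[i] = v[LANES - 1 - i];
      v[LANES - 1 - i] = tmp;
    }
}

/* Models the old sequence: load, permute, add, permute, store.  */
void
add_one_with_permutes (float a[LANES], const float b[LANES])
{
  for (size_t i = 0; i < LANES; i++)
    a[i] = b[i];
  reverse_lanes (a);
  for (size_t i = 0; i < LANES; i++)
    a[i] += 1.0f;
  reverse_lanes (a);
}

/* Models the new sequence: load, add, store.  */
void
add_one_direct (float a[LANES], const float b[LANES])
{
  for (size_t i = 0; i < LANES; i++)
    a[i] = b[i] + 1.0f;
}

/* Return 1 if both sequences produce identical lanes for input B.  */
int
permutes_are_redundant (const float b[LANES])
{
  float x[LANES], y[LANES];
  add_one_with_permutes (x, b);
  add_one_direct (y, b);
  for (size_t i = 0; i < LANES; i++)
    if (x[i] != y[i])
      return 0;
  return 1;
}

/* Small self-check on one sample vector.  */
void
check_sample_vector (void)
{
  const float b[LANES] = { 1.0f, 2.0f, 3.0f, 4.0f };
  assert (permutes_are_redundant (b));
}
```

Because the add is applied lane by lane, each element meets the same operand whether or not the lanes were reversed first, so the second reverse simply undoes the first.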
>>>>
>>>> So this works basically as a post-processing optimization at the time we
>>>> generate the vector store.  While that's in principle an OK optimization,
>>>> I'd rather have such post-processing implemented outside of the vectorizer,
>>>> because then permutes not originating from vectorizer permuted store
>>>> generation would also benefit.
>>>>
>>>> As for implementing this in the vectorizer itself, the more appropriate
>>>> thing would be to expose these permutes to the permute optimization phase,
>>>> because then it can also be taken into account during costing, and a
>>>> reverse load permute could be elided if it feeds an associatable reduction.
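A small sketch of the reduction case mentioned in the quoted paragraph, under the assumption of an unsigned integer sum, where reordering is always exact. The function names are hypothetical. For floating point the same reordering changes rounding, so GCC only reassociates such reductions when -fassociative-math (for example via -ffast-math) is in effect.

```c
#include <assert.h>
#include <stddef.h>

/* Sum B from the last element down, as a negative-stride loop would.  */
unsigned int
sum_reversed (const unsigned int *b, size_t n)
{
  unsigned int sum = 0;
  for (size_t i = n; i-- > 0;)
    sum += b[i];
  return sum;
}

/* Sum B in memory order, i.e. without reversing the loaded lanes.  */
unsigned int
sum_forward (const unsigned int *b, size_t n)
{
  unsigned int sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += b[i];
  return sum;
}

/* Unsigned addition wraps, so the order of the lanes does not matter.  */
void
check_reduction_order (void)
{
  const unsigned int b[4] = { 7u, 11u, 13u, 4000000000u };
  assert (sum_forward (b, 4) == sum_reversed (b, 4));
}
```

Since the result does not depend on lane order, a reverse permute that only feeds such a reduction does no useful work.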
>>>>
>>>> There is, unfortunately, currently no good way to represent how we implement
>>>> negative strided contiguous accesses with load permutations, as the
>>>> peculiarity only exposes itself after applying the VF, and load/lane
>>>> permutations are represented on the VF == 1 SLP graph.  One of my ideas was
>>>> that once we settle on VF (and possibly vector types) we want to expand the
>>>> SLP graph to cover all lanes of the vector loop so we can expose actual
>>>> permutes at vector granularity.  This is a bit far off though.
>>>>
>>>> So in line with your patch, but more appropriate for in-vectorizer operation,
>>>> would be an analysis on the SLP graph that simply marks reverse permutes that
>>>> can be elided (for the back-to-back case).  This way both costing and code
>>>> generation can take this into account and you wouldn't have to adjust any
>>>> stmts.
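As a rough sketch of the marking analysis suggested in the quoted text, the following works on a toy expression graph rather than on the real SLP data structures. Everything here (struct toy_node, the toy_kind values, toy_mark_elidable_reverses, the depth limit) is hypothetical and only meant to show the shape of the back-to-back check. It also assumes each load has a single user; a shared load would need its permute kept for the other users.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a vectorized expression graph (not the SLP graph).  */
enum toy_kind { TOY_LOAD, TOY_INVARIANT, TOY_ELEMENTWISE, TOY_STORE };

struct toy_node
{
  enum toy_kind kind;
  bool reverse;              /* Load/store carries a reverse permute.  */
  bool elide;                /* Set by the analysis: permute not needed.  */
  struct toy_node *ops[2];   /* Operands; unused slots are NULL.  */
};

/* True if every leaf below NODE is a reversed load or a lane-invariant
   value, looking through at most DEPTH element-wise operations.  */
static bool
all_leaves_reversed (const struct toy_node *node, int depth)
{
  switch (node->kind)
    {
    case TOY_LOAD:
      return node->reverse;
    case TOY_INVARIANT:
      return true;           /* Same value in every lane.  */
    case TOY_ELEMENTWISE:
      if (depth == 0)
        return false;        /* Give up past the search limit.  */
      for (size_t i = 0; i < 2; i++)
        if (node->ops[i] && !all_leaves_reversed (node->ops[i], depth - 1))
          return false;
      return true;
    default:
      return false;
    }
}

/* Mark the reverse permute of every load below NODE as elidable.  */
static void
mark_loads (struct toy_node *node)
{
  if (node == NULL)
    return;
  if (node->kind == TOY_LOAD)
    node->elide = true;
  else if (node->kind == TOY_ELEMENTWISE)
    for (size_t i = 0; i < 2; i++)
      mark_loads (node->ops[i]);
}

/* Mark the store permute and all feeding load permutes as elidable when
   they cancel out.  Nothing is marked unless the whole chain qualifies.  */
bool
toy_mark_elidable_reverses (struct toy_node *store, int max_depth)
{
  assert (store != NULL);
  if (store->kind != TOY_STORE || !store->reverse || store->ops[0] == NULL)
    return false;
  if (!all_leaves_reversed (store->ops[0], max_depth))
    return false;
  store->elide = true;
  mark_loads (store->ops[0]);
  return true;
}
```

Keeping the result as flags on the nodes, instead of rewriting statements, is what would let both costing and code generation consult the same decision.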
>>>
>>> I have now changed it to account for the costing. Bootstrapped and
>>> regression tested on aarch64-linux-gnu.
>>>
>>> Is this OK?
>>
>> Same here?
>>
>
> Did you send the right version of the patch, Kugan? It's identical to the one
> you sent before and also has some changes in gcc/fortran/resolve.cc that are
> not specified. Your changelog also seems to have an incorrect format: the
> files containing what you changed aren't mentioned.


Apologies again. Here is the correct version. It also no longer includes the
changes to resolve.cc.

Thanks,
Kugan




>
> Thanks,
> Tamar
>
>>> Thanks,
>>> Kugan
>>>
>>>
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>>       PR tree-optimization/61338
>>>>>
>>>>> gcc/ChangeLog:
>>>>>       (get_vector_perm_operand): New.
>>>>>       (vect_find_reverse_permute_operand): New helper function
>>>>>       to find reverse permutations through element-wise operation chains.
>>>>>       Returns true only if ALL operands have reverse permutations.
>>>>>       (vectorizable_store): Use recursive helper to eliminate redundant
>>>>>       reverse permutations with configurable search depth.
>>>>>
>>>>> gcc/testsuite/ChangeLog:
>>>>>
>>>>>       * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic
>>>>>       reverse permute optimization (simple copy).
>>>>>       * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for
>>>>>       basic pattern.
>>>>> Signed-off-by: Kugan Vivekanandarajah <[email protected]>
>>>>>
>>>>> Bootstrapped and regression tested on aarch64-linux-gnu. Is this OK?
>>>>>
>>>>> Thanks,
>>>>> Kugan
>>>
>>>

0001-tree-optimization-61338-v2-Optimize-redundant-revers.patch
