Re: [RFC][AArch64] Defining lrotm3 optabs for SVE modes for TARGET_SVE2?

Kyrylo Tkachov via Gcc Mon, 21 Oct 2024 01:32:26 -0700


> On 18 Oct 2024, at 19:46, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Hello,
>> 
>> I’ve been optimizing various code sequences relating to vector rotates
>> recently.
>> I ended up proposing we expand the vector-rotate-by-immediate optab rotlm3
>> for the Advanced SIMD (Neon) modes here:
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665635.html
>> This expands to a ROTATE RTL code that can be later combined into more
>> complex instructions like XAR and for certain rotate amounts can be
>> optimized in a single instruction.
>> If they fail to be optimized then a splitter breaks it down into an SHL +
>> USRA pair.
>> 
>> For SVE, because we have predicates in the general case it’s not feasible
>> to detect these rotates at the RTL level, so I was hoping that GIMPLE could
>> do it, and indeed GIMPLE has many places where it can detect rotate idioms:
>> forwprop1, bswap detection, pattern matching in the vectorizer, match.pd
>> for simple cases etc.
>> The vectorizer is probably a good place to do it (rather than asking the
>> other places to deal with VLA types) but I think it would need the target
>> to affirm that it supports SVE vector rotates through the rotlm3 optab,
>> hence my question.
>> 
>> Though some rotate amounts can be implemented with a single instruction
>> (REVB, REVH, REVW), the fallback expansion for TARGET_SVE2 would be a
>> two-instruction LSL+USRA which is better than what we currently emit in the
>> motivating test case:
>> https://godbolt.org/z/o55or8hYv
>> We currently cannot combine the LSL+LSR+ORR sequence because the predicates
>> get in the way during combine (even though the instructions involved are
>> actually unpredicated and the predicate would get dropped later anyway).
>> It would also allow us to keep an RTL-level ROTATE long enough to combine it
>> into the XAR and RAX instructions from TARGET_SVE2_SHA3.
>> 
>> Finally, it would allow us to experiment with more optimal SVE-specific
>> rotate sequences in the future.
>> For example, we could consider emitting high-throughput TBLs for rotates
>> that are a multiple of 8.
>> 
>> I’m suggesting doing this for TARGET_SVE2 as we have the combined USRA
>> instruction there, but I wouldn’t object to doing this for TARGET_SVE.
> 
> I think there are three cases here:
> 
> (1) Using permutes for rotates.  That part on its own could be a
>    target-independent optimisation.  I imagine other targets without
>    native rotate support would benefit.


It seems to me that this is something to be done at (generic) expand time?
Or do you think it’s something the vectorizer should be doing during its
detection of rotates?
I suppose it’s easiest for the vectorizer to generate the IR for it, but
expand time may be a better place to query the target, given that there are
nuances in the selection (see below)...
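
To make the permute idea concrete, here is a minimal scalar sketch in C of
how a rotate by a multiple of 8 maps onto a byte permute. It is only an
illustration under stated assumptions: a fixed 16-byte vector of 32-bit
little-endian lanes, and the helper names (rotl32_permute_indices,
permute_bytes) are hypothetical, not anything in GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: rotate a 32-bit lane left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* Hypothetical helper: build byte indices that rotate every 32-bit
   little-endian lane of a 16-byte vector left by n bits, where n is a
   multiple of 8. Result byte i is taken from source byte idx[i], in the
   style of a table-lookup index vector. */
static void rotl32_permute_indices(unsigned n, uint8_t idx[16])
{
    assert(n % 8 == 0 && n > 0 && n < 32);
    unsigned k = n / 8;                 /* rotate amount in bytes */
    for (unsigned i = 0; i < 16; i++) {
        unsigned lane = i / 4, byte = i % 4;
        /* A left rotate moves byte j to position j + k, so position
           `byte` reads from byte - k (mod 4) within the same lane. */
        idx[i] = (uint8_t)(lane * 4 + (byte + 4 - k) % 4);
    }
}

/* Scalar model of a byte table lookup (all indices are in range here). */
static void permute_bytes(const uint8_t src[16], const uint8_t idx[16],
                          uint8_t dst[16])
{
    for (unsigned i = 0; i < 16; i++)
        dst[i] = src[idx[i]];
}
```

The index vector depends only on the rotate amount and the element size, so
in principle it could be materialised once as a constant, which is part of
why the choice looks like something to settle when querying the target.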


> 
> (2) Encouraging the use of XAR.  I suppose the question here is:
>    is XAR so good that we can consider using it instead of LSL/USRA
>    even when the XOR part isn't needed?  That is, when XAR is available,
>    one way of implementing the rotate optab would be to zero the
>    destination register (hopefully free) and then use XAR itself as
>    the rotate instruction.
> 
>    If that's a win, then defining the optab like that sounds good.
> 
>    If it's not a win, then we could end up being too aggressive about
>    forming XAR in general, since XORs fold with other things too.

I think using XAR to implement rotates is a good idea, but a number of
complexities come to mind (and these seem to apply to Advanced SIMD too) that
make it a less universal solution than I’d hoped:
* Advanced SIMD XAR is only available for SHA3 sub-targets, so we may need a
  separate code path for non-SHA3 codegen.
* Whether the XAR scheme is a win depends on the CPU being targeted. In the
  optimization guides that I’ve checked, the latency of XAR is always good
  (2 cycles) but the throughput varies. On Neoverse V2 it is 1, so it would
  lose out to a vector permute implementation (throughput 4) unless it’s also
  used to combine away an XOR. On Neoverse V3, however, the throughput is the
  maximum of 4, so it would be the preferred way of doing vector rotates in
  general.
* XAR on Advanced SIMD only supports V2DImode operands, whereas the SVE2
  version of XAR supports all element widths. That means for TARGET_SVE2 we
  can use the SVE XAR instruction even for Neon modes, but we’d still need a
  reasonable fallback for !TARGET_SVE2 rotates.
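
For reference, the "XAR against a zeroed register" scheme can be sketched as
a scalar C model of one 64-bit lane. XAR XORs its two inputs and rotates the
result right by an immediate, so a left rotate by n becomes a right rotate by
64 - n. The function names below are made up for illustration and this is
not how the expansion would be written in the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit value right by n bits (0 < n < 64). */
static uint64_t ror64(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x >> n) | (x << (64 - n));
}

/* Scalar model of one 64-bit lane of XAR: XOR, then rotate right by imm. */
static uint64_t xar64(uint64_t a, uint64_t b, unsigned imm)
{
    return ror64(a ^ b, imm);
}

/* Hypothetical expansion of a plain rotate-left by n (0 < n < 64):
   XOR with zero leaves x unchanged, so only the rotate part remains. */
static uint64_t rotl64_via_xar(uint64_t x, unsigned n)
{
    return xar64(x, 0, 64 - n);
}
```

The same model also shows the case where XAR pays for itself: when the
rotate input is already an XOR of two values, xar64(a, b, imm) replaces both
operations and no zeroed register is needed.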

Sorry, the above is a bit more Advanced SIMD-specific than I had originally 
intended.

> 
> (3) Using LSL+USRA for SVE2.
> 
>    IIUC, one part of the combine issue is the old "should we use IOR,
>    or should we use PLUS?", for cases where both are equivalent.
>    Is that right?  I.e. target-independent code normally expands
>    rotates using two shifts and an ior_optab, but for aarch64 it would
>    be better to use add_optab.  And the only reason that add_optab is
>    better is because we then want to combine the addition with one of
>    the shifts.

The IOR vs XOR vs PLUS issue can be fixed by my proposed patch at:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665633.html
The blocking issue after that for SVE is the predicate RTXen that appear
in the pattern and block the simplification. I think the predicates go away
later in the optimization pipeline, but at combine time they are a problem.
But I’m now thinking… the LSL and USRA instructions have a throughput of 2 on
the big cores (around half the max), and given that we need two of them, it
may well be best to use XAR instead.
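
As a reminder of why IOR, XOR and PLUS are interchangeable here, a small
scalar C sketch: the two shifted halves of a rotate never have a set bit in
the same position, so OR-ing, XOR-ing and adding them give the same result,
and the add form is what lets a shift-right-and-accumulate absorb one of the
shifts. The usra32 model and the function names are illustrative
assumptions, restricted to shift amounts below the lane width to stay within
defined C behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate-left written with each of the three equivalent combining ops.
   Valid for 0 < n < 32. */
static uint32_t rotl32_ior(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

static uint32_t rotl32_xor(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) ^ (x >> (32 - n));
}

static uint32_t rotl32_plus(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) + (x >> (32 - n));
}

/* Scalar model of one 32-bit lane of USRA: unsigned shift right, then
   accumulate into acc. Only shifts in 1..31 are modelled here. */
static uint32_t usra32(uint32_t acc, uint32_t x, unsigned shift)
{
    assert(shift > 0 && shift < 32);
    return acc + (x >> shift);
}

/* The two-instruction sequence: a left shift followed by USRA. */
static uint32_t rotl32_shl_usra(uint32_t x, unsigned n)
{
    uint32_t t = x << n;            /* LSL/SHL */
    return usra32(t, x, 32 - n);    /* USRA accumulates the low part */
}
```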


> 
>    If so, then yeah, that does sound too complex to handle in a
>    target-independent way.  But, given (2), it feels like a separate
>    issue from XAR/RAX optimisation.
> 
> I suppose (1) is somewhat in conflict with (2) and (3).  We'd presumably
> still want to use permute-based rotates where possible.  We might even
> want to avoid modelling that case as a rotate rtx in the RTL stream,
> in case we lose the nice permute to some "simplification".  (Not sure
> either way on that last part though.)

Thank you for your thoughts. I’ll experiment a bit more and propose a solution.
Kyrill


> 
> Thanks,
> Richard
