> On 17 Sep 2024, at 10:52, Richard Biener <richard.guent...@gmail.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> On Tue, Sep 17, 2024 at 9:57 AM Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> 
>> 
>>> On 16 Sep 2024, at 16:32, Richard Sandiford <richard.sandif...@arm.com> 
>>> wrote:
>>> 
>>> 
>>> 
>>> "Pengxuan Zheng (QUIC)" <quic_pzh...@quicinc.com> writes:
>>>>> On Thu, Sep 12, 2024 at 2:53 AM Pengxuan Zheng
>>>>> <quic_pzh...@quicinc.com> wrote:
>>>>>> 
>>>>>> SVE's INDEX instruction can be used to populate vectors by values
>>>>>> starting from "base" and incremented by "step" for each subsequent
>>>>>> value. We can take advantage of it to generate vector constants if
>>>>>> TARGET_SVE is available and the base and step values are within [-16, 
>>>>>> 15].
>>>>> 
>>>>> Are there multiplication by or addition of scalar immediate instructions 
>>>>> to
>>>>> enhance this with two-instruction sequences?
>>>> 
>>>> No, Richard, I can't think of any equivalent two-instruction sequences.
>>> 
>>> There are some.  E.g.:
>>> 
>>>    { 16, 17, 18, 19, ... }
>>> 
>>> could be:
>>> 
>>>       index   z0.b, #0, #1
>>>       add     z0.b, z0.b, #16
>>> 
>>> or, alternatively:
>>> 
>>>       mov     w0, #16
>>>       index   z0.b, w0, #1
> 
> I guess even a step in [16, 31] could be handled with INDEX using half the
> step and then adding the result to itself (multiplying by immediate #2), even
> if there's no direct vector-by-immediate instruction available.  Likewise,
> of course, some { A0 + n * B1 + n * B2, ... } can be handled by adding
> two INDEX compute results.

There are some such by-immediate instructions in SVE that we could try, but
each one would need to be evaluated carefully, as their latencies and
throughputs may vary across cores.
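
As one hedged illustration (a sketch, not from the thread: it assumes SVE's
unpredicated MUL-by-immediate form, which takes a signed 8-bit immediate), a
step outside the INDEX immediate range of [-16, 15] could be synthesized in
two instructions, e.g. { 0, 48, 96, 144 } for .s elements:

        index   z0.s, #0, #3      // z0.s = { 0, 3, 6, 9 }
        mul     z0.s, z0.s, #16   // z0.s = { 0, 48, 96, 144 }

The same shape covers the self-addition idea above: INDEX with half the (even)
base and step, then an ADD of the register to itself doubles both.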


> 
>>> But these cases are less obviously a win, so I think it's ok to handle
>>> single instructions only for now.
>> 
>> (Not related to this patch, this work is great, thanks Pengxuan!)
>> Looking at some SWOGs, such as the one for Neoverse V2, it looks like the
>> first sequence is preferable.
>> On that core the INDEX-immediates-only operation has latency 4 and
>> throughput 2, and the SVE ADD is as cheap as SIMD operations can be on that
>> core.
>> But in the second sequence the INDEX-with-register-operand has latency 7 and
>> throughput 1, as the core seems to treat it as a GP <-> SIMD transfer of
>> some sort.
> 
> So what's the latency/throughput of a vector load from the constant pool
> (can we even have an "SVE" constant pool?  I assume
> entries would have to be of the architecturally largest vector size?),
> assuming it's in L1 (where it would eventually occupy quite some
> space).

In this thread we’re talking about implementing fixed-length 128-bit “Neon”/GCC
vector extension operations with SVE instructions rather than VLA SVE constants,
as SVE has some useful instructions that, applied to the bottom 128 bits, can do
things that plain Neon can’t. So the constant-pool alternative is a simple Neon
address generation + load.
I haven’t thought through the SVE constant creation story yet.
From what I can tell, the vector load into a Neon register has a latency of 6 or
7 cycles (throughput 3), and the ADRP for address generation is very fast
(latency/throughput: 1/4).
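
For reference, the Neon constant-pool load being compared is the usual
ADRP + LDR literal-pool sequence (a sketch; .LC0 stands for a hypothetical
pool entry emitted by the compiler):

        adrp    x0, .LC0
        ldr     q0, [x0, #:lo12:.LC0]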
Thanks,
Kyrill

> 
> Richard.
> 
>> We may encounter a situation in the future where we’ll want to optimize the 
>> second sequence (if it comes from intrinsics code for example) into the 
>> first.
>> Thanks,
>> Kyrill
>> 
>> 
>>> 
>>> The patch is ok for trunk, thanks, but:
>>> 
>>>>>> @@ -22991,7 +22991,7 @@ aarch64_simd_valid_immediate (rtx op,
>>>>> simd_immediate_info *info,
>>>>>>  if (CONST_VECTOR_P (op)
>>>>>>      && CONST_VECTOR_DUPLICATE_P (op))
>>>>>>    n_elts = CONST_VECTOR_NPATTERNS (op);
>>>>>> -  else if ((vec_flags & VEC_SVE_DATA)
>>>>>> +  else if (which == AARCH64_CHECK_MOV && TARGET_SVE
>>>>>>          && const_vec_series_p (op, &base, &step))
>>> 
>>> ...the convention is to have one && condition per line if the whole
>>> expression doesn't fit on a single line:
>>> 
>>> else if (which == AARCH64_CHECK_MOV
>>>          && TARGET_SVE
>>>          && const_vec_series_p (op, &base, &step))
>>> 
>>> Richard
>> 
