Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

Kyrylo Tkachov Fri, 25 Oct 2024 05:48:53 -0700


> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> Thank you for the suggestions! I’m trying them out now.
>> 
>>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>>> +    return NULL_RTX;
>>>> +  machine_mode qimode;
>>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>>> +    return NULL_RTX;
>>>> +
>>>> +  vec_perm_builder builder;
>>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>>> 
>>> simpler as GET_MODE_UNIT_SIZE
>>> 
>>>> +  unsigned total_units;
>>>> +  /* TODO: Handle VLA vector rotates?  */
>>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>>> +    return NULL_RTX;
>>> 
>>> Yeah.  I think we can do that by changing:
>>> 
>>>> +  builder.new_vector (total_units, 1, total_units);
>>> 
>>> to:
>>> 
>>> builder.new_vector (total_units, 3, units);
>> 
>> I think units here is the size in units of the fixed-width component of the 
>> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?
> 
> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
> being the same as for your code.  So for V4SI and VNx4SI we'd push
> 12 elements total, as 4 (nunits) "patterns" of 3 elements each.
> The first argument (total_units) is just GET_MODE_SIZE (mode)
> in all its poly_int glory.


Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how can 
12 indices be enough to describe the permute?
With this scheme we do end up pushing 12 elements, in the order: 
2,3,0,1,6,7,4,5,10,11,8,9 .
In the final RTX emitted in the instruction stream this seems to end up as:
        (const_vector:V16QI [
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 8 [0x8])
                (const_int 9 [0x9]) repeated x2
                (const_int 14 [0xe])
                (const_int 7 [0x7])
                (const_int 0 [0])
            ])

So the first 12 elements are indeed correct, but the last 4 elements are not.


> 
> For V4HI and VNx4HI it would be (..., 3, 2), so 6 elements total.
> 
>> What is the recommended API for getting that number out of the poly_uint64 
>> mode size?. Is it just accessing coeffs[0]?
>> 
>>> 
>>> unconditionally and making the outer loop below iterate exactly
>>> three times (i.e. to nunits * 3).  It's ok if we generate more
>>> indices than needed.
>>> 
>>>> +  int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>>>> +  for (unsigned j = 0; j < total_units; j += nunits)
>>>> +    for (unsigned i = 0; i < nunits; i++)
>>>> +      {
>>>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>>>> + if (BYTES_BIG_ENDIAN)
>>>> +   idx = total_units - idx - 1;
>>> 
>>> I think the endian adjustment should be local to the inner loop/vector
>>> element only.  Since this would mean undoing the "nunits - " adjustment
>>> above, how about something like:
>>> 
>>> unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
>>> unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
>>> ...
>>>     builder.quick_push ((rot_to_perm + i) % nunits + j);
>>> 
>>> or whatever variation you prefer.
>>> 
>>> Hope I've got that right...
>> 
>> 
>> Hmm, I’m getting some test failures and wrong indices when I try this.
>> I think I can work out the indices and the loops for them but I’d like to 
>> work
>> through an example. So say we are rotating a V4SImode vector by 16 (a REV32 
>> instruction).
>> The indices pushed into the byte permute vector with my original patch are:
>> [2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
>> What sequence do we want to push for V4SImode now that we have 3 patterns in 
>> vector_builder?
>> Is it repeating the above 3 times or is it interleaving each SImode entry 
>> somehow?
> 
> I think the calculation should be the same as in your original code,
> adding "i" to each element.  It's just that we limit the outer loop
> to 3 iterations instead of total_units / nunits.
> 
> So the end result should be that the pushed elements are the same
> as in your original patch, just longer or shorter, depending on
> whether we're pushing more elements (e.g. V2DI) or fewer (e.g. V4SI).
> 
> In general, the VLA constant scheme works by creating the leading
> elements as normal and then describing how to extend that sequence
> using the (npatterns, nelts_per_pattern) pair.  The first
> npatterns * nelts_per_pattern elements are always explicitly pushed,
> but are automatically extended or truncated as necessary when creating
> fixed-length tree and rtl constants.
> 
> Thanks,
> Richard

Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

Reply via email to