Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

Richard Sandiford Fri, 25 Oct 2024 04:47:13 -0700

Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Thank you for the suggestions! I’m trying them out now.
>
>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>> +    return NULL_RTX;
>>> +  machine_mode qimode;
>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>> +    return NULL_RTX;
>>> +
>>> +  vec_perm_builder builder;
>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>> 
>> simpler as GET_MODE_UNIT_SIZE
>> 
>>> +  unsigned total_units;
>>> +  /* TODO: Handle VLA vector rotates?  */
>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>> +    return NULL_RTX;
>> 
>> Yeah.  I think we can do that by changing:
>> 
>>> +  builder.new_vector (total_units, 1, total_units);
>> 
>> to:
>> 
>>  builder.new_vector (total_units, 3, units);
>
> I think units here is the size in units of the fixed-width component of the 
> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?


Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
being the same as for your code.  So for V4SI and VNx4SI we'd push
12 elements total, as 4 (nunits) "patterns" of 3 elements each.
The first argument (total_units) is just GET_MODE_SIZE (mode)
in all its poly_int glory.

For V4HI and VNx4HI it would be (..., 3, 2), so 6 elements total.

> What is the recommended API for getting that number out of the poly_uint64 
> mode size?. Is it just accessing coeffs[0]?
>
>> 
>> unconditionally and making the outer loop below iterate exactly
>> three times (i.e. to nunits * 3).  It's ok if we generate more
>> indices than needed.
>> 
>>> +  int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT;
>>> +  for (unsigned j = 0; j < total_units; j += nunits)
>>> +    for (unsigned i = 0; i < nunits; i++)
>>> +      {
>>> + unsigned idx = (rot_to_perm + i) % nunits + j;
>>> + if (BYTES_BIG_ENDIAN)
>>> +   idx = total_units - idx - 1;
>> 
>> I think the endian adjustment should be local to the inner loop/vector
>> element only.  Since this would mean undoing the "nunits - " adjustment
>> above, how about something like:
>> 
>>  unsigned rot_bytes = rotamnt / BITS_PER_UNIT;
>>  unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes;
>>  ...
>>      builder.quick_push ((rot_to_perm + i) % nunits + j);
>> 
>> or whatever variation you prefer.
>> 
>> Hope I've got that right...
>
>
> Hmm, I’m getting some test failures and wrong indices when I try this.
> I think I can work out the indices and the loops for them but I’d like to work
> through an example. So say we are rotating a V4SImode vector by 16 (a REV32 
> instruction).
> The indices pushed into the byte permute vector with my original patch are:
> [2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d]
> What sequence do we want to push for V4SImode now that we have 3 patterns in 
> vector_builder?
> Is it repeating the above 3 times or is it interleaving each SImode entry 
> somehow?

I think the calculation should be the same as in your original code,
adding "i" to each element.  It's just that we limit the outer loop
to 3 iterations instead of total_units / nunits.

So the end result should be that the pushed elements are the same
as in your original patch, just longer or shorter, depending on
whether we're pushing more elements (e.g. V2DI) or fewer (e.g. V4SI).

In general, the VLA constant scheme works by creating the leading
elements as normal and then describing how to extend that sequence
using the (npatterns, nelts_per_pattern) pair.  The first
npatterns * nelts_per_pattern elements are always explicitly pushed,
but are automatically extended or truncated as necessary when creating
fixed-length tree and rtl constants.

Thanks,
Richard

Re: [PATCH 4/6] aarch64: Optimize vector rotates into REV* instructions where possible

Reply via email to