Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> 
>> wrote:
>> 
>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>> Thank you for the suggestions! I’m trying them out now.
>>> 
>>>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>>>> +    return NULL_RTX;
>>>>> +  machine_mode qimode;
>>>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>>>> +    return NULL_RTX;
>>>>> +
>>>>> +  vec_perm_builder builder;
>>>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>>>> 
>>>> simpler as GET_MODE_UNIT_SIZE
>>>> 
>>>>> +  unsigned total_units;
>>>>> +  /* TODO: Handle VLA vector rotates?  */
>>>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>>>> +    return NULL_RTX;
>>>> 
>>>> Yeah.  I think we can do that by changing:
>>>> 
>>>>> +  builder.new_vector (total_units, 1, total_units);
>>>> 
>>>> to:
>>>> 
>>>> builder.new_vector (total_units, 3, units);
>>> 
>>> I think units here is the size in units of the fixed-width component of the 
>>> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?
>> 
>> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
>> being the same as for your code.  So for V4SI and VNx4SI we'd push
>> 12 elements total, as 4 (nunits) "patterns" of 3 elements each.
>> The first argument (total_units) is just GET_MODE_SIZE (mode)
>> in all its poly_int glory.
>
> Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how 
> can 12 indices be enough to describe the permute?
> With this scheme we do end up pushing 12 elements, in the order: 
> 2,3,0,1,6,7,4,5,10,11,8,9 .
> In the final RTX emitted in the instruction stream this seems to end up as:
>         (const_vector:V16QI [
>                 (const_int 2 [0x2])
>                 (const_int 3 [0x3])
>                 (const_int 0 [0])
>                 (const_int 1 [0x1])
>                 (const_int 6 [0x6])
>                 (const_int 7 [0x7])
>                 (const_int 4 [0x4])
>                 (const_int 5 [0x5])
>                 (const_int 10 [0xa])
>                 (const_int 11 [0xb])
>                 (const_int 8 [0x8])
>                 (const_int 9 [0x9]) repeated x2
>                 (const_int 14 [0xe])
>                 (const_int 7 [0x7])
>                 (const_int 0 [0])
>             ])
>
> So the first 12 elements are indeed correct, but the last 4 elements are not.

Gah, sorry, I got the arguments the wrong way around.  It should be:

   builder.new_vector (GET_MODE_SIZE (mode), nunits, 3);

(4 patterns, 3 elements per pattern)

Thanks,
Richard

Reply via email to