> On 25 Oct 2024, at 15:25, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> 
>>> wrote:
>>> 
>>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> Thank you for the suggestions! I’m trying them out now.
>>>> 
>>>>>> +  if (rotamnt % BITS_PER_UNIT != 0)
>>>>>> +    return NULL_RTX;
>>>>>> +  machine_mode qimode;
>>>>>> +  if (!qimode_for_vec_perm (mode).exists (&qimode))
>>>>>> +    return NULL_RTX;
>>>>>> +
>>>>>> +  vec_perm_builder builder;
>>>>>> +  unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode));
>>>>> 
>>>>> simpler as GET_MODE_UNIT_SIZE
>>>>> 
>>>>>> +  unsigned total_units;
>>>>>> +  /* TODO: Handle VLA vector rotates?  */
>>>>>> +  if (!GET_MODE_SIZE (mode).is_constant (&total_units))
>>>>>> +    return NULL_RTX;
>>>>> 
>>>>> Yeah.  I think we can do that by changing:
>>>>> 
>>>>>> +  builder.new_vector (total_units, 1, total_units);
>>>>> 
>>>>> to:
>>>>> 
>>>>> builder.new_vector (total_units, 3, units);
>>>> 
>>>> I think units here is the size in units of the fixed-width component of 
>>>> the mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI?
>>> 
>>> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits"
>>> being the same as for your code.  So for V4SI and VNx4SI we'd push
>>> 12 elements total, as 4 (nunits) "patterns" of 3 elements each.
>>> The first argument (total_units) is just GET_MODE_SIZE (mode)
>>> in all its poly_int glory.
>> 
>> Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how 
>> can 12 indices be enough to describe the permute?
>> With this scheme we do end up pushing 12 elements, in the order: 
>> 2,3,0,1,6,7,4,5,10,11,8,9 .
>> In the final RTX emitted in the instruction stream this seems to end up as:
>>        (const_vector:V16QI [
>>                (const_int 2 [0x2])
>>                (const_int 3 [0x3])
>>                (const_int 0 [0])
>>                (const_int 1 [0x1])
>>                (const_int 6 [0x6])
>>                (const_int 7 [0x7])
>>                (const_int 4 [0x4])
>>                (const_int 5 [0x5])
>>                (const_int 10 [0xa])
>>                (const_int 11 [0xb])
>>                (const_int 8 [0x8])
>>                (const_int 9 [0x9]) repeated x2
>>                (const_int 14 [0xe])
>>                (const_int 7 [0x7])
>>                (const_int 0 [0])
>>            ])
>> 
>> So the first 12 elements are indeed correct, but the last 4 elements are not.
> 
> Gah, sorry, I got the arguments the wrong way around.  It should be:
> 
>   builder.new_vector (GET_MODE_SIZE (mode), nunits, 3);
> 
> (4 patterns, 3 elements per pattern)
> 

Thanks! That works.
I’ve resubmitted a fixed patch with 
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/666608.html (along with 
other updates in the series)
Kyrill

> Thanks,
> Richard

Reply via email to