> On 25 Oct 2024, at 15:25, Richard Sandiford <richard.sandif...@arm.com> wrote: > > Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> >>> wrote: >>> >>> Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>>> Thank you for the suggestions! I’m trying them out now. >>>> >>>>>> + if (rotamnt % BITS_PER_UNIT != 0) >>>>>> + return NULL_RTX; >>>>>> + machine_mode qimode; >>>>>> + if (!qimode_for_vec_perm (mode).exists (&qimode)) >>>>>> + return NULL_RTX; >>>>>> + >>>>>> + vec_perm_builder builder; >>>>>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode)); >>>>> >>>>> simpler as GET_MODE_UNIT_SIZE >>>>> >>>>>> + unsigned total_units; >>>>>> + /* TODO: Handle VLA vector rotates? */ >>>>>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units)) >>>>>> + return NULL_RTX; >>>>> >>>>> Yeah. I think we can do that by changing: >>>>> >>>>>> + builder.new_vector (total_units, 1, total_units); >>>>> >>>>> to: >>>>> >>>>> builder.new_vector (total_units, 3, units); >>>> >>>> I think units here is the size in units of the fixed-width component of >>>> the mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI? >>> >>> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits" >>> being the same as for your code. So for V4SI and VNx4SI we'd push >>> 12 elements total, as 4 (nunits) "patterns" of 3 elements each. >>> The first argument (total_units) is just GET_MODE_SIZE (mode) >>> in all its poly_int glory. >> >> Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how >> can 12 indices be enough to describe the permute? >> With this scheme we do end up pushing 12 elements, in the order: >> 2,3,0,1,6,7,4,5,10,11,8,9 . >> In the final RTX emitted in the instruction stream this seems to end up as: >> (const_vector:V16QI [ >> (const_int 2 [0x2]) >> (const_int 3 [0x3]) >> (const_int 0 [0]) >> (const_int 1 [0x1]) >> (const_int 6 [0x6]) >> (const_int 7 [0x7]) >> (const_int 4 [0x4]) >> (const_int 5 [0x5]) >> (const_int 10 [0xa]) >> (const_int 11 [0xb]) >> (const_int 8 [0x8]) >> (const_int 9 [0x9]) repeated x2 >> (const_int 14 [0xe]) >> (const_int 7 [0x7]) >> (const_int 0 [0]) >> ]) >> >> So the first 12 elements are indeed correct, but the last 4 elements are not. > > Gah, sorry, I got the arguments the wrong way around. It should be: > > builder.new_vector (GET_MODE_SIZE (mode), nunits, 3); > > (4 patterns, 3 elements per pattern) >
Thanks! That works. I’ve resubmitted a fixed patch with https://gcc.gnu.org/pipermail/gcc-patches/2024-October/666608.html (along with other updates in the series) Kyrill > Thanks, > Richard