Kyrylo Tkachov <ktkac...@nvidia.com> writes: >> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> >> wrote: >> >> Kyrylo Tkachov <ktkac...@nvidia.com> writes: >>> Thank you for the suggestions! I’m trying them out now. >>> >>>>> + if (rotamnt % BITS_PER_UNIT != 0) >>>>> + return NULL_RTX; >>>>> + machine_mode qimode; >>>>> + if (!qimode_for_vec_perm (mode).exists (&qimode)) >>>>> + return NULL_RTX; >>>>> + >>>>> + vec_perm_builder builder; >>>>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode)); >>>> >>>> simpler as GET_MODE_UNIT_SIZE >>>> >>>>> + unsigned total_units; >>>>> + /* TODO: Handle VLA vector rotates? */ >>>>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units)) >>>>> + return NULL_RTX; >>>> >>>> Yeah. I think we can do that by changing: >>>> >>>>> + builder.new_vector (total_units, 1, total_units); >>>> >>>> to: >>>> >>>> builder.new_vector (total_units, 3, units); >>> >>> I think units here is the size in units of the fixed-width component of the >>> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI? >> >> Ah, no, sorry, I meant "nunits" rather than "units", with "nunits" >> being the same as for your code. So for V4SI and VNx4SI we'd push >> 12 elements total, as 4 (nunits) "patterns" of 3 elements each. >> The first argument (total_units) is just GET_MODE_SIZE (mode) >> in all its poly_int glory. > > Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how > can 12 indices be enough to describe the permute? > With this scheme we do end up pushing 12 elements, in the order: > 2,3,0,1,6,7,4,5,10,11,8,9 . > In the final RTX emitted in the instruction stream this seems to end up as: > (const_vector:V16QI [ > (const_int 2 [0x2]) > (const_int 3 [0x3]) > (const_int 0 [0]) > (const_int 1 [0x1]) > (const_int 6 [0x6]) > (const_int 7 [0x7]) > (const_int 4 [0x4]) > (const_int 5 [0x5]) > (const_int 10 [0xa]) > (const_int 11 [0xb]) > (const_int 8 [0x8]) > (const_int 9 [0x9]) repeated x2 > (const_int 14 [0xe]) > (const_int 7 [0x7]) > (const_int 0 [0]) > ]) > > So the first 12 elements are indeed correct, but the last 4 elements are not.
Gah, sorry, I got the arguments the wrong way around. It should be: builder.new_vector (GET_MODE_SIZE (mode), nunits, 3); (4 patterns, 3 elements per pattern) Thanks, Richard