> On 25 Oct 2024, at 13:46, Richard Sandiford <richard.sandif...@arm.com> wrote: > > Kyrylo Tkachov <ktkac...@nvidia.com> writes: >> Thank you for the suggestions! I’m trying them out now. >> >>>> + if (rotamnt % BITS_PER_UNIT != 0) >>>> + return NULL_RTX; >>>> + machine_mode qimode; >>>> + if (!qimode_for_vec_perm (mode).exists (&qimode)) >>>> + return NULL_RTX; >>>> + >>>> + vec_perm_builder builder; >>>> + unsigned nunits = GET_MODE_SIZE (GET_MODE_INNER (mode)); >>> >>> simpler as GET_MODE_UNIT_SIZE >>> >>>> + unsigned total_units; >>>> + /* TODO: Handle VLA vector rotates? */ >>>> + if (!GET_MODE_SIZE (mode).is_constant (&total_units)) >>>> + return NULL_RTX; >>> >>> Yeah. I think we can do that by changing: >>> >>>> + builder.new_vector (total_units, 1, total_units); >>> >>> to: >>> >>> builder.new_vector (total_units, 3, units); >> >> I think units here is the size in units of the fixed-width component of the >> mode? So e.g. 16 for V4SI and VNx4SI but 8 for V4HI and VN4HI? > > Ah, no, sorry, I meant "nunits" rather than "units", with "nunits" > being the same as for your code. So for V4SI and VNx4SI we'd push > 12 elements total, as 4 (nunits) "patterns" of 3 elements each. > The first argument (total_units) is just GET_MODE_SIZE (mode) > in all its poly_int glory.
Hmm, I’m afraid I’m lost again. For V4SI we have a vector of 16 bytes, how can 12 indices be enough to describe the permute? With this scheme we do end up pushing 12 elements, in the order: 2,3,0,1,6,7,4,5,10,11,8,9 . In the final RTX emitted in the instruction stream this seems to end up as: (const_vector:V16QI [ (const_int 2 [0x2]) (const_int 3 [0x3]) (const_int 0 [0]) (const_int 1 [0x1]) (const_int 6 [0x6]) (const_int 7 [0x7]) (const_int 4 [0x4]) (const_int 5 [0x5]) (const_int 10 [0xa]) (const_int 11 [0xb]) (const_int 8 [0x8]) (const_int 9 [0x9]) repeated x2 (const_int 14 [0xe]) (const_int 7 [0x7]) (const_int 0 [0]) ]) So the first 12 elements are indeed correct, but the last 4 elements are not. > > For V4HI and VNx4HI it would be (..., 3, 2), so 6 elements total. > >> What is the recommended API for getting that number out of the poly_uint64 >> mode size?. Is it just accessing coeffs[0]? >> >>> >>> unconditionally and making the outer loop below iterate exactly >>> three times (i.e. to nunits * 3). It's ok if we generate more >>> indices than needed. >>> >>>> + int rot_to_perm = nunits - rotamnt / BITS_PER_UNIT; >>>> + for (unsigned j = 0; j < total_units; j += nunits) >>>> + for (unsigned i = 0; i < nunits; i++) >>>> + { >>>> + unsigned idx = (rot_to_perm + i) % nunits + j; >>>> + if (BYTES_BIG_ENDIAN) >>>> + idx = total_units - idx - 1; >>> >>> I think the endian adjustment should be local to the inner loop/vector >>> element only. Since this would mean undoing the "nunits - " adjustment >>> above, how about something like: >>> >>> unsigned rot_bytes = rotamnt / BITS_PER_UNIT; >>> unsigned rot_to_perm = BYTES_BIG_ENDIAN ? rot_bytes : nunits - rot_bytes; >>> ... >>> builder.quick_push ((rot_to_perm + i) % nunits + j); >>> >>> or whatever variation you prefer. >>> >>> Hope I've got that right... >> >> >> Hmm, I’m getting some test failures and wrong indices when I try this. >> I think I can work out the indices and the loops for them but I’d like to >> work >> through an example. So say we are rotating a V4SImode vector by 16 (a REV32 >> instruction). >> The indices pushed into the byte permute vector with my original patch are: >> [2,3,0,1, 6,7,4,5, a,b,8,9, e,f,c,d] >> What sequence do we want to push for V4SImode now that we have 3 patterns in >> vector_builder? >> Is it repeating the above 3 times or is it interleaving each SImode entry >> somehow? > > I think the calculation should be the same as in your original code, > adding "i" to each element. It's just that we limit the outer loop > to 3 iterations instead of total_units / nunits. > > So the end result should be that the pushed elements are the same > as in your original patch, just longer or shorter, depending on > whether we're pushing more elements (e.g. V2DI) or fewer (e.g. V4SI). > > In general, the VLA constant scheme works by creating the leading > elements as normal and then describing how to extend that sequence > using the (npatterns, nelts_per_pattern) pair. The first > npatterns * nelts_per_pattern elements are always explicitly pushed, > but are automatically extended or truncated as necessary when creating > fixed-length tree and rtl constants. > > Thanks, > Richard