Andrew Pinski <pins...@gmail.com> writes: > On Fri, Dec 27, 2024 at 3:19 AM Robin Dapp <rdapp....@gmail.com> wrote: >> >> Thanks for the helpful suggestion. The attached v2 patch tries to implement >> it. >> >> It was bootstrapped and regtested on x86, aarch64 and Power 10. >> Also regtested on rv64gcv_zvl512b. >> >> Those are all little-endian (sub)targets, though, so it would certainly be >> helpful if you could run additional aarch64 big-endian tests. I haven't seen >> any scan-assembler failures so far. >> >> Regards >> Robin >> >> >> [PATCH v2] varasm: Use native_encode_rtx for constant vectors. >> >> optimize_constant_pool hashes vector masks by native_encode_rtx and >> merges identically hashed values in the constant pool. Afterwards the >> optimized values are written in output_constant_pool_2. >> >> However, native_encode_rtx and output_constant_pool_2 disagree in their >> encoding of vector masks: native_encode_rtx does not pad with zeroes >> while output_constant_pool_2 implicitly does. >> >> In RVV's shuffle-evenodd-run.c there are two masks >> (a) "0101" for V4BI >> (b) "01010101" for V8BI and >> that have the same representation/encoding ("1010101") in native_encode_rtx. >> output_constant_pool_2 uses "101" for (a) and "1010101" for (b). >> >> Now, optimize_constant_pool might happen to merge both masks using >> (a) as representative. Then, output_constant_pool_2 will output "1010" >> which is only valid for the second mask as the implicit zero padding >> doesn't agree with (b). >> (b)'s "1010101" works for both masks as a V4BI load will ignore the last four >> padding bits. >> >> This patch makes output_constant_pool_2 use native_encode_rtx so both >> functions will agree on an encoding and output the correct constant. >> >> gcc/ChangeLog: >> >> * varasm.cc (output_constant_pool_2): Use native_encode_rtx for >> building the memory image of a const vector mask. >> --- >> gcc/varasm.cc | 40 +++++++++++++--------------------------- >> 1 file changed, 13 insertions(+), 27 deletions(-) >> >> diff --git a/gcc/varasm.cc b/gcc/varasm.cc >> index 0068ec2ce4d..9e86efcd178 100644 >> --- a/gcc/varasm.cc >> +++ b/gcc/varasm.cc >> @@ -4301,34 +4301,20 @@ output_constant_pool_2 (fixed_size_mode mode, rtx x, >> unsigned int align) >> { >> gcc_assert (GET_CODE (x) == CONST_VECTOR); >> >> - /* Pick the smallest integer mode that contains at least one >> - whole element. Often this is byte_mode and contains more >> - than one element. */ >> - unsigned int nelts = GET_MODE_NUNITS (mode); >> - unsigned int elt_bits = GET_MODE_PRECISION (mode) / nelts; >> - unsigned int int_bits = MAX (elt_bits, BITS_PER_UNIT); >> - scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require >> (); >> - unsigned int mask = GET_MODE_MASK (GET_MODE_INNER (mode)); >> - >> - /* We allow GET_MODE_PRECISION (mode) <= GET_MODE_BITSIZE (mode) but >> - only properly handle cases where the difference is less than a >> - byte. */ >> - gcc_assert (GET_MODE_BITSIZE (mode) - GET_MODE_PRECISION (mode) < >> - BITS_PER_UNIT); >> - >> - /* Build the constant up one integer at a time. */ >> - unsigned int elts_per_int = int_bits / elt_bits; >> - for (unsigned int i = 0; i < nelts; i += elts_per_int) >> + auto_vec<target_unit, 128> buffer; > > Let me ask the stupid question, where did 128 magic # come from? I am > suspecting it is the most common vector size in bytes but I am not > 100% sure. Maybe a comment would be a good idea. Or should it be 512 > for the "max" known vector size on x86_64? (256 exists for AVX, 512 > exists but AVX512 is not as common as AVX).
I don't think we need to worry too much about the exact number. It's supposed to be a good finger-in-the-air compromise between avoiding a heap allocation in common cases and not being so big that it impacts cache utilisation. 128 is used elsewhere and seems like a reasonable value for a char array. >> + buffer.truncate (0); > I don't think you need a truncate here since the vect size is originally 0. Yeah, agreed. > > Thanks, > Andrew > >> + buffer.reserve (GET_MODE_SIZE (mode)); >> + >> + bool ok = native_encode_rtx (mode, x, buffer, 0, GET_MODE_SIZE >> (mode)); >> + gcc_assert (ok); >> + >> + for (unsigned i = 0; >> + i < GET_MODE_SIZE (mode) / GET_MODE_SIZE (byte_mode); The divisor is by definition 1. I think dropping it would make the loop more obviously correct, since the same assumption is implicit in the loop body. >> + i++) >> { >> - unsigned HOST_WIDE_INT value = 0; >> - unsigned int limit = MIN (nelts - i, elts_per_int); >> - for (unsigned int j = 0; j < limit; ++j) >> - { >> - auto elt = INTVAL (CONST_VECTOR_ELT (x, i + j)); >> - value |= (elt & mask) << (j * elt_bits); >> - } >> - output_constant_pool_2 (int_mode, gen_int_mode (value, int_mode), >> - i != 0 ? MIN (align, int_bits) : align); >> + unsigned HOST_WIDE_INT value = buffer[i]; >> + output_constant_pool_2 (byte_mode, gen_int_mode (value, >> byte_mode), >> + i == 0 ? 0 : 1); This should be i == 0 ? align : 1 OK with those three changes, thanks. I tested on aarch64 and similarly saw no regressions. I don't think we need to test big-endian aarch64 specifically. Richard >> } >> break; >> } >> -- >> 2.47.1 >>