On 12/30/24 8:16 AM, Richard Sandiford wrote:
Andrew Pinski <pins...@gmail.com> writes:
On Fri, Dec 27, 2024 at 3:19 AM Robin Dapp <rdapp....@gmail.com> wrote:

Thanks for the helpful suggestion.  The attached v2 patch tries to implement
it.

It was bootstrapped and regtested on x86, aarch64 and Power 10.
Also regtested on rv64gcv_zvl512b.

Those are all little-endian (sub)targets, though, so it would certainly be
helpful if you could run additional aarch64 big-endian tests.  I haven't seen
any scan-assembler failures so far.

Regards
  Robin


[PATCH v2] varasm: Use native_encode_rtx for constant vectors.

optimize_constant_pool hashes vector masks by native_encode_rtx and
merges identically hashed values in the constant pool.  Afterwards the
optimized values are written in output_constant_pool_2.

However, native_encode_rtx and output_constant_pool_2 disagree in their
encoding of vector masks:  native_encode_rtx does not pad with zeroes
while output_constant_pool_2 implicitly does.

In RVV's shuffle-evenodd-run.c there are two masks
   (a) "0101" for V4BI and
   (b) "01010101" for V8BI
that have the same representation/encoding ("1010101") in native_encode_rtx.
output_constant_pool_2 uses "101" for (a) and "1010101" for (b).

Now, optimize_constant_pool might happen to merge both masks using
(a) as representative.  Then, output_constant_pool_2 will output "1010",
which is only valid for the first mask as the implicit zero padding
doesn't agree with (b).
(b)'s "1010101" works for both masks as a V4BI load will ignore the last four
padding bits.

This patch makes output_constant_pool_2 use native_encode_rtx so both
functions will agree on an encoding and output the correct constant.

gcc/ChangeLog:

         * varasm.cc (output_constant_pool_2): Use native_encode_rtx for
         building the memory image of a const vector mask.
---
  gcc/varasm.cc | 40 +++++++++++++---------------------------
  1 file changed, 13 insertions(+), 27 deletions(-)

diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 0068ec2ce4d..9e86efcd178 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -4301,34 +4301,20 @@ output_constant_pool_2 (fixed_size_mode mode, rtx x, unsigned int align)
        {
         gcc_assert (GET_CODE (x) == CONST_VECTOR);

-       /* Pick the smallest integer mode that contains at least one
-          whole element.  Often this is byte_mode and contains more
-          than one element.  */
-       unsigned int nelts = GET_MODE_NUNITS (mode);
-       unsigned int elt_bits = GET_MODE_PRECISION (mode) / nelts;
-       unsigned int int_bits = MAX (elt_bits, BITS_PER_UNIT);
-       scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require ();
-       unsigned int mask = GET_MODE_MASK (GET_MODE_INNER (mode));
-
-       /* We allow GET_MODE_PRECISION (mode) <= GET_MODE_BITSIZE (mode) but
-          only properly handle cases where the difference is less than a
-          byte.  */
-       gcc_assert (GET_MODE_BITSIZE (mode) - GET_MODE_PRECISION (mode) <
-                   BITS_PER_UNIT);
-
-       /* Build the constant up one integer at a time.  */
-       unsigned int elts_per_int = int_bits / elt_bits;
-       for (unsigned int i = 0; i < nelts; i += elts_per_int)
+       auto_vec<target_unit, 128> buffer;

Let me ask the stupid question: where did the magic number 128 come from?
I suspect it is the most common vector size in bytes, but I am not
100% sure.  Maybe a comment would be a good idea.  Or should it be 512
for the "max" known vector size on x86_64?  (256 exists for AVX, 512
exists but AVX512 is not as common as AVX.)

I don't think we need to worry too much about the exact number.
It's supposed to be a good finger-in-the-air compromise between
avoiding a heap allocation in common cases and not being so
big that it impacts cache utilisation.  128 is used elsewhere and
seems like a reasonable value for a char array.

+       buffer.truncate (0);
I don't think you need a truncate here since the vector's size is originally 0.

Yeah, agreed.


Thanks,
Andrew

+       buffer.reserve (GET_MODE_SIZE (mode));
+
+       bool ok = native_encode_rtx (mode, x, buffer, 0, GET_MODE_SIZE (mode));
+       gcc_assert (ok);
+
+       for (unsigned i = 0;
+            i < GET_MODE_SIZE (mode) / GET_MODE_SIZE (byte_mode);

The divisor is by definition 1.  I think dropping it would make the
loop more obviously correct, since the same assumption is implicit in
the loop body.

+            i++)
           {
-           unsigned HOST_WIDE_INT value = 0;
-           unsigned int limit = MIN (nelts - i, elts_per_int);
-           for (unsigned int j = 0; j < limit; ++j)
-           {
-             auto elt = INTVAL (CONST_VECTOR_ELT (x, i + j));
-             value |= (elt & mask) << (j * elt_bits);
-           }
-           output_constant_pool_2 (int_mode, gen_int_mode (value, int_mode),
-                                   i != 0 ? MIN (align, int_bits) : align);
+           unsigned HOST_WIDE_INT value = buffer[i];
+           output_constant_pool_2 (byte_mode, gen_int_mode (value, byte_mode),
+                                   i == 0 ? 0 : 1);

This should be i == 0 ? align : 1

OK with those three changes, thanks.  I tested on aarch64 and similarly
saw no regressions.  I don't think we need to test big-endian aarch64
specifically.
I made the three changes noted above, added a note for the fixed issue in BZ, and pushed the final result to the trunk for Robin.

Jeff

commit 509df13fbf0b3544cd39a9e0a5de11ce841bb185
Author: Robin Dapp <rdapp....@gmail.com>
Date:   Mon Dec 30 23:47:53 2024 -0700

    [PATCH v2] varasm: Use native_encode_rtx for constant vectors.
    
    optimize_constant_pool hashes vector masks by native_encode_rtx and
    merges identically hashed values in the constant pool.  Afterwards the
    optimized values are written in output_constant_pool_2.
    
    However, native_encode_rtx and output_constant_pool_2 disagree in their
    encoding of vector masks:  native_encode_rtx does not pad with zeroes
    while output_constant_pool_2 implicitly does.
    
    In RVV's shuffle-evenodd-run.c there are two masks
      (a) "0101" for V4BI and
      (b) "01010101" for V8BI
    that have the same representation/encoding ("1010101") in native_encode_rtx.
    output_constant_pool_2 uses "101" for (a) and "1010101" for (b).
    
    Now, optimize_constant_pool might happen to merge both masks using
    (a) as representative.  Then, output_constant_pool_2 will output "1010",
    which is only valid for the first mask as the implicit zero padding
    doesn't agree with (b).
    (b)'s "1010101" works for both masks as a V4BI load will ignore the last
    four padding bits.
    
    This patch makes output_constant_pool_2 use native_encode_rtx so both
    functions will agree on an encoding and output the correct constant.
    
            PR target/118036
    gcc/ChangeLog:
    
            * varasm.cc (output_constant_pool_2): Use native_encode_rtx for
            building the memory image of a const vector mask.

diff --git a/gcc/varasm.cc b/gcc/varasm.cc
index 0068ec2ce4d..507da629619 100644
--- a/gcc/varasm.cc
+++ b/gcc/varasm.cc
@@ -4301,34 +4301,17 @@ output_constant_pool_2 (fixed_size_mode mode, rtx x, unsigned int align)
       {
        gcc_assert (GET_CODE (x) == CONST_VECTOR);
 
-       /* Pick the smallest integer mode that contains at least one
-          whole element.  Often this is byte_mode and contains more
-          than one element.  */
-       unsigned int nelts = GET_MODE_NUNITS (mode);
-       unsigned int elt_bits = GET_MODE_PRECISION (mode) / nelts;
-       unsigned int int_bits = MAX (elt_bits, BITS_PER_UNIT);
-       scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require ();
-       unsigned int mask = GET_MODE_MASK (GET_MODE_INNER (mode));
-
-       /* We allow GET_MODE_PRECISION (mode) <= GET_MODE_BITSIZE (mode) but
-          only properly handle cases where the difference is less than a
-          byte.  */
-       gcc_assert (GET_MODE_BITSIZE (mode) - GET_MODE_PRECISION (mode) <
-                   BITS_PER_UNIT);
-
-       /* Build the constant up one integer at a time.  */
-       unsigned int elts_per_int = int_bits / elt_bits;
-       for (unsigned int i = 0; i < nelts; i += elts_per_int)
+       auto_vec<target_unit, 128> buffer;
+       buffer.reserve (GET_MODE_SIZE (mode));
+
+       bool ok = native_encode_rtx (mode, x, buffer, 0, GET_MODE_SIZE (mode));
+       gcc_assert (ok);
+
+       for (unsigned i = 0; i < GET_MODE_SIZE (mode); i++)
          {
-           unsigned HOST_WIDE_INT value = 0;
-           unsigned int limit = MIN (nelts - i, elts_per_int);
-           for (unsigned int j = 0; j < limit; ++j)
-           {
-             auto elt = INTVAL (CONST_VECTOR_ELT (x, i + j));
-             value |= (elt & mask) << (j * elt_bits);
-           }
-           output_constant_pool_2 (int_mode, gen_int_mode (value, int_mode),
-                                   i != 0 ? MIN (align, int_bits) : align);
+           unsigned HOST_WIDE_INT value = buffer[i];
+           output_constant_pool_2 (byte_mode, gen_int_mode (value, byte_mode),
+                                   i == 0 ? align : 1);
          }
        break;
       }
