Re: [PATCH] simplify-rtx: Limit number of elts in when encoding.

Richard Sandiford Tue, 24 Dec 2024 04:23:11 -0800

"Robin Dapp" <rdapp....@gmail.com> writes:
> [PATCH] varasm: Use native_encode_rtx for constant vectors.
>
> optimize_constant_pool hashes vector masks by native_encode_rtx and
> merges identically hashed values in the constant pool.  Afterwards the
> optimized values are written in output_constant_pool_2.
>
> However, native_encode_rtx and output_constant_pool_2 disagree in their
> encoding of vector masks:  native_encode_rtx does not pad with zeroes
> while output_constant_pool_2 implicitly does.
>
> In RVV's shuffle-evenodd-run.c there are two masks
>   (a) "0101" for V4BI
>   (b) "01010101" for V8BI and
> that have the same representation/encoding ("1010101") in native_encode_rtx.
> output_constant_pool_2 uses "101" for (a) and "1010101" for (b).
>
> Now, optimize_constant_pool might happen to merge both masks using
> (a) as representative.  Then, output_constant_pool_2 will output "1010"
> which is only valid for the second mask as the implicit zero padding
> doesn't agree with (b).
> (b)'s "1010101" works for both masks as a V4BI load will ignore the last four
> padding bits.
>
> This patch makes output_constant_pool_2 use native_encode_rtx so both
> functions will agree on an encoding and output the correct constant.
>
> gcc/ChangeLog:
>
>       * varasm.cc (output_constant_pool_2): Use native_encode_rtx for
>       building the memory image of a const vector mask.
> ---
>  gcc/varasm.cc | 32 ++++++++++++--------------------
>  1 file changed, 12 insertions(+), 20 deletions(-)


Thanks for trying it this way.

> diff --git a/gcc/varasm.cc b/gcc/varasm.cc
> index 0068ec2ce4d..584d1864077 100644
> --- a/gcc/varasm.cc
> +++ b/gcc/varasm.cc
> @@ -4301,34 +4301,26 @@ output_constant_pool_2 (fixed_size_mode mode, rtx x, 
> unsigned int align)
>        {
>       gcc_assert (GET_CODE (x) == CONST_VECTOR);
>  
> -     /* Pick the smallest integer mode that contains at least one
> -        whole element.  Often this is byte_mode and contains more
> -        than one element.  */
> +     auto_vec<target_unit, 128> buffer;
> +     buffer.truncate (0);
> +     buffer.reserve (GET_MODE_SIZE (mode));
> +
> +     bool ok = native_encode_rtx (mode, x, buffer, 0, GET_MODE_SIZE (mode));
> +     gcc_assert (ok);
> +

This part looks good.

>       unsigned int nelts = GET_MODE_NUNITS (mode);
>       unsigned int elt_bits = GET_MODE_PRECISION (mode) / nelts;
>       unsigned int int_bits = MAX (elt_bits, BITS_PER_UNIT);
>       scalar_int_mode int_mode = int_mode_for_size (int_bits, 0).require ();
> -     unsigned int mask = GET_MODE_MASK (GET_MODE_INNER (mode));
>  
> -     /* We allow GET_MODE_PRECISION (mode) <= GET_MODE_BITSIZE (mode) but
> -        only properly handle cases where the difference is less than a
> -        byte.  */
> -     gcc_assert (GET_MODE_BITSIZE (mode) - GET_MODE_PRECISION (mode) <
> -                 BITS_PER_UNIT);
> -
> -     /* Build the constant up one integer at a time.  */
> -     unsigned int elts_per_int = int_bits / elt_bits;
> -     for (unsigned int i = 0; i < nelts; i += elts_per_int)
> +     for (unsigned i = 0;
> +          i < GET_MODE_SIZE (mode) / GET_MODE_SIZE (int_mode);
> +          i += GET_MODE_SIZE (int_mode))
>         {
>           unsigned HOST_WIDE_INT value = 0;
> -         unsigned int limit = MIN (nelts - i, elts_per_int);
> -         for (unsigned int j = 0; j < limit; ++j)
> -         {
> -           auto elt = INTVAL (CONST_VECTOR_ELT (x, i + j));
> -           value |= (elt & mask) << (j * elt_bits);
> -         }
> +         memcpy (&value, buffer.address () + i, GET_MODE_SIZE (int_mode));
>           output_constant_pool_2 (int_mode, gen_int_mode (value, int_mode),
> -                                 i != 0 ? MIN (align, int_bits) : align);
> +                                 align);

But this part doesn't look endian-safe, since it mixes tha target
endianness of the encoded buffer with the host endianness of "value"
(via the the memcpy).

If we wanted to contiue generating the constant in int_mode, we'd need
to decode sections of the buffer back to int_mode first.  That doesn't
seem worth the effort though.

How about just using the global byte_mode and iterating over every
element of the buffer?  The alignment would be "align" for i == 0 and
1 otherwise.

That might require some scan-assembler updates, but it seems worth it.
I can test for aarch64 testsuite updates next week if that'd help.

Thanks,
Richard

Re: [PATCH] simplify-rtx: Limit number of elts in when encoding.

Reply via email to