On Thu, Mar 09, 2023 at 12:10:51PM +0000, Richard Sandiford wrote:
> g:c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f uses nonzero_bits
> information to convert sign_extends into zero_extends.
> That change is semantically correct in itself, but for the
> testcase in the PR, it leads to a series of unfortunate events,
> as described below.

That patch also does other, much more obviously beneficial transforms.
These things should not have been mashed together.  Smaller, more atomic
patches are much nicer to handle :-/

> We try to combine:
> 
> Trying 24 -> 25:
>    24: r116:DI=sign_extend(r115:SI)
>       REG_DEAD r115:SI
>    25: r117:SI=[r116:DI*0x4+r118:DI]
>       REG_DEAD r116:DI
>       REG_EQUAL [r116:DI*0x4+`constellation_64qam']
> 
> which previously succeeded, giving:
> 
> (set (reg:SI 117 [ constellation_64qam[_5] ])
>     (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115))
>                 (const_int 4 [0x4]))
>             (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32]))
> 
> However, nonzero_bits can tell that only the low 6 bits of r115
> can be nonzero.  The commit above therefore converts the sign_extend
> to a zero_extend.  Using the same nonzero_bits information, we then
> "expand" the zero_extend to:
> 
>   (and:DI (subreg:DI (reg:SI r115) 0)
>           (const_int 63))

This is more expensive already?!  An "and" usually costs a real machine
instruction, while subregs are usually free (just notational).  The
"and" can not easily be optimised away either (after combine we only
have less accurate nonzero_bits, to start with).
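
(As a side note for readers following along: the identity nonzero_bits is
exploiting here can be spelled out in a few lines of C.  This is only an
illustrative sketch, not anything from the patch; r115 stands in for the
pseudo register above.)

  #include <assert.h>
  #include <stdint.h>

  int main (void)
  {
    /* nonzero_bits says only the low 6 bits of r115 can be set.  */
    for (uint32_t r115 = 0; r115 < 64; r115++)
      {
        /* With the sign bit known zero, sign- and zero-extension agree...  */
        assert ((uint64_t)(int64_t)(int32_t) r115 == (uint64_t) r115);
        /* ...and the zero_extend is equivalent to masking with 63,
           which is how combine "expands" it.  */
        assert ((uint64_t) r115 == ((uint64_t) r115 & 63));
      }
    return 0;
  }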

> Substituting into the mult gives the unsimplified expression:
> 
>   (mult:DI (and:DI (subreg:DI (reg:SI r115) 0)
>                    (const_int 63))
>            (const_int 4))
> 
> The simplification rules for mult convert this to an ashift by 2.
> Then, this rule in simplify_shift_const_1:
> 
>         /* If we have (shift (logical)), move the logical to the outside
>            to allow it to possibly combine with another logical and the
>            shift to combine with another shift.  This also canonicalizes to
>            what a ZERO_EXTRACT looks like.  Also, some machines have
>            (and (shift)) insns.  */
> 
> moves the shift inside the "and", so that the expression becomes:
> 
>   (and:DI (ashift:DI (subreg:DI (reg:SI r115) 0)
>                      (const_int 2))
>           (const_int 252))
> 
> We later recanonicalise to a mult (since this is an address):
> 
>   (and:DI (mult:DI (subreg:DI (reg:SI r115) 0)
>                    (const_int 4))
>           (const_int 252))
> 
> But we fail to transform this back to the natural substitution:
> 
>   (mult:DI (zero_extend:DI (reg:SI r115))
>            (const_int 4))
> 
> There are several other cases in which make_compound_operation
> needs to look more than one level down in order to complete a
> compound operation.  For example:
> 
> (a) the ashiftrt handling uses extract_left_shift to look through
>     things like logic ops in order to find a partnering ashift
>     operation
> 
> (b) the "and" handling looks through subregs, xors and iors
>     to find a partnerning lshiftrt
> 
> This patch takes the same approach for mult.
> 
> gcc/
>       PR rtl-optimization/106594
>       * combine.cc (make_compound_operation_and): Look through
>       multiplications by a power of two.
> 
> gcc/testsuite/
>       * gcc.target/aarch64/pr106594.c: New test.
> ---
>  gcc/combine.cc                              | 17 +++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/pr106594.c | 20 ++++++++++++++++++++
>  2 files changed, 37 insertions(+)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c
> 
> diff --git a/gcc/combine.cc b/gcc/combine.cc
> index 7d446d02cb4..36d04ad6703 100644
> --- a/gcc/combine.cc
> +++ b/gcc/combine.cc
> @@ -7996,6 +7996,23 @@ make_compound_operation_and (scalar_int_mode mode, rtx 
> x,
>       break;
>        }
>  
> +    case MULT:
> +      /* Recurse through a power of 2 multiplication (as can be found
> +      in an address), using the relationship:
> +
> +      (and (mult X 2**N1) N2) == (mult (and X (lshifrt N2 N1)) 2**N1).  */
> +      if (CONST_INT_P (XEXP (x, 1))
> +       && pow2p_hwi (INTVAL (XEXP (x, 1))))
> +     {
> +       int shift = exact_log2 (INTVAL (XEXP (x, 1)));
> +       rtx sub = make_compound_operation_and (mode, XEXP (x, 0),
> +                                              mask >> shift, in_code,
> +                                              next_code);
> +       if (sub)
> +         return gen_rtx_MULT (mode, sub, XEXP (x, 1));
> +     }
> +      break;
> +
>      default:
>        break;
>      }
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c 
> b/gcc/testsuite/gcc.target/aarch64/pr106594.c
> new file mode 100644
> index 00000000000..beda8e050a5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2" } */
> +
> +extern const int constellation_64qam[64];
> +
> +void foo(int nbits,
> +         const char *p_src,
> +         int *p_dst) {
> +
> +  while (nbits > 0U) {
> +    char first = *p_src++;
> +
> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
> +
> +    *p_dst++ = constellation_64qam[index1];
> +
> +    nbits--;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw 
> #?2\]} } } */

This looks pretty good (thanks!), but as always it needs testing on more
architectures, to show it doesn't hurt things.  It should be beneficial,
but it is not unlikely to hurt other existing backends, and we are not
in stage 1 (we are in stage 4, even!).

Do you have any such proof / indication / anything?  I can start
a run, but it takes a day (or two), and I cannot start it until next
week.

Do you have plans to make combine not do this insane "and" thing at all?
Or to attack the compound operation stuff head-on?


Segher
