Segher Boessenkool <seg...@kernel.crashing.org> writes: > On Thu, Mar 09, 2023 at 12:10:51PM +0000, Richard Sandiford wrote: >> g:c23a9c87cc62bd177fd0d4db6ad34b34e1b9a31f uses nonzero_bits >> information to convert sign_extends into zero_extends. >> That change is semantically correct in itself, but for the >> testcase in the PR, it leads to a series of unfortunate events, >> as described below. > > That patch also does other much more obviously beneficial transforms. > These things should not have been mashed together. Smaller more atomic > patches are much nicer to handle :-/ > >> We try to combine: >> >> Trying 24 -> 25: >> 24: r116:DI=sign_extend(r115:SI) >> REG_DEAD r115:SI >> 25: r117:SI=[r116:DI*0x4+r118:DI] >> REG_DEAD r116:DI >> REG_EQUAL [r116:DI*0x4+`constellation_64qam'] >> >> which previously succeeded, giving: >> >> (set (reg:SI 117 [ constellation_64qam[_5] ]) >> (mem/u:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 115)) >> (const_int 4 [0x4])) >> (reg/f:DI 118)) [1 constellation_64qam[_5]+0 S4 A32])) >> >> However, nonzero_bits can tell that only the low 6 bits of r115 >> can be nonzero. The commit above therefore converts the sign_extend >> to a zero_extend. Using the same nonzero_bits information, we then >> "expand" the zero_extend to: >> >> (and:DI (subreg:DI (reg:SI r115) 0) >> (const_int 63)) > > This is more expensive already?! An "and" usually costs a real machine > instruction, while subregs are usually free (just notational). The > "and" can not easily be optimised away either (after combine we only > have less accurate nonzero_bits, to start with).
Well, it's an "and" in place of a zero_extend rather than an "and" in place of a subreg. So it is one real instruction for another real instruction. But yeah, it's not exactly a simplification. The "we" here is expand_compound_operation, via simplify_and_const_int. So it's the usual thing: expand_compound_operation creates something that is perhaps more expensive, then after simplification, make_compound_operation is supposed to collapse these "expanded" perhaps-more-expensive forms back. And it's the last bit that goes wrong. >> Substituting into the mult gives the unsimplified expression: >> >> (mult:DI (and:DI (subreg:DI (reg:SI r115) 0) >> (const_int 63)) >> (const_int 4)) >> >> The simplification rules for mult convert this to an ashift by 2. >> Then, this rule in simplify_shift_const_1: >> >> /* If we have (shift (logical)), move the logical to the outside >> to allow it to possibly combine with another logical and the >> shift to combine with another shift. This also canonicalizes to >> what a ZERO_EXTRACT looks like. Also, some machines have >> (and (shift)) insns. */ >> >> moves the shift inside the "and", so that the expression becomes: >> >> (and:DI (ashift:DI (subreg:DI (reg:SI r115) 0) >> (const_int 2)) >> (const_int 252)) >> >> We later recanonicalise to a mult (since this is an address): >> >> (and:DI (mult:DI (subreg:DI (reg:SI r115) 0) >> (const_int 4)) >> (const_int 252)) >> >> But we fail to transform this back to the natural substitution: >> >> (mult:DI (zero_extend:DI (reg:SI r115)) >> (const_int 4)) >> >> There are several other cases in which make_compound_operation >> needs to look more than one level down in order to complete a >> compound operation. For example: >> >> (a) the ashiftrt handling uses extract_left_shift to look through >> things like logic ops in order to find a partnering ashift >> operation >> >> (b) the "and" handling looks through subregs, xors and iors >> to find a partnerning lshiftrt >> >> This patch takes the same approach for mult. >> >> gcc/ >> PR rtl-optimization/106594 >> * combine.cc (make_compound_operation_and): Look through >> multiplications by a power of two. >> >> gcc/testsuite/ >> * gcc.target/aarch64/pr106594.c: New test. >> --- >> gcc/combine.cc | 17 +++++++++++++++++ >> gcc/testsuite/gcc.target/aarch64/pr106594.c | 20 ++++++++++++++++++++ >> 2 files changed, 37 insertions(+) >> create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594.c >> >> diff --git a/gcc/combine.cc b/gcc/combine.cc >> index 7d446d02cb4..36d04ad6703 100644 >> --- a/gcc/combine.cc >> +++ b/gcc/combine.cc >> @@ -7996,6 +7996,23 @@ make_compound_operation_and (scalar_int_mode mode, >> rtx x, >> break; >> } >> >> + case MULT: >> + /* Recurse through a power of 2 multiplication (as can be found >> + in an address), using the relationship: >> + >> + (and (mult X 2**N1) N2) == (mult (and X (lshifrt N2 N1)) 2**N1). */ >> + if (CONST_INT_P (XEXP (x, 1)) >> + && pow2p_hwi (INTVAL (XEXP (x, 1)))) >> + { >> + int shift = exact_log2 (INTVAL (XEXP (x, 1))); >> + rtx sub = make_compound_operation_and (mode, XEXP (x, 0), >> + mask >> shift, in_code, >> + next_code); >> + if (sub) >> + return gen_rtx_MULT (mode, sub, XEXP (x, 1)); >> + } >> + break; >> + >> default: >> break; >> } >> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594.c >> b/gcc/testsuite/gcc.target/aarch64/pr106594.c >> new file mode 100644 >> index 00000000000..beda8e050a5 >> --- /dev/null >> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594.c >> @@ -0,0 +1,20 @@ >> +/* { dg-options "-O2" } */ >> + >> +extern const int constellation_64qam[64]; >> + >> +void foo(int nbits, >> + const char *p_src, >> + int *p_dst) { >> + >> + while (nbits > 0U) { >> + char first = *p_src++; >> + >> + char index1 = ((first & 0x3) << 4) | (first >> 4); >> + >> + *p_dst++ = constellation_64qam[index1]; >> + >> + nbits--; >> + } >> +} >> + >> +/* { dg-final { scan-assembler {ldr\tw[0-9]+, \[x[0-9]+, w[0-9]+, [su]xtw >> #?2\]} } } */ > > This looks pretty good (thanks!), but as always it needs testing on more > architectures, showing it doesn't hurt things. It should be beneficial, > but it is not unlikely to hurt other existing backends, and we are not > in stage 1 (we are in stage 4 even!) Yeah, but it's fixing a GCC 13 regression (and an important one on aarch64). > Do you have any such proof / indication / anything? I can start > some run but it takes a day (or two), and I cannot start it until next > week. This is an alternative presentation of the change that we discussed a few weeks ago, and that you already tested: https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613486.html The results seem to indicate that the patch had no effect on targets other than aarch64. [A good thing, IMO. :-) The purpose of the patch is to fix the aarch64 regression in a minimally invasive way.] I tried building an aarch64 linux kernel locally with -Os -fno-schedule-insns{,2}. I saw code size improvements in 182 .os and a regression in only one .o. (I was comparing individual .os because it makes isolation easier.) The one regression was because the optimised version had fewer pseudos, and so something that was previously allocated to x3 was allocated to x2. This pseudo was initialised to 0, and a preceding stack protect instruction had the known side effect (both before and after the patch) of setting x3 to 0. So with the x3 allocation, postreload was able to remove the separate initialisation of x3 with 0, but with the x2 allocation it couldn't. So for my kernel build it was a 182:1 split in favour of improvements, with the one regression being a quirk rather than a true regression. I also tried seeing what effect it had for aarch64-linux-gnu on gcc.c-torture, gcc.dg and g++.dg. All the changes were improvements here too. (The point of using these testsuites are that they contain small, naturally isolated tests that can be quickly compared en masse, rather than because the code is representative of a real workload.) ArmRAL (the workload that showed up the problem originally) also improved, as expected. I'm confident that this an improvement on aarch64. > Do you have plans to make combine not do this insane "and" thing at all? > Or to attack the compound operation stuff head on? I don't have any plans to do that myself, but I agree it would be better to get rid of the compound operation stuff if we can. I can see why the expand/simplify/remake process seemed like a nice idea, but in practice, there's just too much that can go wrong. And removing the compound stuff would go a long way to making combine simplification more like fwprop simpliification (unlike now, where we effectively have two separate simplification processes for things involving *_extends and *_extracts). Thanks, Richard