> From: Segher Boessenkool <seg...@kernel.crashing.org> > Date: Tue, 7 Jul 2020 22:23:59 +0200
> Hi! > > On Tue, Jul 07, 2020 at 02:50:09AM +0200, Hans-Peter Nilsson wrote: > > > On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote: > > > > TL;DR: fixing a misdetection of what is a "simple move". > > > > > > That is not a very correct characterisation of what this does :-) > > > > That's apparently where we completely disagree. :-) > > Well, I wrote that code, I know what is considered "just a move" there. You lost some context: I'm comparing before/after the cc0-conversion for CRIS, where this is a misdetection (a false negative) of a move and causes a performance-regression. > You want to extend that, and that is fine, but this is not a bug. Taken > to the extreme, anything in GCC is completely buggy, because it doesn't > solve world hunger (yet!), following that line of thought. Hyperboles are funny, but only to some extent. > > > > Looking into performace degradation after de-cc0 for CRIS, I > > > > noticed combine behaving badly; it changed a move and a > > > > right-shift into two right-shifts, where the "combined" move was > > > > not eliminated in later passes, and where the deficiency caused > > > > an extra insn in a hot loop: crcu16 (and crcu32) in coremark. > > > > > > > > Before de-cc0, the insns input to combine looked like: > > > > 33: r58:SI=r56:SI 0>>r48:SI > > > > REG_DEAD r56:SI > > > > 35: r37:HI=r58:SI#0 > > > > and after: > > > > 33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;} > > > > REG_DEAD r56:SI > > > > REG_UNUSED dccr:CC > > > > 35: {r37:HI=r58:SI#0;clobber dccr:CC;} > > > > REG_UNUSED dccr:CC ... > > Here, the > > end result is that it *added* an instruction to the hot loop. > > It did not. Combine *never* does that. The "it" refers not just to combine here, but the compilation as a whole. The *end* result is an added instruction, since a move+shift was changed into shift+shift and later passes couldn't eliminate it. Just the thing your is_just_move was supposed to prevent. > > It's a deficiency and it caused a performance regression, can't > > argue with that. > > It did not cause any regression. It can be improved, sure, and thank > you for pointing that out! It caused a regression compared to the cc0 version, per above. > > > (2-2 combinations are helpful on single-scalar and even > > > in-order machines as well, btw). > > > > I certainly don't contest that the move can be eliminated, and > > that most cost-effective 2-2 eliminations are helpful. (See my > > other post about combine being eager with allowing same-cost > > combinations.) > > I did not see that post, do you have a pointer? https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549416.html > > > > At first I thought this was due to parallel-ignorant old code > > > > but the "guilty" change is actually pretty recent. Regarding a > > > > parallel with a clobber not being "just" a move, there's only > > > > the two adjacent callers seen in the patch (obviously with the > > > > rename), and their use exactly matches to check that the > > > > argument is a single_set which is a move. It's always applied > > > > to an rtx_insn, so I changed the type and name to avoid the > > > > "just" issue. I had to adjust the type when calling single_set. > > > > > > But it is *not* supposed to be the same as single_set. > > > > (I'm not saying that a single_set is a move. But that's > > obvious.) I guess you meant to say that a single_set with a > > general_operand as a source (as in the patch) is not supposed to > > be the same as a move. This is the only place in combine where > > that distinction would be important. Why? > > It used to be that is_just_move was called on new*pat as well, so you > simply *could not* use single_set there (you have no rtx_insn*). Ah, history. Hard to detect deleted code. > single_set also allows other insns, for example, multiple sets! ...where the other sets are unused. Not sure how that kind of thing would get combined here, but if its combined cost is beneficial, it would be a win, I guess. > But > testing shows that this does not actually happen often at all, so we > could use single_set here and not change much at all for most other > targets. > > > I think you're just overconcious about clobbers here. > > It has nothing to do with that. > > > > > I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and > > > > it seems parallels-as-sets were just overlooked and that this > > > > > > They were not. It causes regressions. > > > > Do you have some pointers to PR:s or something else backing that > > statement, or is it your work-in-progress hinted below? > > I do not know what "work in progress" you mean? I'm referring to your "I'll rerun some testing to show this. It'll take a while." Are the regressions you refer to above tracked in bugzilla or on some mailing list? (I'm just meaning current code; if there was a historical regression related to this code that'd be interesting too.) > > > That is why it has a different > > > name, not something with "single set". "just_move" isn't a very good > > > name, I couldn't come up with something better, it is a pretty > > > complicated concept :-/ > > > > > > "i2_should_not_be_2_2_combined"? > > > > > > I'll rerun some testing to show this. It'll take a while. > > > > Care to fill in some details on what kind of testing? > > Same as always: I build the toolchain and Linux kernel for 30 different > targets, and compare the output. > > For most things, just looking at the size of the generated binary is > telling (it shows some combinations did or did not happen). For a more > detailed view, you need to use at the actual machine code diffs directly, > which is much more work (sometimes diff of size per function is useful > to see first as well). > > For 2-2, size does *not* usually change, which brings us immediately > into "a lot more work" territory. Oh, and all x86 compilers ICE. If combine only did lower-cost combinations (perhaps with Richard Sandifords lower-size-when-tied suggestion), I guess this wouldn't happen. 0:-) > > > > With this correction in place, the performance degradation > > > > related to de-cc0 of the CRIS port as measured by coremark is > > > > gone and turned into a small win. N.B.: there certainly is a > > > > code difference in other hot functions, and the swing between > > > > different functions is larger than this difference; to be dealt > > > > with separately. > > > > > > > > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through > > > > aarch64-linux (unexpectedly slow). > > > > > > > > Ok to commit? > > > > > > No, sorry. > > > > Sigh. I'm very interested in what your investigation will show. > > It shows we can change to use single_set here. Did you mean "will show whether" or is it already complete? > > > One thing that could work is allowing (unused) clobbers of fixed > > > registers (like you have here), or maybe some hook is needed to say this > > > register is like a flags register, or similar. That should work for you, > > > and not regress other targets, maybe even help a little? We'll see. > > > > Still, there is already TARGET_FLAGS_REGNUM (a "hooked" > > constant), so I take it you would be happy if we recognize a > > clobber of *just that*, in a parallel? (I'll take care of > > updating tm.texi of course.) > > Most targets do not use cmpelim, and many *can not* define > TARGET_FLAGS_REGNUM, unfortunately. Of course I meant an update where they'd see no difference to previous behavior and not using single_set; only insns where wither the pattern is a SET *or* two-sized parallels with one SET and the other a clobber of TARGET_FLAGS_REGNUM... > I'll review the original patch again, to point out where it still needs > changing. ...but if you're in progress with a single_set variant, I'm all for it. brgds, H-P