> From: Segher Boessenkool <seg...@kernel.crashing.org> > Date: Tue, 7 Jul 2020 01:42:47 +0200
TL;DR: recognize a parallel with a clobber of TARGET_FLAGS_REGNUM? > Hi! > > On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote: > > TL;DR: fixing a misdetection of what is a "simple move". > > That is not a very correct characterisation of what this does :-) That's apparently where we completely disagree. :-) > > Looking into performace degradation after de-cc0 for CRIS, I > > noticed combine behaving badly; it changed a move and a > > right-shift into two right-shifts, where the "combined" move was > > not eliminated in later passes, and where the deficiency caused > > an extra insn in a hot loop: crcu16 (and crcu32) in coremark. > > > > Before de-cc0, the insns input to combine looked like: > > 33: r58:SI=r56:SI 0>>r48:SI > > REG_DEAD r56:SI > > 35: r37:HI=r58:SI#0 > > and after: > > 33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;} > > REG_DEAD r56:SI > > REG_UNUSED dccr:CC > > 35: {r37:HI=r58:SI#0;clobber dccr:CC;} > > REG_UNUSED dccr:CC > > So a shift like this is at most as expensive as a move, on your target > (or, in your backend, anyway ;-) ) On CRIS, the backend *and* the target, yes; one cycle, one short instruction. > > That is, there's always a parallel with a clobber of the > > condition-codes register. Being a parallel, it's not an > > is_just_move, but e.g. a single_set. > > > > For the de-cc0:ed "combination", it ended up as > > 33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;} > > REG_UNUSED dccr:CC > > 35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;} > > REG_DEAD r56:SI > > REG_UNUSED dccr:CC > > That is, a move and a shift turned into two shifts; the extra > > shift is not eliminated by later passes, while the move was > > (with cc0, and "will be again") leading to redundant > > instructions. > > Which was the whole point of the is_just_move() thing, yes. Combine > doesn't know most moves will be eliminated by RA (but many are useful to > do have before RA, because it gives RA much more freedom). If a move is > the same cost as a simple insn, doing two (say shift, like here) insns > in parallel is cheaper on most machines than having a shift and a move > sequentially. Most parallel machines you mean, but why bring up them when there's no means for combine to tell the difference? Here, the end result is that it *added* an instruction to the hot loop. It's a deficiency and it caused a performance regression, can't argue with that. > (2-2 combinations are helpful on single-scalar and even > in-order machines as well, btw). I certainly don't contest that the move can be eliminated, and that most cost-effective 2-2 eliminations are helpful. (See my other post about combine being eager with allowing same-cost combinations.) > > At first I thought this was due to parallel-ignorant old code > > but the "guilty" change is actually pretty recent. Regarding a > > parallel with a clobber not being "just" a move, there's only > > the two adjacent callers seen in the patch (obviously with the > > rename), and their use exactly matches to check that the > > argument is a single_set which is a move. It's always applied > > to an rtx_insn, so I changed the type and name to avoid the > > "just" issue. I had to adjust the type when calling single_set. > > But it is *not* supposed to be the same as single_set. (I'm not saying that a single_set is a move. But that's obvious.) I guess you meant to say that a single_set with a general_operand as a source (as in the patch) is not supposed to be the same as a move. This is the only place in combine where that distinction would be important. Why? I think you're just overconcious about clobbers here. The combine pass throws and adds them as it pleases in other places, so I'm not sure what your worries are in *this* place. (It collects them and adds them to the combinations.) > > I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and > > it seems parallels-as-sets were just overlooked and that this > > They were not. It causes regressions. Do you have some pointers to PR:s or something else backing that statement, or is it your work-in-progress hinted below? > That is why it has a different > name, not something with "single set". "just_move" isn't a very good > name, I couldn't come up with something better, it is a pretty > complicated concept :-/ > > "i2_should_not_be_2_2_combined"? > > I'll rerun some testing to show this. It'll take a while. Care to fill in some details on what kind of testing? > > patch appears to agree with the intent and the comment at the > > use of i2_was_move and i3_was_move, which has a clause saying > > "Also do this if we started with two insns neither of which was > > a simple move". > > That says that we combine to 2 insns also if we started with only 2, > but neither of those was just a move. That's what I'm saying too. > > With this correction in place, the performance degradation > > related to de-cc0 of the CRIS port as measured by coremark is > > gone and turned into a small win. N.B.: there certainly is a > > code difference in other hot functions, and the swing between > > different functions is larger than this difference; to be dealt > > with separately. > > > > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through > > aarch64-linux (unexpectedly slow). > > > > Ok to commit? > > No, sorry. Sigh. I'm very interested in what your investigation will show. > One thing that could work is allowing (unused) clobbers of fixed > registers (like you have here), or maybe some hook is needed to say this > register is like a flags register, or similar. That should work for you, > and not regress other targets, maybe even help a little? We'll see. Still, there is already TARGET_FLAGS_REGNUM (a "hooked" constant), so I take it you would be happy if we recognize a clobber of *just that*, in a parallel? (I'll take care of updating tm.texi of course.) brgds, H-P