Hi!

On Tue, Jul 07, 2020 at 02:50:09AM +0200, Hans-Peter Nilsson wrote:
> > On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> > > TL;DR: fixing a misdetection of what is a "simple move".
> > 
> > That is not a very correct characterisation of what this does :-)
> 
> That's apparently where we completely disagree. :-)
Well, I wrote that code, I know what is considered "just a move" there.
You want to extend that, and that is fine, but this is not a bug.  Taken
to the extreme, anything in GCC is completely buggy, because it doesn't
solve world hunger (yet!), following that line of thought.

> > > Looking into performace degradation after de-cc0 for CRIS, I
> > > noticed combine behaving badly; it changed a move and a
> > > right-shift into two right-shifts, where the "combined" move was
> > > not eliminated in later passes, and where the deficiency caused
> > > an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
> > > 
> > > Before de-cc0, the insns input to combine looked like:
> > >    33: r58:SI=r56:SI 0>>r48:SI
> > >       REG_DEAD r56:SI
> > >    35: r37:HI=r58:SI#0
> > > and after:
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> > 
> > So a shift like this is at most as expensive as a move, on your target
> > (or, in your backend, anyway ;-) )
> 
> On CRIS, the backend *and* the target, yes; one cycle, one short
> instruction.

So combine did what it is supposed to do.

> > > That is, there's always a parallel with a clobber of the
> > > condition-codes register.  Being a parallel, it's not an
> > > is_just_move, but e.g. a single_set.

This is something that happens on many targets.  For some, only for some
instructions (and flag registers).  For some, for many instructions; and
for really unhappy targets, for almost all instructions, even for
register moves and/or loads and/or stores.

> > > For the de-cc0:ed "combination", it ended up as
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > > That is, a move and a shift turned into two shifts; the extra
> > > shift is not eliminated by later passes, while the move was
> > > (with cc0, and "will be again") leading to redundant
> > > instructions.
> > 
> > Which was the whole point of the is_just_move() thing, yes.  Combine
> > doesn't know most moves will be eliminated by RA (but many are useful to
> > do have before RA, because it gives RA much more freedom).  If a move is
> > the same cost as a simple insn, doing two (say shift, like here) insns
> > in parallel is cheaper on most machines than having a shift and a move
> > sequentially.
> 
> Most parallel machines you mean,

No, I mean most machines, not just super-scalar machines.

> but why bring up them when
> there's no means for combine to tell the difference?

I did not bring them up, and this optimisation is important not only for
super-scalar targets.  But, the most important targets for GCC (in terms
of how many people use it, all targets are important for many other
reasons) are all supper-scalar.

> Here, the
> end result is that it *added* an instruction to the hot loop.

It did not.  Combine *never* does that.

> It's a deficiency and it caused a performance regression, can't
> argue with that.

It did not cause any regression.  It can be improved, sure, and thank
you for pointing that out!

> >  (2-2 combinations are helpful on single-scalar and even
> > in-order machines as well, btw).
> 
> I certainly don't contest that the move can be eliminated, and
> that most cost-effective 2-2 eliminations are helpful.  (See my
> other post about combine being eager with allowing same-cost
> combinations.)

I did not see that post, do you have a pointer?

> > > At first I thought this was due to parallel-ignorant old code
> > > but the "guilty" change is actually pretty recent.  Regarding a
> > > parallel with a clobber not being "just" a move, there's only
> > > the two adjacent callers seen in the patch (obviously with the
> > > rename), and their use exactly matches to check that the
> > > argument is a single_set which is a move.  It's always applied
> > > to an rtx_insn, so I changed the type and name to avoid the
> > > "just" issue.  I had to adjust the type when calling single_set.
> > 
> > But it is *not* supposed to be the same as single_set.
> 
> (I'm not saying that a single_set is a move.  But that's
> obvious.)  I guess you meant to say that a single_set with a
> general_operand as a source (as in the patch) is not supposed to
> be the same as a move.  This is the only place in combine where
> that distinction would be important.  Why?

It used to be that is_just_move was called on new*pat as well, so you
simply *could not* use single_set there (you have no rtx_insn*).

single_set also allows other insns, for example, multiple sets!  But
testing shows that this does not actually happen often at all, so we
could use single_set here and not change much at all for most other
targets.

> I think you're just overconcious about clobbers here.

It has nothing to do with that.

> > > I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and
> > > it seems parallels-as-sets were just overlooked and that this
> > 
> > They were not.  It causes regressions.
> 
> Do you have some pointers to PR:s or something else backing that
> statement, or is it your work-in-progress hinted below?

I do not know what "work in progress" you mean?

> >  That is why it has a different
> > name, not something with "single set".  "just_move" isn't a very good
> > name, I couldn't come up with something better, it is a pretty
> > complicated concept :-/
> > 
> > "i2_should_not_be_2_2_combined"?
> > 
> > I'll rerun some testing to show this.  It'll take a while.
> 
> Care to fill in some details on what kind of testing?

Same as always: I build the toolchain and Linux kernel for 30 different
targets, and compare the output.

For most things, just looking at the size of the generated binary is
telling (it shows some combinations did or did not happen).  For a more
detailed view, you need to use at the actual machine code diffs directly,
which is much more work (sometimes diff of size per function is useful
to see first as well).

For 2-2, size does *not* usually change, which brings us immediately
into "a lot more work" territory.  Oh, and all x86 compilers ICE.

> > > With this correction in place, the performance degradation
> > > related to de-cc0 of the CRIS port as measured by coremark is
> > > gone and turned into a small win.  N.B.: there certainly is a
> > > code difference in other hot functions, and the swing between
> > > different functions is larger than this difference; to be dealt
> > > with separately.
> > > 
> > > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> > > aarch64-linux (unexpectedly slow).
> > > 
> > > Ok to commit?
> > 
> > No, sorry.
> 
> Sigh.  I'm very interested in what your investigation will show.

It shows we can change to use single_set here.

> > One thing that could work is allowing (unused) clobbers of fixed
> > registers (like you have here), or maybe some hook is needed to say this
> > register is like a flags register, or similar.  That should work for you,
> > and not regress other targets, maybe even help a little?  We'll see.
> 
> Still, there is already TARGET_FLAGS_REGNUM (a "hooked"
> constant), so I take it you would be happy if we recognize a
> clobber of *just that*, in a parallel?  (I'll take care of
> updating tm.texi of course.)

Most targets do not use cmpelim, and many *can not* define
TARGET_FLAGS_REGNUM, unfortunately.

I'll review the original patch again, to point out where it still needs
changing.

Thanks,


Segher

Reply via email to