On Thu, Sep 03, 2020 at 07:03:53AM -0300, Alexandre Oliva wrote:
> On Sep  2, 2020, Segher Boessenkool <seg...@kernel.crashing.org> wrote:
> > (And aarch, but not the other targets with default clobbers).
> 
> And arm.  I didn't look very hard for others yet; I wasn't sure it would
> be worth pursuing any further.

Oh, I meant what was in your patch.  From a grep, all of mn10300, nds32,
arm, cris, pdp11, rs6000, i386, visium, aarch64 have default clobbers.
Hrm, or did you mean the targets with inline asm flag output
constraints?

> > People write empty asm statements not because they would like no insns
> > emitted from it, but *because* they want the other effects an asm has
> > (for example, an empty asm usually has no outputs, so it is volatile,
> > and that makes sure it is executed in the real machine exactly as often
> > as in the abstract machine).  So your expectation might be wrong,
> > someone might want an empty asm to clobber cc on x86 (like any asm is
> > documented as doing).
> 
> I just can't imagine a case in which flags set by an earlier
> instruction, and that could be usable by a subsequent instruction, could
> possibly be clobbered by an *empty* sequence of insns.  Maybe I just
> lack imagination about all the sorts of weird things that people use
> such asm statements for...  Got any example of something that could be
> broken by such a change to educate me?  Not that I need it, I'm just
> curious.

Before your change, flags could not be live through an asm.  After your
change, they can.  So it does change things...  It seems like this should
never matter, but :-)
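
For concreteness, the kind of thing I mean (a toy example of my own, not
from your testcase):

  int f (int a, int b)
  {
    int c = a + b;        /* sets the flags on x86 */
    asm volatile ("");    /* empty, no outputs, so volatile anyway */
    return c == 0;        /* wants to reuse ZF from the add */
  }

Before your change, the default flags clobber on that empty asm forced a
separate test insn after it (the flags could not be live through the asm);
after your change they can be.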

> > But how about a "none" clobber?
> 
> I like that!

Okay, sold!  Now to convince everyone else ;-)
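
I imagine it would be spelled something like

  asm volatile ("" : : : "none");   /* no clobbers at all, not even the
                                       target's default ones */

but the exact spelling is bikeshed material, of course.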

> I don't know.  There's a piece of hand-crafted x86 assembly that I'm
> trying to find some way to replicate in compiler-optimized code.  I
> don't know whether two RMWs could possibly be faster, but there are two
> concerns driving me to try to use the same insn sequence:
> 
> - compactness; this is for per-block instrumentation, and if we use 4
>   larger insns instead of 2 smaller ones, there might be even more
>   significant impact on performance because of code cache issues

Yes, it *is* smaller code, certainly.  As I said before, GCC does not
generate very good code with RMWs :-/
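
Just so we are talking about the same shape of code, I am assuming the
instrumentation is essentially a double-word counter bump, something like

  void bump (unsigned long long *counter)
  {
    /* The hand-crafted version does this with two RMW memory insns
       (on a 32-bit target, an add and an adc straight to the two
       halves of *counter); GCC tends to load both halves, add in
       registers, and store them back instead.  */
    *counter += 1;
  }

Correct me if the real code looks different.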

> >> If we retried the 3-insn substitution after substituting the flag store
> >> into the add for the adc,
> 
> > combine should retry every combination if any of the input insns to it
> > have changed (put another way, if any insn is changed all combinations
> > with it are tried anew).
> 
> Alas, they don't change; it's an intervening flag store that does.  When
> we try the 3 insns that make up RMW, while the flag store is still there
> before the W, it triggers one of the cant_combine tests.  After it's
> gone, we don't reconsider.  Does that count as a bug to be filed?

An enhancement bug, sure...  "missing optimisation".  But this is
expected behaviour at least :-)

> >> we might succeed, but only if we had a pattern
> >> that matched add<mode>3_cc_overflow_1's parallel with the flag-setter as
> >> the second element of the parallel, because that's where combine adds it
> >> to the new i3 pattern, after splitting it out of i2.
> 
> > That sounds like the backend pattern has it wrong then?  There is a
> > canonical order for this?
> 
> As far as I can tell, there isn't; it's just an arbitrary choice made by
> backends, some doing it one way and some the other, and that causes
> combine to be able to perform some combinations but not others.

Hrm, I could swear it was in the documentation somewhere, but I cannot
find it currently.  I can find comments in combine.c that refer to this
though, like
  /* Many machines that don't use CC0 have insns that can both perform an
     arithmetic operation and set the condition code.  These operations will
     be represented as a PARALLEL with the first element of the vector
     being a COMPARE of an arithmetic operation with the constant zero.
     The second element of the vector will set some pseudo to the result
     of the same arithmetic operation.

Ah found it (md.texi, @cindex @code{compare}, canonicalization of):

@item
For instructions that inherently set a condition code register, the
@code{compare} operator is always written as the first RTL expression of
the @code{parallel} instruction pattern.  For example,

@smallexample
(define_insn ""
  [(set (reg:CCZ FLAGS_REG)
        (compare:CCZ
          (plus:SI
            (match_operand:SI 1 "register_operand" "%r")
            (match_operand:SI 2 "register_operand" "r"))
          (const_int 0)))
   (set (match_operand:SI 0 "register_operand" "=r")
        (plus:SI (match_dup 1) (match_dup 2)))]
  ""
  "addl %0, %1, %2")
@end smallexample

Not sure this 100% applies to your case though?

> >> I suppose adding such patterns manually isn't the way to go.  I wonder
> >> if getting recog_for_combine to recognize and reorder PARALLELs
> >> appearing out of order would get too expensive, even if genrecog were to
> >> generate optimized code to try alternate orders in parallels.
> 
> > Very big parallels are used, and trying all orderings would take just a
> > little too much time ;-)
> 
> Oh, I wasn't talking about trying all permutations in recog_for_combine,
> but about generating recog code, for combine to use, that could
> recognize out-of-order top-level parallels, so that they could be
> reordered into something regular recog would recognize.  I suppose this
> might involve computing permutations within genrecog...

In most cases it won't have to try many permutations, but in some it
will have to I think :-(
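
For contrast, the dumb runtime version of that would be something like
the below (untested, just to check I follow; not what you are proposing):

  /* Retry recog with the first two elements of a PARALLEL swapped.  */
  static int
  recog_swapped_parallel (rtx pat, rtx_insn *insn, int *pnum_clobbers)
  {
    if (GET_CODE (pat) != PARALLEL || XVECLEN (pat, 0) < 2)
      return -1;

    rtx tmp = XVECEXP (pat, 0, 0);
    XVECEXP (pat, 0, 0) = XVECEXP (pat, 0, 1);
    XVECEXP (pat, 0, 1) = tmp;

    int icode = recog (pat, insn, pnum_clobbers);
    if (icode < 0)
      {
        /* Undo the swap if it did not help.  */
        XVECEXP (pat, 0, 1) = XVECEXP (pat, 0, 0);
        XVECEXP (pat, 0, 0) = tmp;
      }
    return icode;
  }

Doing it in genrecog instead means the generated code only ever looks at
orders that can still lead to a match, as I understand it.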

> Here's a hack
> just to get a clue about extra time.  On amd64, we go from:
> 
> Shared 39391 out of 75775 states by creating 10001 new states, saving 29390
> 
> to:
> 
> Shared 45474 out of 86045 states by creating 11586 new states, saving 33888

That is a lot extra; is it worth it?

> For comparison purposes, here's what we get permuting up to 8 items in
> a parallel (MAX_PERMUTE bumped up to 8), without involving trailing
> clobbers in the permutations:
> 
> Shared 66206 out of 127097 states by creating 16663 new states, saving 49543
> 
> While this enables permutations among trailing clobbers, also up to the
> 8th item in the parallel (PERMUTE_CLOBBERS 8):
> 
> Shared 71133 out of 140924 states by creating 17047 new states, saving 54086
> 
> While this adds partial removal of clobbers, leaving up to
> 8 of them for matching:
> 
> Shared 86720 out of 193474 states by creating 17483 new states, saving 69237
> 
> 
> What this patchlet does *not* do is to encode the permutations into
> e.g. num_clobbers, so that we could apply them afterwards.  With a
> 32-bit int, we could encode permutations of MAX_PERMUTE <= 8 assuming
> the original num_clobbers is always < 52, or <= 7 if num_clobbers could
> go up to 415.
> 
> 
> Not really tested, other than gathering the state counts above.  Does it
> look promising?

Yes.  But the huge numbers above look scary.  And amd64 isn't the worst
case I think :-(


Segher
