Re: [RFC] PR tree-optimization/67628: Make tree ifcombine more symmetric and interactions with dom

Kyrill Tkachov Fri, 25 Sep 2015 05:22:21 -0700


On 25/09/15 12:35, Richard Biener wrote:

On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

On 25/09/15 11:15, Richard Biener wrote:

On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

On 25/09/15 10:49, Richard Biener wrote:

On Fri, 25 Sep 2015, Kyrill Tkachov wrote:

Hi all,

On 23/09/15 11:10, Richard Biener wrote:

On Wed, 23 Sep 2015, Kyrill Tkachov wrote:

On 23/09/15 10:09, Pinski, Andrew wrote:

On Sep 23, 2015, at 1:59 AM, Kyrill Tkachov
<kyrylo.tkac...@arm.com>
wrote:

On 22/09/15 20:31, Jeff Law wrote:

On 09/22/2015 07:36 AM, Kyrill Tkachov wrote:
Hi all,
Unfortunately, I see a testsuite regression with this
patch:
FAIL: gcc.dg/pr66299-2.c scan-tree-dump-not optimized "<<"

The reduced part of that test is:
void
test1 (int x, unsigned u)
{
        if ((1U << x) != 64
            || (2 << x) != u
            || (x << x) != 384
            || (3 << x) == 9
            || (x << 14) != 98304U
            || (1 << x) == 14
            || (3 << 2) != 12)
          __builtin_abort ();
}

The patched ifcombine pass works more or less as expected
and
produces
fewer basic blocks.
Before this patch a relevant part of the ifcombine dump
for
test1
is:
;;   basic block 2, loop depth 0, count 0, freq 10000,
maybe
hot
        if (x_1(D) != 6)
          goto <bb 6>;
        else
          goto <bb 3>;

;;   basic block 3, loop depth 0, count 0, freq 9996,
maybe
hot
        _2 = 2 << x_1(D);
        _3 = (unsigned intD.10) _2;
        if (_3 != u_4(D))
          goto <bb 6>;
        else
          goto <bb 4>;


After this patch it is:
;;   basic block 2, loop depth 0, count 0, freq 10000,
maybe
hot
        _2 = 2 << x_1(D);
        _3 = (unsigned intD.10) _2;
        _9 = _3 != u_4(D);
        _10 = x_1(D) != 6;
        _11 = _9 | _10;
        if (_11 != 0)
          goto <bb 5>;
        else
          goto <bb 3>;

The second form ends up generating worse codegen however,
and
the
badness starts with the dom1 pass.
In the unpatched case it manages to deduce that x must be
6 by
the
time
it reaches basic block 3 and
uses that information to eliminate the shift in "_2 = 2 <<
x_1(D)"
from
basic block 3
In the patched case it is unable to make that call, I
think
because
the
x != 6 condition is IORed
with another test.

I'm not familiar with the internals of the dom pass, so
I'm
not
sure
where to go looking for a fix for this.
Is the ifcombine change a step in the right direction? If
so,
what
would
need to be done to fix the issue with
the dom pass?

I don't see how you can reasonably fix this in DOM.  if _9
or
_10 is
true, then _11 is true.  But we can't reasonably record any
kind
of
equivalence for _9 or _10 individually.

If the statement
_11 = _9 | _10;

Were changed to

_11 = _9 & _10;

Then we could record something useful about _9 and _10.

I suppose what we want is to not combine basic blocks if
the
sequence
and conditions of the basic blocks are
such that dom can potentially exploit them, but how do we
express
that?

I don't think there's going to be a way to directly express
that.
You
could essentially claim that TRUTH_OR is more expensive than
TRUTH_AND
because of the impact on DOM, but that in and of itself may
not
resolve
the situation either.

I think the question we need to answer is whether or not
your
changes
are generally better, even if there's specific instances
where
they
make
things worse.  If the benefits outweigh the negatives then
we
can
xfail
that test.

Ok, I'll investigate and benchmark some more.
Andrew, this transformation to ifcombine (together with the
restriction
that the inner condition block
has to be a single comparison) was added by you with r204194.
Is there a particular reason for that restriction and why it
is
applied to
the inner block and not either?

My reasoning at the time was there might be an "expensive"
instruction
or
one that might trap (I did not check to see if the other part of
the
code
was detecting that).
The outer block did not need any checks as we have something
like
...
If (a)
       If (b)

Or
....
If (a)
       Goto f
else if (b)
      ....
Else
{
F:
....
}

And there was no need to check what was before the if (a) part
just
what
is
in between the two ifs.

Ah, because the code in outer_cond_bb would have to be executed
anyway
whether
we perform the conversion or not, right?

All ifcombine transforms make the outer condition unconditionally
true/false thus the check should have been on whether the outer
cond BB is "empty".  Which would solve your problem, right?

Note that other transforms (bit test recognition) don't care (sth
we might want to fix?).

In general this needs a better cost function, maybe simply use
estimate_num_insns with speed estimates and compare against a
new --param.

So I've played around with that code and I think I'd like to make it a
bit
more intricate. Just comparing against estimate_num_insns is too
coarse-grained and
is just a comparison by a magic number number and I've been struggling
to
make
this
aggressive enough without pulling in too much code into the
unconditional
path.

As far as aarch64 is concerned I want to make this transformation more
aggressive when
it can facilitate conditional comparison generation during expand.
This
means
that I want
to allow the cases where the inner block contains comparisons,
combined
using
logical operations
like TRUTH_AND_EXPR, TRUTH_IOR_EXPR, TRUTH_NOT_EXPR, their bitwise
variants
etc.
The expand code in ccmp.c knows how to handle these chains of
comparisons
and
emit the appropriate
conditional compare patterns. If, however, the inner block contains
other
types of operations like
arithmetic ops, pointer dereferencing etc I'd want to be conservative
to
avoid
pulling in operations
that don't facilitate the ccmp.c expansions.

So the inner block only contains stmts feeding a condition?  As we're
combining that condition with the outer one and the result simplified(?)
it makes sense to allow that as "empty" block generally.

My concern is that those stmts feeding the condition will
now be executed unconditionally (rather than only after jumping
to the inner_cond_bb) so we might end up doing redundant work
if the the outer condition would actually have jumped to the else block
rather than to the inner one.

That's true.  But how is arm not affected by this?  That is,
what's so special about ccmps here?

arm doesn't implement the TARGET_GEN_CCMP_FIRST hook and thus does not
make use of the ccmp.c code that transforms ANDs and IORs of comparisons into
ccmp chains.
  In principle it could (since it can just use normal conditional execution)
but we'd need to implement that hook for the arm backend, which is a separate
task.

If/when we implement the ccmp hooks for arm it should automatically benefit
from a more
aggressive ifcombine with this scheme but I'm not entirely confident that
being aggressive
here for targets that cannot do ccmps is a good idea.

I'd hate to put this kind of target dependence into GIMPLE optimizer
behavior this early in the pipeline...


I see what you mean. I'll try instead being unconditionally
more aggressive with the AND/IOR/compare blocks and conservative
with the other types of blocks and see if that causes regressions
on a non-ccmp target. Maybe we can get a way with it...

So what I'm proposing is:
- If a target supports conditional comparisons through
TARGET_GEN_CCMP_FIRST
(currently only aarch64) then we allow the aforementioned
comparisons+logical-combinations blocks.
If the block also contains other types of operations we apply the
estimate_num_insns cost comparison
with the default value for the comparison picked to be such so that it
changes
codegen the least from
the current situation i.e. one instruction. This value will be a new
param
that targets can increase
if they want to.

- If the target does not support conditional comparisons we follow
only
the
second scheme (the conservative
estimate_num_insns comparison). This should cause the minimal codegen
difference for the targets that
don't support conditional compares (which is all targets but aarch64)
while
allowing them to scale
the aggressiveness of this transformation if their benchmarking shows
it
to be
beneficial.


    I believe such a scheme would avoid pulling in too much code that
doesn't
facilitate conditional compares generation into the unconditional path
and
would minimise
impact on existing targets that don't do conditional compares.

Would that be an acceptable plan for the ifcombine_ifandif
transformation
in
tree-ssa-ifcombine.c (pending benchmarking, of course)?

...but yes, that would be an acceptable plan.

Note that I think we run ifcombine a little early and should allow
some jump threading and phiopt to take place first.  I think
running it inbettween dce and forwprop that is followed by the 2nd phiopt
makes some sense (and has PHIOPT1 and DOM1 run first).

Maybe you can play with that idea a bit.


Will do.

Thanks,
Kyrill


Richard.

Re: [RFC] PR tree-optimization/67628: Make tree ifcombine more symmetric and interactions with dom

Reply via email to