On Fri, May 12, 2017 at 7:51 PM, Bernd Schmidt <bschm...@redhat.com> wrote:
> If you look at certain testcases like the one for PR78972, you'll find that
> the code generated by TER is maximally pessimal in terms of register
> pressure: we can generate a large number of intermediate results, and defer
> all the statements that use them up.
>
> Another observation one can make is that doing TER doesn't actually buy us
> anything for a large subset of the values it finds: only a handful of places
> in the expand phase actually make use of the information. In cases where we
> know we aren't going to be making use of it, we could move expressions
> freely without doing TER-style substitution.
>
> This patch uses the information collected by TER about the moveability of
> statements and performs a mini scheduling pass with the aim of reducing
> register pressure. The heuristic is fairly simple: something that consumes
> more values than it produces is preferred. This could be tuned further, but
> it already produces pretty good results: for the 78972 testcase, the stack
> size is reduced from 2448 bytes to 288, and for PR80283, the stackframe of
> 496 bytes vanishes with the pass enabled.
>
> In terms of benchmarks I've run SPEC a few times, and the last set of
> results showed not much of a change. Getting reproducible results has been
> tricky but all runs I made have been within 0%-1% improvement.
>
> In this patch, the changed behaviour is gated with a -fschedule-ter option
> which is off by default; with that default it bootstraps and tests without
> regressions. The compiler also bootstraps with the option enabled, but in
> that case there are some optimization issues. I'll address some of them
> with two followup patches; the remaining failures are:
>  * a handful of guality/PR43077.c failures
>    Debug insn generation is somewhat changed, and the peephole2 pass
>    drops one of them on the floor.
>  * three target/i386/bmi-* tests fail. These expect the combiner to
>    build certain instruction patterns, and that turns out to be a
>    little fragile. It would be nice to be able to use match.pd to
>    produce target-specific patterns during expand.
>
> Thoughts? Ok to apply?

I appreciate that you experimented with partially disabling TER.  Last year
I tried to work towards this in a more aggressive way:

https://gcc.gnu.org/ml/gcc-patches/2016-06/msg02062.html

That patch tried to preserve the scheduling effect of TER, because one
of the nice things to have on my list is a GIMPLE scheduling pass that
tries to reduce (SSA) register pressure and that can work with GIMPLE
data dependences.

One of the goals of the patch above was to actually _see_ the scheduling
effects in the IL.

So what I'd like to see is a simple single-BB scheduling pass right before
RTL expansion (so we get a dump file).  That can use your logic (and
"TERable" would be simply having single-uses).  The advantage of doing
this before RTL expansion is that coalescing can benefit from the scheduling
as well.
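
To make the idea concrete, here is a rough sketch of such a single-BB
scheduler on a toy IR rather than on GIMPLE.  Everything in it (Stmt,
count_uses, schedule_block, max_live) is hypothetical and not taken from
the patch.  It assumes pure statements with distinct operands, ignores
memory dependences and values live across the block, and reads "consumes
more values than it produces" as "number of operands whose last use this
is, minus one if the result is used later".

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy statement: its result is named by its index in the block, and
// `uses` lists the (distinct) earlier statements whose results it reads.
struct Stmt {
  std::vector<int> uses;
};

// Number of uses of each statement's result inside the block.
std::vector<int> count_uses(const std::vector<Stmt>& bb) {
  std::vector<int> n(bb.size(), 0);
  for (int i = 0; i < static_cast<int>(bb.size()); ++i)
    for (int u : bb[i].uses) {
      assert(u >= 0 && u < i);  // operands must be defined earlier
      ++n[u];
    }
  return n;
}

// Greedy list scheduling: among the ready statements, pick the one that
// kills the most operands relative to what it produces.  Ties keep the
// original order.
std::vector<int> schedule_block(const std::vector<Stmt>& bb) {
  const int n = static_cast<int>(bb.size());
  std::vector<int> remaining = count_uses(bb);
  std::vector<bool> done(n, false);
  std::vector<int> order;
  while (static_cast<int>(order.size()) < n) {
    int best = -1, best_score = 0;
    for (int i = 0; i < n; ++i) {
      if (done[i]) continue;
      bool ready = true;
      int killed = 0;
      for (int u : bb[i].uses) {
        if (!done[u]) { ready = false; break; }
        if (remaining[u] == 1) ++killed;  // this is the last use of u
      }
      if (!ready) continue;
      int produced = remaining[i] > 0 ? 1 : 0;
      int score = killed - produced;
      if (best == -1 || score > best_score) { best = i; best_score = score; }
    }
    done[best] = true;
    for (int u : bb[best].uses) --remaining[u];
    order.push_back(best);
  }
  return order;
}

// Peak number of simultaneously live values when executing `order`.
int max_live(const std::vector<Stmt>& bb, const std::vector<int>& order) {
  std::vector<int> remaining = count_uses(bb);
  int live = 0, peak = 0;
  for (int s : order) {
    for (int u : bb[s].uses)
      if (--remaining[u] == 0) --live;  // operand dies at its last use
    if (remaining[s] > 0) ++live;       // result is live until last use
    peak = std::max(peak, live);
  }
  return peak;
}
```

On a block of four independent loads followed by four stores that each
consume one load, the original order keeps four values live at once,
while the order picked by schedule_block interleaves each load with its
store and keeps only one live.  In this toy model "TERable" corresponds
to count_uses reporting exactly one use for a statement.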

Then simply disable TER for the decide_schedule_stmt () defs during
RTL expansion.

That means the effect of TER scheduling is not fully visible, but we're
a step closer.  It also means that some of the scheduling done by the
simple scheduler persists, because coalescing / TER was not going to
undo it anyway.

In the (very) distant future I'd like to perform (more) instruction selection
on GIMPLE so that all the benefits of TER are applied before RTL
expansion.

+      tree_code c = gimple_assign_rhs_code (use_stmt);
+      if (TREE_CODE_CLASS (c) != tcc_comparison
+         && c != FMA_EXPR
+         && c != SSA_NAME
+         && c != MEM_REF
+         && c != TARGET_MEM_REF
+         && def_c != VIEW_CONVERT_EXPR)

I think on some archs it was important to handle combining
POINTER_PLUS_EXPR with NEGATE_EXPR of the offset.
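
For context, a source-level illustration of where that pair comes from.
Subtracting an integer from a pointer is represented in GIMPLE as a
negation of the offset feeding a POINTER_PLUS_EXPR, so the two
statements can only be seen together at expansion if the negation is
substituted into its use.  The function name step_back is made up for
this example, and the GIMPLE in the comment is a rough rendering, not an
actual dump.

```cpp
#include <cassert>
#include <cstddef>

// Roughly, in GIMPLE:
//   _1 = -n;        (NEGATE_EXPR on the sizetype offset)
//   _2 = p + _1;    (POINTER_PLUS_EXPR)
const char *step_back(const char *p, std::size_t n) {
  return p - n;
}
```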

Anyway, the effects of TER and where it matters are hard to
see given its recursive nature (and the history of trying to
preserve expanding of "large" GENERIC trees ...).  One would
think combine should be able to handle all those cases
(for example the FMA_EXPR one from above), but it clearly
isn't able to (esp. in the case of forwarding memory references).

Richard.

>
> Bernd
