On Mon, Oct 8, 2012 at 12:01 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >> On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen <de...@google.com> wrote:
>> >> > Attached is the updated patch. Yes, if we add a VRP pass before the
>> >> > profile pass, this patch would be unnecessary. Should we add a VRP
>> >> > pass?
>> >>
>> >> No, we don't want VRP in early optimizations.
>> >
>> > I am not quite sure about that. VRP
>> > 1) makes branch prediction work better by doing jump threading early
>>
>> Well ... but jump threading may need basic-block duplication, which may
>> increase code size. Also, VRP and FRE have pass-ordering issues.
>>
>> > 2) is, after FRE, the most effective tree pass at removing code, by my
>> > profile statistics.
>>
>> We also don't have DSE in early opts. I don't want to end up in the
>> situation where we do everything in early opts ... we should do _less_
>> there (but eventually iterate properly when processing cycles).
>
> Yep, I am not quite sure about the most sane variant. Missed simple jump
> threading in early opts definitely confuses both the profile estimate and
> the inline size estimates. But I am also not thrilled by adding more
> passes to early opts at all. Also, last time I looked into this, CCP
> missed a lot of CCP opportunities, making VRP artificially look more
> useful.
Eh .. that shouldn't happen. Do you have testcases by any chance? I used
to duplicate each SSA propagator pass and check -fdump-statistics-stats to
verify that the 2nd pass does nothing (thus chaining CCP doesn't improve
results). But maybe that's not the issue you ran into here?

> Have a patch that improves profile updating a bit after jump threading
> (i.e. re-does the profile for simple cases), but jump threading is still
> the most common reason for the profile becoming inconsistent after
> expand.
>
> On a related note, with -fprofile-report I can easily track how much
> code each pass in the queue removed. I was thinking about running this
> on Mozilla at -O1 and removing those passes that did almost nothing.
> Those are mostly re-run passes, both at the GIMPLE and RTL levels. Our
> pass manager is not terribly friendly for controlling passes
> per-repetition.

Sure. You can also more thoroughly instrument passes and use
-fdump-statistics for that (I've done that), but we usually have
testcases that require that each pass that is still there is present ...

> With the introduction of the -Og pass queue, do you think introducing an
> -O1 pass queue for late tree passes (which would be quite short) is
> sane?

Yes. I don't like the dump-file naming mess that results, though, but if
we want to support the optimize attribute switching between -O1 and -O2
then I guess we have to live with that ...

Originally I wanted to base -Og on -O1 (thus have them mostly share the
pass queue) and retain the same pass queue for -O2 and -Os. Maybe that's
what we eventually want to do. Thus, add a (off for -Og) loop-optimizer
sub-pass to the queue and schedule some scalar cleanups after it but
inside it.

> What about the RTL level?

I guess we can split the queues for RTL optimizations, too.

> All optimization passes prior to register allocation are sort of
> optional, and I guess there are also -Og candidates.

Yes. Though I first wanted to see actual issues with the RTL optimizers
and -Og.
> I however find the 3-times-duplicated queues a bit uncool, too, but I
> guess it is most compatible with the PM organization.

Indeed ;) We should at least try to share the queues for -Og and -O1.

> At -O3 the most effective passes on combine.c are:
>
> cfg (because of cfg cleanup)  -1.5474%
> Early inlining                -0.4991%
> FRE                           -7.9369%
> VRP                           -0.9321% (if run early; CCP does -0.2273%)

I think VRP has the advantage of taking loop iteration counts into
account. Maybe we can add something similar to CCP.

It's sad that VRP is too expensive; it really is a form of CCP, so
merging both passes would be best (we can, at a single point
(add_equivalence), turn off equivalence processing - the most expensive
part of VRP - and call that CCP ...).

> tailr                         -0.5305%
>
> After IPA:
> copyrename                    -2.2850% (it packs cleanups after inlining)
> forwprop                      -0.5432%
> VRP                           -0.9700% (if rerun after early passes,
>                                         otherwise it is about 2%)
> PRE                           -2.4123%
> DOM                           -0.5182%
>
> RTL passes:
> into_cfglayout                -3.1400% (i.e. the first cleanup_cfg)
> fwprop1                       -3.0467%
> cprop                         -2.7786%
> combine                       -3.3346%
> IRA                           -3.4912% (i.e. the cost model prefers
>                                         hard regs)
> bbro                          -0.9765%
>
> The numbers on tramp3d and the LTO cc1 binary are not that different.

Yes.

Richard.

> Honza