On Mon, Oct 8, 2012 at 12:01 PM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >> On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen <de...@google.com> wrote:
>> >> > Attached is the updated patch. Yes, if we add a VRP pass before the
>> >> > profile pass, this patch would be unnecessary. Should we add a VRP
>> >> > pass?
>> >>
>> >> No, we don't want VRP in early optimizations.
>> >
>> > I am not quite sure about that. VRP
>> > 1) makes branch prediction work better by doing jump threading early
>>
>> Well ... but jump threading may need basic-block duplication, which may
>> increase code size. Also, VRP and FRE have pass-ordering issues.
>>
>> > 2) is, after FRE, the most effective tree pass at removing code, by my
>> > profile statistics.
>>
>> We also don't have DSE in early opts. I don't want to end up in the
>> situation where we do everything in early opts ... we should do _less_
>> there (but eventually iterate properly when processing cycles).
>
> Yep, I am not quite sure about the most sane variant. Missed simple jump
> threading in early opts definitely confuses both the profile estimate and
> the inline size estimates. But I am also not thrilled by adding more
> passes to early opts at all. Also, last time I looked into this, CCP
> missed a lot of CCP opportunities, making VRP artificially look more
> useful.
Eh .. that shouldn't happen. Do you have testcases by any chance? I used
to duplicate each SSA propagator pass and check -fdump-statistics-stats to
verify that the 2nd pass does nothing (thus chaining CCP doesn't improve
results). But maybe that's not the issue you ran into here?

> Have a patch that improves profile updating a bit after jump threading
> (i.e. re-does the profile for simple cases), but jump threading is still
> the most common reason for the profile becoming inconsistent after
> expand.
>
> On a related note, with -fprofile-report I can easily track how much
> code each pass in the queue removed. I was thinking about running this
> on Mozilla at -O1 and removing those passes that did almost nothing.
> Those are mostly re-run passes, both at the GIMPLE and RTL levels. Our
> pass manager is not terribly friendly for controlling passes
> per-repetition.

Sure. You can also more thoroughly instrument passes and use
-fdump-statistics for that (I've done that), but we usually have
testcases that require that each pass that is still there is present ...

> With the introduction of the -Og pass queue, do you think introducing an
> -O1 pass queue for late tree passes (which would be quite short) is
> sane?

Yes. I don't like the dump-file naming mess that results, though, but if
we want to support the optimize attribute switching between -O1 and -O2
then I guess we have to live with that ...

Originally I wanted to base -Og on -O1 (thus have them mostly share the
pass queue) and retain the same pass queue for -O2 and -Os. Maybe that's
what we eventually want to do. Thus, add a (off for -Og) loop-optimizer
sub-pass to the queue and schedule some scalar cleanups after it but
inside it.

> What about the RTL level?

I guess we can split the queues for RTL optimizations, too.

> All optimization passes prior to register allocation are sort of
> optional, and I guess there are also -Og candidates.

Yes. Though I first wanted to see actual issues with the RTL optimizers
and -Og.
> I however find the 3-times-duplicated queues a bit uncool, too, but I
> guess it is most compatible with the PM organization.

Indeed ;) We should at least try to share the queues for -Og and -O1.

> At -O3 the most effective passes on combine.c are:
>
> cfg (because of cfg cleanup)  -1.5474%
> Early inlining                -0.4991%
> FRE                           -7.9369%
> VRP                           -0.9321% (if run early; CCP does -0.2273%)

I think VRP has the advantage of taking loop iteration counts into
account. Maybe we can add something similar to CCP.

It's sad that VRP is too expensive; it really is a form of CCP, so
merging both passes would be best (we can, at a single point
(add_equivalence), turn off equivalence processing - the most expensive
part of VRP - and call that CCP ...).

> tailr                         -0.5305%
>
> After IPA:
> copyrename                    -2.2850% (it packs cleanups after inlining)
> forwprop                      -0.5432%
> VRP                           -0.9700% (if rerun after early passes,
>                                         otherwise it is about 2%)
> PRE                           -2.4123%
> DOM                           -0.5182%
>
> RTL passes:
> into_cfglayout                -3.1400% (i.e. the first cleanup_cfg)
> fwprop1                       -3.0467%
> cprop                         -2.7786%
> combine                       -3.3346%
> IRA                           -3.4912% (i.e. the cost model prefers
>                                         hard regs)
> bbro                          -0.9765%
>
> The numbers on tramp3d and the LTO cc1 binary are not that different.

Yes.

Richard.

> Honza