Hi, Note that with -O1, we limit the number of iterations of DOM to 1 so that we can get resonable compile time performance. I was wondering what happens if we do the same with -O2 now that we have passes like copy-prop, VRP, and improved FRE as well as good and old CCP.
Numbers go first. All of the following numbers are obtained from compiling cc1-i files with -O2. version #iters #blocks %cov #thread ------------------------------------ v0 9097 514501 94.5% 9438 v1 8602 553685 95.5% 20868 v2 8609 510618 96.3% 23236 v3 8490 510734 97.2% 23262 v4 7193 188793 N/A 18450 Legend ------ v0 : Vanilla mainline with the second and third runs of DOM disabled. v1 : v0 with Jeff's patch in PR 19794 v2 : v1 with the following TCB-like tweak to the pass ordering NEXT_PASS (pass_ccp); NEXT_PASS (pass_copy_prop); NEXT_PASS (pass_fre); NEXT_PASS (pass_dce); NEXT_PASS (pass_vrp); NEXT_PASS (pass_merge_phi); NEXT_PASS (pass_dominator); v3 : v2 with my queued VRP improvements v4 : v3 with the number of DOM's iterations limited to 1 #iters : the number of times we iterate in the do-while loop in tree-ssa-dom.c:tree_ssa_dominator_optimize. #blocks : the number of blocks scanned by DOM. This is done by adding one line at the entrance of the aforementioned do-while loop. num_cfg_blocks += n_basic_blocks; %cov : the percentage of invocations of DOM that finished its work within 2 iterations. #threaded : the number of edges threaded Disclaimers ----------- o I have not measured compile time performance. o I have not done benchmark on generated code. o I have not tried any other pass ordering. o I have not considered other testcases or languages. Justifications for v0, ..., v4 and columns ------------------------------------------ v0 is because I wanted to stay simple and focus on the initial basic propagation. v1 is because Jeff will soon remove various restrictions on the jump threader that have been limiting jump threading. For v2, note that ccp, copy-prop, FRE, and VRP are all something that DOM does to some extent. I put my merge_phi right before DOM so that DOM can find more jump threading opportunities in its first iteration. #iter is mostly out of curiosity. Functions that we compile come in different size, so the number isn't that important. For the same reason, #blocks is more interesting. It's a rough measure of how much time DOM takes. Now %cov. Why 2 iterations? Well, DOM has a fixed point operation, meaning that the last iteration does not do anything. More specifically, the last iteration does not perform any jump threading. Whenever jump threading opportunities are taken, cleanup_tree_cfg called from tree_ssa_dominator_optimize always returns 1, triggering the next iteration. Observations ------------ Even with v0, the vast majority of functions require at most two iterations, so we are in diminishing returns territory to begin with. With v3, we get impressive 97.2%, meaning that 97.2% of time, DOM iterates at most two times, which in turn means that 97.2% of time, we don't take any jump threading opportunity in DOM's second iteration. Note that with v4, the number of blocks that DOM looks at goes down significantly. At the same time, the number of edges threaded also goes down but is still close to the level of v1. It's interesting that even if we let DOM iterate as many times as it likes to iterate, there is a sizable difference between v1 and v2 as far as the number of edges threaded. Certain things cannot be propagated using DOM. Future work ----------- o Address things listed in Disclaimers. o Submit my queued VRP patches as soon as mainline gets a bit more stable. o Speed up things like the incremental SSA update, and the propagator. o Try out some ideas to get more out of the first iteration of DOM. o Take advantage of value ranges in thread_across_edge. Kazu Hirata