On Fri, Jan 13, 2017 at 7:48 PM, Aldy Hernandez <al...@redhat.com> wrote: > [Sorry for the delay, I was sick.] > > > On 01/09/2017 04:30 AM, Richard Biener wrote: >> >> On Sat, Jan 7, 2017 at 1:54 PM, Aldy Hernandez <al...@redhat.com> wrote: >>> >>> On 01/04/2017 07:11 AM, Richard Biener wrote: >>>> >>>> >>>> On Tue, Jan 3, 2017 at 6:36 PM, Aldy Hernandez <al...@redhat.com> wrote: >>>>> >>>>> >>>>> On 12/20/2016 09:16 AM, Richard Biener wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Dec 16, 2016 at 3:41 PM, Aldy Hernandez <al...@redhat.com> >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi folks. >>>>>>> >>>>>>> This is a follow-up on Jeff and Richi's interaction on the >>>>>>> aforementioned >>>>>>> PR >>>>>>> here: >>>>>>> >>>>>>> https://gcc.gnu.org/ml/gcc-patches/2016-08/msg01397.html >>>>>>> >>>>>>> I decided to explore the idea of analyzing may-undefness on-demand, >>>>>>> which >>>>>>> actually looks rather cheap. >>>>>>> >>>>>>> BTW, I don't understand why we don't have auto_bitmap's, as we >>>>>>> already >>>>>>> have >>>>>>> auto_sbitmap's. I've implemented the former based on auto_sbitmap's >>>>>>> code >>>>>>> we >>>>>>> already have. >>>>>>> >>>>>>> The attached patch fixes the bug without introducing any regressions. >>>>>>> >>>>>>> I also tested the patch by compiling 242 .ii files with -O3. These >>>>>>> were >>>>>>> gathered from a stage1 build with -save-temps. There is a slight >>>>>>> time >>>>>>> degradation of 4 seconds within 27 minutes of user time: >>>>>>> >>>>>>> tainted: 26:52 >>>>>>> orig: 26:48 >>>>>>> >>>>>>> This was the average aggregate time of two runs compiling all 242 .ii >>>>>>> files. >>>>>>> IMO, this looks reasonable. It is after all, -O3. Is it >>>>>>> acceptable? >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> + while (!worklist.is_empty ()) >>>>>> + { >>>>>> + name = worklist.pop (); >>>>>> + gcc_assert (TREE_CODE (name) == SSA_NAME); >>>>>> + >>>>>> + if (ssa_undefined_value_p (name, true)) >>>>>> + return true; >>>>>> + >>>>>> + bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name)); >>>>>> >>>>>> it should be already set as we use visited_ssa as "was it ever on the >>>>>> worklist", >>>>>> so maybe renaming it would be a good thing as well. >>>>> >>>>> >>>>> >>>>> >>>>> I don't understand what you would prefer here. >>>> >>>> >>>> >>>> Set the bit when you put name on the worklist (and do not do that if the >>>> bit is set). Thus simply remove the above and add a bitmap_set_bit >>>> for the initial name you put on the worklist. >>>> >>>>>> >>>>>> + if (TREE_CODE (name) == SSA_NAME) >>>>>> + { >>>>>> + /* If an SSA has already been seen, this may be a >>>>>> loop. >>>>>> + Fail conservatively. */ >>>>>> + if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION >>>>>> (name))) >>>>>> + return false; >>>>>> >>>>>> so to me "conservative" is returning true, not false. >>>>> >>>>> >>>>> >>>>> >>>>> OK >>>>> >>>>>> >>>>>> + bitmap_set_bit (visited_ssa, SSA_NAME_VERSION >>>>>> (name)); >>>>>> + worklist.safe_push (name); >>>>>> >>>>>> but for loops we can just continue and ignore this use. And >>>>>> bitmap_set_bit >>>>>> returns whether it set a bit, thus >>>>>> >>>>>> if (bitmap_set_bit (visited_ssa, SSA_NAME_VERSION >>>>>> (name))) >>>>>> worklist.safe_push (name); >>>>>> >>>>>> should work? >>>>> >>>>> >>>>> >>>>> >>>>> Fixed. >>>>> >>>>>> >>>>>> + /* Check that any SSA names used to define NAME is also fully >>>>>> + defined. */ >>>>>> + use_operand_p use_p; >>>>>> + ssa_op_iter iter; >>>>>> + FOR_EACH_SSA_USE_OPERAND (use_p, def, iter, SSA_OP_USE) >>>>>> + { >>>>>> + name = USE_FROM_PTR (use_p); >>>>>> + if (TREE_CODE (name) == SSA_NAME) >>>>>> >>>>>> always true. >>>>>> >>>>>> + { >>>>>> + /* If an SSA has already been seen, this may be a loop. >>>>>> + Fail conservatively. */ >>>>>> + if (bitmap_bit_p (visited_ssa, SSA_NAME_VERSION (name))) >>>>>> + return false; >>>>>> + bitmap_set_bit (visited_ssa, SSA_NAME_VERSION (name)); >>>>>> + worklist.safe_push (name); >>>>>> >>>>>> See above. >>>>>> >>>>>> In principle the thing is sound but I'd like to be able to pass in a >>>>>> bitmap of >>>>>> known maybe-undefined/must-defined SSA names to have a cache for >>>>>> multiple invocations from within a pass (like this unswitching case). >>>>> >>>>> >>>>> >>>>> >>>>> Done, though bitmaps are now calculated as part of the instantiation. >>>>> >>>>>> >>>>>> Also once you hit defs that are in a post-dominated region of the loop >>>>>> entry >>>>>> you can treat them as not undefined (as their use invokes undefined >>>>>> behavior). This is also how you treat function parameters (well, >>>>>> ssa_undefined_value_p does), where the call site invokes undefined >>>>>> behavior >>>>>> when passing in undefined values. So we need an extra parameter >>>>>> specifying >>>>>> the post-dominance region. >>>>> >>>>> >>>>> >>>>> >>>>> Done. >>>>> >>>>>> >>>>>> You do not handle memory or calls conservatively which means the >>>>>> existing >>>>>> testcase only needs some obfuscation to become a problem again. To >>>>>> fix >>>>>> that before /* Check that any SSA names used to define NAME is also >>>>>> fully >>>>>> defined. */ bail out conservatively, like >>>>>> >>>>>> if (! is_gimple_assign (def) >>>>>> || gimple_assign_single_p (def)) >>>>>> return true; >>>>> >>>>> >>>>> >>>>> >>>>> As I asked previously, I understand the !is_gimple_assign, which will >>>>> skip >>>>> over GIMPLE_CALLs setting a value, but the "gimple_assign_single_p(def) >>>>> == >>>>> true"?? >>>>> >>>>> Won't this last one bail on things like e.3_7 = e, or x_77 = y_88? >>>>> Don't >>>>> we >>>>> want to follow up the def chain precisely on these? >>>>> >>>>> The attached implementation uses a cache, and a pre-computed >>>>> post-dominance >>>>> region. A previous incantation of this patch (sans the post-dominance >>>>> stuff) had performance within the noise of the previous implementation. >>>>> >>>>> I am testing again, and will do some performance comparisons in a bit, >>>>> but >>>>> for now-- are we on the same page here? Is this what you wanted? >>>> >>>> >>>> >>>> + /* DEFs in the post-dominated region of the loop entry invoke >>>> + undefined behavior. Adding any use won't make things any >>>> + worse. */ >>>> + for (unsigned i = 0; i < postdom.length (); ++i) >>>> + if (def->bb == postdom[i]) >>>> >>>> gimple_bb (def) >>>> >>>> + { >>>> + set_defined (t); >>>> + return false; >>>> + } >>>> >>>> I think you can't really return false here but need to continue >>>> processing >>>> the worklist. I'd say rather than a linear walk you can as well use >>>> dominated_by_p (CDI_POST_DOMINATORS, ...) and record the entry >>>> block instead? >>>> >>>> Note that the way you compute post-dominators doesn't work -- nothing >>>> keeps them up-to-date when unswitching does CFG manipulations :/ >>>> Fortunately we have a way to compute them per region, see how >>>> tree-if-conv.c >>>> does that (build_region, calculate_dominance_info_for_region). >>> >>> >>> >>> If I understand correctly, we could compute them per region in >>> tree_unswitch_single_loop() for each individual loop with what >>> tree-if-conv.c uses. >>> >>> I could certainly abstract >>> build_region/calculate_dominance_info_for_region >>> into something new, say calculate_dominance_info_for_loop_region(). Then >>> we >>> could use dominated_by_p(CDI_POST_DOMINATORS, ...) in my class as you >>> suggest. >>> >>> Question... computing the post-dom tree per region as I've just described >>> will only create post-dom information for the loop region, which means >>> that >>> any definition outside of the loop will have zero dominance info. >> >> >> Yes. >> >>> What if the definition in question post-dominates the loop entry but is >>> outside/after of the loop? >> >> >> How would we ever arrive at such def? Once we reach a def before the >> region/loop >> we know it's fine to use (the missing "stop" point I pointed out). > > > Hmmm, it looks like we can't even build the post-dom region appropriately in > the presence of infinite loops. > > First, build_region() fails if it can't find a loop post-header: > > /* The last element is loop post-header. */ > gcc_assert (exit_bb); > > which we won't have in an infinite loop like: > > bb2: //loop header > | > V > loop 1: > bb3: <-------+ > bar(); | > | | > V | > bb4: | > goto bb3; >---+ > > > We could just build the region without the post-header, but then we fail > while building the DFS dominance region: > > dom_info::calc_dfs_tree_nonrec (basic_block bb) > .... > gcc_assert (bn != m_start_block); > > Presumably because we've looped back to the start block (bb4 in a post > dominator world). This doesn't happen calculating going forward because we > always have a start block outside of the region (the function entry block). > > I tried to fake edge my way out of it, but things get really messed up while > putting/removing fake edges in the presence of loop info. I'm assuming all > this is by design.
Heh. Infinite loops are indeed fun, and yes, the fake edges confuse some of the loop stuff. > Would you prefer me to continue down this path, trying to build a post-dom > region with infinite loops and fixing build_region / calc_dfs_tree_nonrec, > or could we do without this dominance optimization? Pretty please? It will pessimize some cases though, but see below. >> >>> We will have no post-dom information for this. >>> In this case, could we just ignore and continue evaluating the SSA_NAME >>> with >>> the rest of our heuristics, or did you have something else in mind? >>> Another >>> option would be to calculate the post-dom information for the entire >>> function on every loop (calculate_dominance_info(CDI_POST_DOMINATORS)), >>> but >>> that seems rather expensive. >>> >>> As a side note, at this point I just want to fix/close the PR in an >>> acceptable manner, not come up with the end-all catch-all most-efficient >>> patch for an unlikely scenario ;-). >> >> >> Yeah, which is why I wonder whether the caching is worth the trouble (in >> it's >> current form) for the unswitching use (given it's other restrictions >> on conditions >> to unswitch). > > > We could go back to my original, no caching version (with the other > suggestions). That solves the problem quite simply, with no regressions. So let's go with a unswitching-local solution then. Based on tree_may_unswitch_on: /* Condition must be invariant. */ FOR_EACH_SSA_TREE_OPERAND (use, stmt, iter, SSA_OP_USE) { /* Unswitching on undefined values would introduce undefined behavior that the original program might never exercise. */ if (ssa_undefined_value_p (use, true)) return NULL_TREE; def = SSA_NAME_DEF_STMT (use); def_bb = gimple_bb (def); if (def_bb && flow_bb_inside_loop_p (loop, def_bb)) return NULL_TREE; we only have to look for uses in blocks dominating the loop header block (or in blocks post-dominating that loop header, but we can probably implement that by simply including the loop header itself with a FIXME comment). Note that we only need to know whether a BB will be always executed when the loop is entered -- there's "infrastructure" elsewhere that computes this w/o the need of post-dominance. For example see fill_always_executed_in_1 tree-ssa-loop-im.c (we can't use that 1:1 I think because we already use ->aux via the original copy tables, but we could simplify it as we're only interested in the loop which preheader we put the unswitch condition on so we can use a bitmap to record whether a block of the loop is always executed or not). Can you send a patch that does this maybe? Thanks, Richard. > Thanks again. > Aldy