On Sat, May 31, 2014 at 8:21 PM, Sandra Loosemore <san...@codesourcery.com> wrote: > On 05/20/2014 04:56 AM, Richard Biener wrote: >> >> On Tue, May 20, 2014 at 6:29 AM, Jan Hubicka <hubi...@ucw.cz> wrote: >>>> >>>> On 05/18/2014 08:45 PM, Sandra Loosemore wrote: >>>>> >>>>> On 05/18/2014 02:59 PM, Jan Hubicka wrote: >>>>>> >>>>>> For cases like local-statics-7 your approach can be "saved" by adding >>>>>> simple IPA analysis >>>>>> to look for static vars that are used only by one function and keeping >>>>>> your DSE code active >>>>>> for them, so we can still get rid of this special case assignments >>>>>> during late compilation. >>>>>> I am however not quite convinced it is worth the effort - do you have >>>>>> some real world >>>>>> cases where it helps? >>>>> >>>>> >>>>> Um, the well-known benchmark. ;-) >>>> >>>> >>>> I looked at the generated code for this benchmark and see that your >>>> patch is indeed not getting rid of all the pointless static >>>> variables, while ours is, so this is quite disappointing. I'm >>>> thinking now that we will still need to retain our patch at least >>>> locally, although we'd really like to get it on trunk if possible. >>> >>> >>> Yours patch can really be improved by adding simple IPA analysis to work >>> out what variables are refernced by one function only instead of using >>> not-longer-that-relevant local static info. >>> As last IPA pass for each static variable with no address taken, look at >>> all >>> references and see if they all come from one function or functions >>> inlined to >>> it. >>>> >>>> >>>> Here is another testcase vaguely based on the same kind of >>>> signal-processing algorithm as the benchmark, but coded without >>>> reference to it. I have arm-none-eabi builds around for both 4.9.0 >>>> with our remove_local_statics patch applied, and trunk with your >>>> patch. With -O2, our optimization produces 23 instructions and gets >>>> rid of all 3 statics, yours produces 33 and only gets rid of 1 of >>>> them. >>>> >>>> Of course it's lame to use static variables in this way, but OTOH >>>> it's lame if GCC can't recognize them and optimize them away, too. >>>> >>>> -Sandra >>>> >>> >>>> void >>>> fir (int *coeffs, int coeff_length, int *input, int input_length, int >>>> *output) >>>> { >>>> static int *coeffp; >>>> static int *inputp; >>>> static int *outputp; >>> >>> >>> Here inputp read is rmeoved only at 101.dceloop1 pass. That is really >>> late. >>> coeffp is removed late, too. >>> >>>> int i, c, acc; >>>> >>>> for (i = 0; i < input_length; i++) >>>> { >>>> coeffp = coeffs; >>>> inputp = input + i; >>>> outputp = output + i; >>>> acc = 0; >>>> for (c = 0; c < coeff_length; c++) >>>> { >>>> acc += *coeffp * *inputp; >>>> coeffp++; >>>> inputp--; >>>> } >>> >>> >>> It seems like AA can not work out the fact that inputp is unchanged until >>> that >>> late. Richi, any ideas? >> >> >> Well, AA is perfectly fine - it's just that this boils down to a partial >> redundancy problem. The stores can be removed by DCE even with >> >> Index: gcc/tree-ssa-dce.c >> =================================================================== >> --- gcc/tree-ssa-dce.c (revision 210635) >> +++ gcc/tree-ssa-dce.c (working copy) >> @@ -278,10 +278,20 @@ mark_stmt_if_obviously_necessary (gimple >> break; >> >> case GIMPLE_ASSIGN: >> - if (TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME >> - && TREE_CLOBBER_P (gimple_assign_rhs1 (stmt))) >> - return; >> - break; >> + { >> + tree lhs = gimple_assign_lhs (stmt); >> + if (TREE_CODE (lhs) == SSA_NAME >> + && TREE_CLOBBER_P (gimple_assign_rhs1 (stmt))) >> + return; >> + if (TREE_CODE (lhs) == VAR_DECL >> + && !TREE_ADDRESSABLE (lhs) >> + && TREE_STATIC (lhs) >> + && !TREE_PUBLIC (lhs) && !DECL_EXTERNAL (lhs) >> + /* ??? Make sure lhs is only refered to from cfun->decl. */ >> + && decl_function_context (lhs) == cfun->decl) >> + return; >> + break; >> + } >> >> default: >> break; >> >> but I don't have a good way of checking the ??? prerequesite >> (even without taking its address the statics may be refered to >> by a) inline copies, b) ipa-split function parts, c) nested functions). >> I'm sure IPA references are not kept up-to-date. >> >> The last reads go away with PRE at the expense of two >> additional induction variables. >> >> If we know that a static variable does not have its address taken >> and is only refered to from a single function we can in theory >> simply rewrite it into SSA form during early opts (before inlining >> its body), for example by SRA. That isn't even restricted to >> function-local statics (for statics used in multiple functions but >> not having address-taken we can do the same with doing >> function entry / exit load / store). I think that would be a much >> better IPA enabled optimization than playing games with >> looking at individual stmts. > > > FAOD, I'm not currently planning to work on this any more myself. I > understand Bernd's patch pretty well and could make some minor changes to > clean it up if that's what it takes to get it approved, but it sounds like > what's wanted is a completely different approach, again, and I don't have > any confidence that I could implement something that wouldn't just get stuck > in another round of "why don't you try X instead" or "maybe it would be > better to do this in Y". :-( > > Richard, FWIW I think both of the objections you raised to Bernd's patch > last year have been resolved now. > > https://gcc.gnu.org/ml/gcc-patches/2013-06/msg00322.html > > We know that an IPA-only implementation (Jan's patch) isn't sufficient to > catch the important cases, and the Bernd previously answered the question > about inlining here: > > https://gcc.gnu.org/ml/gcc-patches/2013-06/msg00327.html
Well, I'm hesitant to add a new pass just to optimize a (irrelevant in practice) benchmark. I'm ok with strengthening existing infrastructure or enhancing existing passes. The issue with these mini-passes is that they are very placement sensitive and you don't easily get secondary effects. See the comment (and "patch") about DCE - the ??? comment can be addressed the same way Bernd addressed it (use cgraph_function_possibly_inlined_p). Would that optimize the testcase? Note that the issue with nested function use prevails (not sure how to solve that - same issue in Bernds patch). Thanks, Richard. > -Sandra >