On Sat, May 31, 2014 at 8:21 PM, Sandra Loosemore
<san...@codesourcery.com> wrote:
> On 05/20/2014 04:56 AM, Richard Biener wrote:
>>
>> On Tue, May 20, 2014 at 6:29 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>>>>
>>>> On 05/18/2014 08:45 PM, Sandra Loosemore wrote:
>>>>>
>>>>> On 05/18/2014 02:59 PM, Jan Hubicka wrote:
>>>>>>
>>>>>> For cases like local-statics-7 your approach can be "saved" by adding
>>>>>> simple IPA analysis
>>>>>> to look for static vars that are used only by one function and keeping
>>>>>> your DSE code active
>>>>>> for them, so we can still get rid of this special case assignments
>>>>>> during late compilation.
>>>>>> I am however not quite convinced it is worth the effort - do you have
>>>>>> some real world
>>>>>> cases where it helps?
>>>>>
>>>>>
>>>>> Um, the well-known benchmark.  ;-)
>>>>
>>>>
>>>> I looked at the generated code for this benchmark and see that your
>>>> patch is indeed not getting rid of all the pointless static
>>>> variables, while ours is, so this is quite disappointing.  I'm
>>>> thinking now that we will still need to retain our patch at least
>>>> locally, although we'd really like to get it on trunk if possible.
>>>
>>>
>>> Yours patch can really be improved by adding simple IPA analysis to work
>>> out what variables are refernced by one function only instead of using
>>> not-longer-that-relevant local static info.
>>> As last IPA pass for each static variable with no address taken, look at
>>> all
>>> references and see if they all come from one function or functions
>>> inlined to
>>> it.
>>>>
>>>>
>>>> Here is another testcase vaguely based on the same kind of
>>>> signal-processing algorithm as the benchmark, but coded without
>>>> reference to it.  I have arm-none-eabi builds around for both 4.9.0
>>>> with our remove_local_statics patch applied, and trunk with your
>>>> patch.  With -O2, our optimization produces 23 instructions and gets
>>>> rid of all 3 statics, yours produces 33 and only gets rid of 1 of
>>>> them.
>>>>
>>>> Of course it's lame to use static variables in this way, but OTOH
>>>> it's lame if GCC can't recognize them and optimize them away, too.
>>>>
>>>> -Sandra
>>>>
>>>
>>>> void
>>>> fir (int *coeffs, int coeff_length, int *input, int input_length, int
>>>> *output)
>>>> {
>>>>    static int *coeffp;
>>>>    static int *inputp;
>>>>    static int *outputp;
>>>
>>>
>>> Here inputp read is rmeoved only at 101.dceloop1 pass. That is really
>>> late.
>>> coeffp is removed late, too.
>>>
>>>>    int i, c, acc;
>>>>
>>>>    for (i = 0; i < input_length; i++)
>>>>      {
>>>>        coeffp = coeffs;
>>>>        inputp = input + i;
>>>>        outputp = output + i;
>>>>        acc = 0;
>>>>        for (c = 0; c < coeff_length; c++)
>>>>        {
>>>>          acc += *coeffp * *inputp;
>>>>          coeffp++;
>>>>          inputp--;
>>>>        }
>>>
>>>
>>> It seems like AA can not work out the fact that inputp is unchanged until
>>> that
>>> late.  Richi, any ideas?
>>
>>
>> Well, AA is perfectly fine - it's just that this boils down to a partial
>> redundancy problem.  The stores can be removed by DCE even with
>>
>> Index: gcc/tree-ssa-dce.c
>> ===================================================================
>> --- gcc/tree-ssa-dce.c  (revision 210635)
>> +++ gcc/tree-ssa-dce.c  (working copy)
>> @@ -278,10 +278,20 @@ mark_stmt_if_obviously_necessary (gimple
>>         break;
>>
>>       case GIMPLE_ASSIGN:
>> -      if (TREE_CODE (gimple_assign_lhs (stmt)) == SSA_NAME
>> -         && TREE_CLOBBER_P (gimple_assign_rhs1 (stmt)))
>> -       return;
>> -      break;
>> +      {
>> +       tree lhs = gimple_assign_lhs (stmt);
>> +       if (TREE_CODE (lhs) == SSA_NAME
>> +           && TREE_CLOBBER_P (gimple_assign_rhs1 (stmt)))
>> +         return;
>> +       if (TREE_CODE (lhs) == VAR_DECL
>> +           && !TREE_ADDRESSABLE (lhs)
>> +           && TREE_STATIC (lhs)
>> +           && !TREE_PUBLIC (lhs) && !DECL_EXTERNAL (lhs)
>> +           /* ???  Make sure lhs is only refered to from cfun->decl.  */
>> +           && decl_function_context (lhs) == cfun->decl)
>> +         return;
>> +       break;
>> +      }
>>
>>       default:
>>         break;
>>
>> but I don't have a good way of checking the ??? prerequesite
>> (even without taking its address the statics may be refered to
>> by a) inline copies, b) ipa-split function parts, c) nested functions).
>> I'm sure IPA references are not kept up-to-date.
>>
>> The last reads go away with PRE at the expense of two
>> additional induction variables.
>>
>> If we know that a static variable does not have its address taken
>> and is only refered to from a single function we can in theory
>> simply rewrite it into SSA form during early opts (before inlining
>> its body), for example by SRA.  That isn't even restricted to
>> function-local statics (for statics used in multiple functions but
>> not having address-taken we can do the same with doing
>> function entry / exit load / store).  I think that would be a much
>> better IPA enabled optimization than playing games with
>> looking at individual stmts.
>
>
> FAOD, I'm not currently planning to work on this any more myself.  I
> understand Bernd's patch pretty well and could make some minor changes to
> clean it up if that's what it takes to get it approved, but it sounds like
> what's wanted is a completely different approach, again, and I don't have
> any confidence that I could implement something that wouldn't just get stuck
> in another round of "why don't you try X instead" or "maybe it would be
> better to do this in Y".  :-(
>
> Richard, FWIW I think both of the objections you raised to Bernd's patch
> last year have been resolved now.
>
> https://gcc.gnu.org/ml/gcc-patches/2013-06/msg00322.html
>
> We know that an IPA-only implementation (Jan's patch) isn't sufficient to
> catch the important cases, and the Bernd previously answered the question
> about inlining here:
>
> https://gcc.gnu.org/ml/gcc-patches/2013-06/msg00327.html

Well, I'm hesitant to add a new pass just to optimize a (irrelevant in practice)
benchmark.  I'm ok with strengthening existing infrastructure or enhancing
existing passes.

The issue with these mini-passes is that they are very placement sensitive
and you don't easily get secondary effects.  See the comment (and
"patch") about DCE - the ??? comment can be addressed the same
way Bernd addressed it (use cgraph_function_possibly_inlined_p).
Would that optimize the testcase?

Note that the issue with nested function use prevails (not sure how to
solve that - same issue in Bernds patch).

Thanks,
Richard.

> -Sandra
>

Reply via email to