On Thu, Feb 4, 2016 at 9:10 PM, Carlos Pita <carlosjosep...@gmail.com> wrote:
> PS 2 (last one, I swear): I've isolated what I think is the root of
> the problem. When einline expands g, there is plenty of call sites for
> f.call, so the full redundancy elimination pass replaces sum for
> f.call, making things easy for the late ipa inliner. But when g is not
> early inlined, there is only one call site for the global f.call and
> another one for the local/literal f.call, so the fre pass just lets
> them be. This is innocuous from the fre point of view, but disables
> further inlining as described above.
>
> On Thu, Feb 4, 2016 at 3:05 PM, Carlos Pita <carlosjosep...@gmail.com> wrote:
>> PS: I framed the issue between the inline_param and fixup_cfg passes
>> because I was only looking at the tree passes, but the really relevant
>> passes are tree-einline (obviously) and ipa-inline, which happens
>> between tree-inline_param2 and tree-fixup_cfg2. So, restating the
>> problem: if early inline is not happening, late inline will miss the
>> chance to inline the reference from the static const struct member.
>>
>> On Thu, Feb 4, 2016 at 1:08 PM, Carlos Pita <carlosjosep...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I've been trying to understand some bizarre interaction between
>>> optimizing passes I've observed while compiling a heavily nested
>>> inlined numerical code of mine. I managed to reduce the issue down to
>>> this simple code:
>>>
>>> ``` test.c
>>>
>>> typedef struct F {
>>>   int (*call)(int);
>>> } F;
>>>
>>> static int g(F f, int x) {
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   x = f.call(x);
>>>   return x;
>>> }
>>>
>>> static int sq(int x) {
>>>   return x * x;
>>> }
>>>
>>> static const F f = {sq};
>>>
>>> void dosomething(int);
>>>
>>> int h(int x) {
>>>   dosomething(g(f, x));
>>>   dosomething(g((F){sq}, x));
>>> }
>>>
>>> ```
>>>
>>> Here we have a driver function h calling the workhorse g which
>>> delegates some simple task to the inline-wannabe f. The distinctive
>>> aspect of the above scheme is that f is referenced from a struct
>>> member. The first call to g passes a static const struct while the
>>> second call passes a compound literal (alternatively, a local version
>>> of the struct will have the same effect regarding what follows).
>>>
>>> Now, say I compile this code with:
>>>
>>> gcc -O3 -fdump-tree-all --param early-inlining-insns=10 -c test.c
>>>
>>> The einline pass will not be able to inline calls to g with such a low
>>> value for early-inlining-insns.
>>>
>>> The inline_param2 pass still shows:
>>>
>>> ```
>>> h (int x)
>>> {
>>>   struct F D.1847;
>>>   int _4;
>>>   int _8;
>>>
>>>   <bb 2>:
>>>   _4 = g (f, x_2(D));
>>>   dosomething (_4);
>>>   D.1847.call = sq;
>>>   _8 = g (D.1847, x_2(D));
>>>   dosomething (_8);
>>>   return;
>>>
>>> }
>>>
>>> ```
>>>
>>> The next tree pass is fixup_cfg4, which does the inline but just for
>>> the second all to g:
>>>
>>> ```
>>> h (int x)
>>> {
>>>   ....
>>>
>>>   <bb 2>:
>>>   f = f;
>>>   f$call_7 = MEM[(struct F *)&f];
>>>   x_19 = f$call_7 (x_2(D));
>>>   x_20 = f$call_7 (x_19);
>>>   x_21 = f$call_7 (x_20);
>>>   x_22 = f$call_7 (x_21);
>>>   x_23 = f$call_7 (x_22);
>>>   x_24 = f$call_7 (x_23);
>>>   x_25 = f$call_7 (x_24);
>>>   x_26 = f$call_7 (x_25);
>>>   _43 = x_26;
>>>   _4 = _43;
>>>   dosomething (_4);
>>>   D.1847.call = sq;
>>>   f = D.1847;
>>>   f$call_10 = MEM[(struct F *)&f];
>>>   _33 = x_2(D) * x_2(D);
>>>   _45 = _33;
>>>   x_11 = _45;
>>>   _32 = x_11 * x_11;
>>>   _46 = _32;
>>>   x_12 = _46;
>>>   _31 = x_12 * x_12;
>>>   _47 = _31;
>>>   x_13 = _47;
>>>   _30 = x_13 * x_13;
>>>   _48 = _30;
>>>   x_14 = _48;
>>>   _29 = x_14 * x_14;
>>>   _49 = _29;
>>>   x_15 = _49;
>>>   _28 = x_15 * x_15;
>>>   _50 = _28;
>>>   x_16 = _50;
>>>   _27 = x_16 * x_16;
>>>   _51 = _27;
>>>   x_17 = _51;
>>>   _3 = x_17 * x_17;
>>>   _52 = _3;
>>>   x_18 = _52;
>>>   _53 = x_18;
>>>   _8 = _53;
>>>   dosomething (_8);
>>>   return;
>>>
>>> }
>>> ```
>>>
>>> Now, say I recompile the code with a larger early-inlining-insns, so
>>> that einline is able to early inline both calls to g:
>>>
>>> gcc -O3 -fdump-tree-all --param early-inlining-insns=50 -c test.c
>>>
>>> After inline_param2 (that is, before fixup_cfg4), we now have:
>>>
>>> ```
>>> h (int x)
>>> {
>>>   int x;
>>>   int x;
>>>
>>>   <bb 2>:
>>>   x_13 = sq (x_2(D));
>>>   x_14 = sq (x_13);
>>>   x_15 = sq (x_14);
>>>   x_16 = sq (x_15);
>>>   x_17 = sq (x_16);
>>>   x_18 = sq (x_17);
>>>   x_19 = sq (x_18);
>>>   x_20 = sq (x_19);
>>>   dosomething (x_20);
>>>   x_5 = sq (x_2(D));
>>>   x_6 = sq (x_5);
>>>   x_7 = sq (x_6);
>>>   x_8 = sq (x_7);
>>>   x_9 = sq (x_8);
>>>   x_10 = sq (x_9);
>>>   x_11 = sq (x_10);
>>>   x_12 = sq (x_11);
>>>   dosomething (x_12);
>>>   return;
>>>
>>> }
>>> ```
>>>
>>> And fixup_cfg4 is able to do its job for both calls:
>>>
>>> ```
>>> h (int x)
>>> {
>>>   ....
>>>
>>>   <bb 2>:
>>>   _36 = x_2(D) * x_2(D);
>>>   _37 = _36;
>>>   x_13 = _37;
>>>   _35 = x_13 * x_13;
>>>   _38 = _35;
>>>   x_14 = _38;
>>>   _34 = x_14 * x_14;
>>>   _39 = _34;
>>>   x_15 = _39;
>>>   _33 = x_15 * x_15;
>>>   _40 = _33;
>>>   x_16 = _40;
>>>   _32 = x_16 * x_16;
>>>   _41 = _32;
>>>   x_17 = _41;
>>>   _31 = x_17 * x_17;
>>>   _42 = _31;
>>>   x_18 = _42;
>>>   _30 = x_18 * x_18;
>>>   _43 = _30;
>>>   x_19 = _43;
>>>   _29 = x_19 * x_19;
>>>   _44 = _29;
>>>   x_20 = _44;
>>>   dosomething (x_20);
>>>   _28 = x_2(D) * x_2(D);
>>>   _45 = _28;
>>>   x_5 = _45;
>>>   _27 = x_5 * x_5;
>>>   _46 = _27;
>>>   x_6 = _46;
>>>   _26 = x_6 * x_6;
>>>   _47 = _26;
>>>   x_7 = _47;
>>>   _25 = x_7 * x_7;
>>>   _48 = _25;
>>>   x_8 = _48;
>>>   _24 = x_8 * x_8;
>>>   _49 = _24;
>>>   x_9 = _49;
>>>   _23 = x_9 * x_9;
>>>   _50 = _23;
>>>   x_10 = _50;
>>>   _22 = x_10 * x_10;
>>>   _51 = _22;
>>>   x_11 = _51;
>>>   _21 = x_11 * x_11;
>>>   _52 = _21;
>>>   x_12 = _52;
>>>   dosomething (x_12);
>>>   return;
>>>
>>> }
>>> ```
>>>
>>> The bottom line is that I get full inlining if einline manages to
>>> early inline both g calls, but I get incomplete inlining otherwise. I
>>> guess the problem is that fixup_cfg4 is not able to infer that
>>> f$call_7 is just sq in disguise when f is the global static const
>>> struct but it is able to get it when it's a local or literal one. In
>>> case einline expands the code early the successive passes will make
>>> fixup_cfg4 see just sq in both cases, making inlining of sq a trivial
>>> matter. But if einline hits its hard limits, fixup_cfg4 will have to
>>> figure out that f$call is sq by itself.
>>>
>>> I'm not sure whether this should be considered a proper bug or more of
>>> a quirk of the inlining system one must learn to live with. In the
>>> first case, I'll report it if you ask me to do it. In the second case,
>>> I would like to ask for some advice about the best way to cope with
>>> this scenario (besides blindly incrementing early-inlining-insns); I
>>> can provide more background regarding my real use case if necessary.

I think the real missed optimization is that for the passed aggregate D.1772

 D.1772.call = sq;
  _8 = g (D.1772, x_2(D));

IPA CP cannot figure out a jump-function with the sq function as
value.  It _can_
for the call passing f:

    callsite  h/3 -> g/0 :
       param 0: UNKNOWN
         Aggregate passed by value:
           offset: 0, cst: sq
         Unknown alignment
       param 1: PASS THROUGH: 0, op nop_expr
         Unknown alignment
    callsite  h/3 -> dosomething/4 :
    callsite  h/3 -> g/0 :
       param 0: UNKNOWN
         Unknown alignment
       param 1: PASS THROUGH: 0, op nop_expr
         Unknown alignment

not sure how hard to fix that is, but it can't be too difficult.  Martin?

You may want to open a bugreport.

Richard.

>>> Cheers
>>> --
>>> Carlos

Reply via email to