On Thu, Feb 4, 2016 at 9:10 PM, Carlos Pita <carlosjosep...@gmail.com> wrote: > PS 2 (last one, I swear): I've isolated what I think is the root of > the problem. When einline expands g, there is plenty of call sites for > f.call, so the full redundancy elimination pass replaces sum for > f.call, making things easy for the late ipa inliner. But when g is not > early inlined, there is only one call site for the global f.call and > another one for the local/literal f.call, so the fre pass just lets > them be. This is innocuous from the fre point of view, but disables > further inlining as described above. > > On Thu, Feb 4, 2016 at 3:05 PM, Carlos Pita <carlosjosep...@gmail.com> wrote: >> PS: I framed the issue between the inline_param and fixup_cfg passes >> because I was only looking at the tree passes, but the really relevant >> passes are tree-einline (obviously) and ipa-inline, which happens >> between tree-inline_param2 and tree-fixup_cfg2. So, restating the >> problem: if early inline is not happening, late inline will miss the >> chance to inline the reference from the static const struct member. >> >> On Thu, Feb 4, 2016 at 1:08 PM, Carlos Pita <carlosjosep...@gmail.com> wrote: >>> Hi all, >>> >>> I've been trying to understand some bizarre interaction between >>> optimizing passes I've observed while compiling a heavily nested >>> inlined numerical code of mine. I managed to reduce the issue down to >>> this simple code: >>> >>> ``` test.c >>> >>> typedef struct F { >>> int (*call)(int); >>> } F; >>> >>> static int g(F f, int x) { >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> x = f.call(x); >>> return x; >>> } >>> >>> static int sq(int x) { >>> return x * x; >>> } >>> >>> static const F f = {sq}; >>> >>> void dosomething(int); >>> >>> int h(int x) { >>> dosomething(g(f, x)); >>> dosomething(g((F){sq}, x)); >>> } >>> >>> ``` >>> >>> Here we have a driver function h calling the workhorse g which >>> delegates some simple task to the inline-wannabe f. The distinctive >>> aspect of the above scheme is that f is referenced from a struct >>> member. The first call to g passes a static const struct while the >>> second call passes a compound literal (alternatively, a local version >>> of the struct will have the same effect regarding what follows). >>> >>> Now, say I compile this code with: >>> >>> gcc -O3 -fdump-tree-all --param early-inlining-insns=10 -c test.c >>> >>> The einline pass will not be able to inline calls to g with such a low >>> value for early-inlining-insns. >>> >>> The inline_param2 pass still shows: >>> >>> ``` >>> h (int x) >>> { >>> struct F D.1847; >>> int _4; >>> int _8; >>> >>> <bb 2>: >>> _4 = g (f, x_2(D)); >>> dosomething (_4); >>> D.1847.call = sq; >>> _8 = g (D.1847, x_2(D)); >>> dosomething (_8); >>> return; >>> >>> } >>> >>> ``` >>> >>> The next tree pass is fixup_cfg4, which does the inline but just for >>> the second all to g: >>> >>> ``` >>> h (int x) >>> { >>> .... >>> >>> <bb 2>: >>> f = f; >>> f$call_7 = MEM[(struct F *)&f]; >>> x_19 = f$call_7 (x_2(D)); >>> x_20 = f$call_7 (x_19); >>> x_21 = f$call_7 (x_20); >>> x_22 = f$call_7 (x_21); >>> x_23 = f$call_7 (x_22); >>> x_24 = f$call_7 (x_23); >>> x_25 = f$call_7 (x_24); >>> x_26 = f$call_7 (x_25); >>> _43 = x_26; >>> _4 = _43; >>> dosomething (_4); >>> D.1847.call = sq; >>> f = D.1847; >>> f$call_10 = MEM[(struct F *)&f]; >>> _33 = x_2(D) * x_2(D); >>> _45 = _33; >>> x_11 = _45; >>> _32 = x_11 * x_11; >>> _46 = _32; >>> x_12 = _46; >>> _31 = x_12 * x_12; >>> _47 = _31; >>> x_13 = _47; >>> _30 = x_13 * x_13; >>> _48 = _30; >>> x_14 = _48; >>> _29 = x_14 * x_14; >>> _49 = _29; >>> x_15 = _49; >>> _28 = x_15 * x_15; >>> _50 = _28; >>> x_16 = _50; >>> _27 = x_16 * x_16; >>> _51 = _27; >>> x_17 = _51; >>> _3 = x_17 * x_17; >>> _52 = _3; >>> x_18 = _52; >>> _53 = x_18; >>> _8 = _53; >>> dosomething (_8); >>> return; >>> >>> } >>> ``` >>> >>> Now, say I recompile the code with a larger early-inlining-insns, so >>> that einline is able to early inline both calls to g: >>> >>> gcc -O3 -fdump-tree-all --param early-inlining-insns=50 -c test.c >>> >>> After inline_param2 (that is, before fixup_cfg4), we now have: >>> >>> ``` >>> h (int x) >>> { >>> int x; >>> int x; >>> >>> <bb 2>: >>> x_13 = sq (x_2(D)); >>> x_14 = sq (x_13); >>> x_15 = sq (x_14); >>> x_16 = sq (x_15); >>> x_17 = sq (x_16); >>> x_18 = sq (x_17); >>> x_19 = sq (x_18); >>> x_20 = sq (x_19); >>> dosomething (x_20); >>> x_5 = sq (x_2(D)); >>> x_6 = sq (x_5); >>> x_7 = sq (x_6); >>> x_8 = sq (x_7); >>> x_9 = sq (x_8); >>> x_10 = sq (x_9); >>> x_11 = sq (x_10); >>> x_12 = sq (x_11); >>> dosomething (x_12); >>> return; >>> >>> } >>> ``` >>> >>> And fixup_cfg4 is able to do its job for both calls: >>> >>> ``` >>> h (int x) >>> { >>> .... >>> >>> <bb 2>: >>> _36 = x_2(D) * x_2(D); >>> _37 = _36; >>> x_13 = _37; >>> _35 = x_13 * x_13; >>> _38 = _35; >>> x_14 = _38; >>> _34 = x_14 * x_14; >>> _39 = _34; >>> x_15 = _39; >>> _33 = x_15 * x_15; >>> _40 = _33; >>> x_16 = _40; >>> _32 = x_16 * x_16; >>> _41 = _32; >>> x_17 = _41; >>> _31 = x_17 * x_17; >>> _42 = _31; >>> x_18 = _42; >>> _30 = x_18 * x_18; >>> _43 = _30; >>> x_19 = _43; >>> _29 = x_19 * x_19; >>> _44 = _29; >>> x_20 = _44; >>> dosomething (x_20); >>> _28 = x_2(D) * x_2(D); >>> _45 = _28; >>> x_5 = _45; >>> _27 = x_5 * x_5; >>> _46 = _27; >>> x_6 = _46; >>> _26 = x_6 * x_6; >>> _47 = _26; >>> x_7 = _47; >>> _25 = x_7 * x_7; >>> _48 = _25; >>> x_8 = _48; >>> _24 = x_8 * x_8; >>> _49 = _24; >>> x_9 = _49; >>> _23 = x_9 * x_9; >>> _50 = _23; >>> x_10 = _50; >>> _22 = x_10 * x_10; >>> _51 = _22; >>> x_11 = _51; >>> _21 = x_11 * x_11; >>> _52 = _21; >>> x_12 = _52; >>> dosomething (x_12); >>> return; >>> >>> } >>> ``` >>> >>> The bottom line is that I get full inlining if einline manages to >>> early inline both g calls, but I get incomplete inlining otherwise. I >>> guess the problem is that fixup_cfg4 is not able to infer that >>> f$call_7 is just sq in disguise when f is the global static const >>> struct but it is able to get it when it's a local or literal one. In >>> case einline expands the code early the successive passes will make >>> fixup_cfg4 see just sq in both cases, making inlining of sq a trivial >>> matter. But if einline hits its hard limits, fixup_cfg4 will have to >>> figure out that f$call is sq by itself. >>> >>> I'm not sure whether this should be considered a proper bug or more of >>> a quirk of the inlining system one must learn to live with. In the >>> first case, I'll report it if you ask me to do it. In the second case, >>> I would like to ask for some advice about the best way to cope with >>> this scenario (besides blindly incrementing early-inlining-insns); I >>> can provide more background regarding my real use case if necessary.
I think the real missed optimization is that for the passed aggregate D.1772 D.1772.call = sq; _8 = g (D.1772, x_2(D)); IPA CP cannot figure out a jump-function with the sq function as value. It _can_ for the call passing f: callsite h/3 -> g/0 : param 0: UNKNOWN Aggregate passed by value: offset: 0, cst: sq Unknown alignment param 1: PASS THROUGH: 0, op nop_expr Unknown alignment callsite h/3 -> dosomething/4 : callsite h/3 -> g/0 : param 0: UNKNOWN Unknown alignment param 1: PASS THROUGH: 0, op nop_expr Unknown alignment not sure how hard to fix that is, but it can't be too difficult. Martin? You may want to open a bugreport. Richard. >>> Cheers >>> -- >>> Carlos