https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85103
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |WAITING
--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I see, I can reproduce it when I remove __inline__ from the function and build
with -O3. What happens is the following. mainGtU is a quite large, hand-unrolled
implementation of N^2logN sorting:
Bool mainGtU ( UInt32  i1,
               UInt32  i2,
               UChar*  block,
               UInt16* quadrant,
               UInt32  nblock,
               Int32*  budget )
{
   Int32  k;
   UChar  c1, c2;
   UInt16 s1, s2;
   AssertD ( i1 != i2, "mainGtU" );
   /* 1 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 2 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 3 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 4 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
...
We decide to split it after the first 5 conditionals for some reason.
The split-off partial function is large:
IPA function summary for mainGtU.part.0/41 inlinable
global time: 240.000000
self size: 243
global size: 243
min size: 0
self stack: 0
global stack: 0
size:200.000000, time:199.000000
size:4.000000, time:2.000000, executed if:(not inlined)
size:10.000000, time:10.000000, nonconst if:(op0 changed)
size:10.000000, time:10.000000, nonconst if:(op1 changed)
size:9.000000, time:9.000000, nonconst if:(op0 changed || op2 changed)
size:9.000000, time:9.000000, nonconst if:(op1 changed || op2 changed)
size:1.000000, time:1.000000, nonconst if:(op4 changed)
calls:
While the outer function is:
IPA function summary for mainGtU/30 inlinable
global time: 29.125000
self size: 36
global size: 36
min size: 16
self stack: 0
global stack: 0
size:15.000000, time:15.000000
size:3.000000, time:2.000000, executed if:(not inlined)
size:3.000000, time:3.000000, nonconst if:(op0 changed || op2 changed)
size:3.000000, time:3.000000, nonconst if:(op2 changed || op1 changed)
size:2.000000, time:2.000000, nonconst if:(op0 changed)
size:2.000000, time:2.000000, nonconst if:(op1 changed)
calls:
mainGtU.part.0/41 function not considered for inlining
loop depth: 0 freq:0.12 size: 8 time: 17 callee size:121 stack: 0
So the split part has 200 instructions that we think can't be optimized (I am
not sure why we do not track accesses to individual block indices).
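To make the shape of the split concrete, here is a minimal sketch of what function splitting produces for a chain of early-exit comparisons. The names gt_head and gt_tail are hypothetical stand-ins for mainGtU and mainGtU.part.0, the tail is simplified to a plain loop, and indices wrap with a modulo so the sketch stays in bounds; the real bzip2 code is different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the large split-off part (mainGtU.part.0).
   It keeps comparing byte by byte, wrapping around the block of n bytes. */
static bool gt_tail(const uint8_t *block, uint32_t i1, uint32_t i2, uint32_t n)
{
    for (uint32_t k = 0; k < n; k++) {
        uint8_t c1 = block[(i1 + k) % n];
        uint8_t c2 = block[(i2 + k) % n];
        if (c1 != c2)
            return c1 > c2;
    }
    return false; /* all bytes equal */
}

/* Hypothetical stand-in for the small header that stays in mainGtU:
   the first 5 early-exit comparisons, then a call to the big tail. */
static bool gt_head(const uint8_t *block, uint32_t i1, uint32_t i2, uint32_t n)
{
    for (int k = 0; k < 5; k++) {
        uint8_t c1 = block[i1];
        uint8_t c2 = block[i2];
        if (c1 != c2)
            return c1 > c2;
        i1 = (i1 + 1) % n;
        i2 = (i2 + 1) % n;
    }
    return gt_tail(block, i1, i2, n);
}
```

The point of such a split is that the small header is cheap to inline into callers, while the call to the tail is reached only when the first comparisons all tie.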
Later we indeed consider inlining mainGtU.part.0 back into mainGtU first:
Badness calculation for mainGtU/30 -> mainGtU.part.0/41
size growth 231, time 238.000000 unspec 240.000000
-0.000181: guessed profile. frequency 0.125000, count -1 caller count -1
time w/o inlining 59.125000, time with inlining 56.750000 overall growth -12
(current) -12 (original) -12 (compensated)
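The numbers in this badness dump can be reproduced from the two function summaries above. The sketch below is only a reading of the dumped values, not GCC's actual code, and the function names are made up for illustration: the caller time is 29.125, the edge frequency is 0.125, the callee time is 240, the call itself costs time 17 and size 8, and the "executed if:(not inlined)" entry of the callee costs time 2 and size 4.

```c
/* Assumed model: the caller pays for the callee once per execution of
   the call edge, scaled by the edge frequency. */
static double time_without_inlining(double caller_time, double freq,
                                    double callee_time)
{
    return caller_time + freq * callee_time;
}

/* Assumed model: inlining removes the call overhead and the callee's
   "executed if:(not inlined)" part, both scaled by the edge frequency. */
static double time_with_inlining(double time_without, double freq,
                                 double call_time, double not_inlined_time)
{
    return time_without - freq * (call_time + not_inlined_time);
}

/* Assumed model: the caller grows by the callee body minus the call
   statement and minus the part needed only when not inlined. */
static int size_growth(int callee_size, int call_size, int not_inlined_size)
{
    return callee_size - call_size - not_inlined_size;
}
```

With the values from the dumps, time_without_inlining(29.125, 0.125, 240.0) gives 59.125, time_with_inlining(59.125, 0.125, 17.0, 2.0) gives 56.75, and size_growth(243, 8, 4) gives 231, matching the dump. The overall growth of -12 is presumably 231 minus the 243 of the offline copy that is no longer needed.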
Later we consider the individual calls:
Estimated badness is -0.000001, frequency 7718.74.
Badness calculation for mainSimpleSort/32 -> mainGtU/30
size growth 256, time 54.750000 unspec 56.750000 big_speedup
-0.000001: guessed profile. frequency 7718.740543, count -1 caller count
-1 time w/o inlining 827687.848633, time with inlining 681031.777832 overall
growth 501 (current) 39 (original) 1521 (compensated)
Adjusted by hints -0.000001
and inline the first one because the speedup is considered to be big, but after
inlining the function becomes heavy and the remaining two are not inlined.
There is a mis-accounting bug in the time needed for execution of mainGtU. I
fixed it yesterday on trunk, which now has a more realistic time estimate for the
sequence of ifs:
IPA function summary for mainGtU.part.0/41 inlinable
global time: 19.766641
self size: 243
global size: 243
min size: 0
self stack: 0
global stack: 0
size:200.000000, time:9.771543
size:4.000000, time:2.004863, executed if:(not inlined)
size:10.000000, time:1.998047, nonconst if:(op0 changed)
size:10.000000, time:1.998047, nonconst if:(op1 changed)
size:9.000000, time:1.996094, nonconst if:(op0 changed || op2 changed)
size:9.000000, time:1.996094, nonconst if:(op1 changed || op2 changed)
size:1.000000, time:0.001953, nonconst if:(op4 changed)
calls:
which makes it get inlined. Does it solve the performance problem for both
benchmarks?
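For reference, the new per-entry times are consistent with each statement being weighted by a frequency that halves at every early-exit conditional. That 50% figure is inferred from the dumped numbers only, not taken from the GCC sources, and the helper names below are hypothetical.

```c
/* Frequency of code reached after passing k early-exit conditionals,
   assuming each one is passed with probability p. */
static double freq_after(int k, double p)
{
    double freq = 1.0;
    for (int i = 0; i < k; i++)
        freq *= p;
    return freq;
}

/* Expected time of n unit-cost statements where statement k
   (k = 0 .. n-1) runs with frequency p^k. */
static double chain_time(int n, double p)
{
    double total = 0.0;
    for (int k = 0; k < n; k++)
        total += freq_after(k, p);
    return total;
}
```

Under this assumption chain_time(10, 0.5) is 1.998046875 and chain_time(9, 0.5) is 1.99609375, which round to the 1.998047 and 1.996094 shown for the entries of size 10 and 9, and freq_after(9, 0.5) is 0.001953125, matching the entry of size 1. The old summary charged the same entries their full size as time (10, 9 and 1).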