https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85103
Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |WAITING

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I see, I can reproduce it when I remove __inline__ from the function and build
with -O3. What happens is the following. mainGtU is a quite large, hand-unrolled
implementation of N^2logN sorting:

Bool mainGtU ( UInt32  i1,
               UInt32  i2,
               UChar*  block,
               UInt16* quadrant,
               UInt32  nblock,
               Int32*  budget )
{
   Int32  k;
   UChar  c1, c2;
   UInt16 s1, s2;

   AssertD ( i1 != i2, "mainGtU" );
   /* 1 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 2 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 3 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 4 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   ...

We decide to split it after the first 5 conditionals for some reason.
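To picture the split described above, here is a rough hand-written C sketch of what function splitting does to such a function: a small head with the cheap early exits stays in the original function, and the rest moves into a separate ".part" function. Everything in it is an assumption made for illustration: the names mainGtU_sketch and mainGtU_tail_sketch, the reduced argument list (no quadrant or budget), the loops standing in for the unrolled code, and the requirement that block has HEAD_CHECKS readable bytes past nblock. It is not what GCC actually emits.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned char UChar;
typedef uint32_t      UInt32;
typedef unsigned char Bool;

#define HEAD_CHECKS 5   /* number of early-exit tests kept in the head */

/* Hypothetical stand-in for the split-away part (mainGtU.part.0 in the
   dumps).  Simplified: keeps comparing bytes with wraparound for at most
   nblock steps instead of reproducing the unrolled original. */
static Bool mainGtU_tail_sketch(UInt32 i1, UInt32 i2,
                                const UChar *block, UInt32 nblock)
{
    assert(nblock > 0);
    for (UInt32 k = 0; k < nblock; k++) {
        UChar c1 = block[i1 % nblock];
        UChar c2 = block[i2 % nblock];
        if (c1 != c2) return c1 > c2;
        i1++; i2++;
    }
    return 0;   /* all compared bytes were equal */
}

/* Hypothetical head: small enough to be an attractive inline candidate.
   Assumes block[] has HEAD_CHECKS readable bytes past nblock. */
Bool mainGtU_sketch(UInt32 i1, UInt32 i2,
                    const UChar *block, UInt32 nblock)
{
    assert(i1 != i2);
    for (int k = 0; k < HEAD_CHECKS; k++) {   /* unrolled in the original */
        UChar c1 = block[i1], c2 = block[i2];
        if (c1 != c2) return c1 > c2;         /* cheap early exit */
        i1++; i2++;
    }
    /* Only reached when the first HEAD_CHECKS bytes match. */
    return mainGtU_tail_sketch(i1, i2, block, nblock);
}
```

The point of the split is that callers can inline the small head cheaply, while the large tail is only called when the early exits do not fire.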
The partial function is large:

IPA function summary for mainGtU.part.0/41 inlinable
  global time:     240.000000
  self size:       243
  global size:     243
  min size:        0
  self stack:      0
  global stack:    0
    size:200.000000, time:199.000000
    size:4.000000, time:2.000000,  executed if:(not inlined)
    size:10.000000, time:10.000000,  nonconst if:(op0 changed)
    size:10.000000, time:10.000000,  nonconst if:(op1 changed)
    size:9.000000, time:9.000000,  nonconst if:(op0 changed || op2 changed)
    size:9.000000, time:9.000000,  nonconst if:(op1 changed || op2 changed)
    size:1.000000, time:1.000000,  nonconst if:(op4 changed)
  calls:

While the outer function is:

IPA function summary for mainGtU/30 inlinable
  global time:     29.125000
  self size:       36
  global size:     36
  min size:        16
  self stack:      0
  global stack:    0
    size:15.000000, time:15.000000
    size:3.000000, time:2.000000,  executed if:(not inlined)
    size:3.000000, time:3.000000,  nonconst if:(op0 changed || op2 changed)
    size:3.000000, time:3.000000,  nonconst if:(op2 changed || op1 changed)
    size:2.000000, time:2.000000,  nonconst if:(op0 changed)
    size:2.000000, time:2.000000,  nonconst if:(op1 changed)
  calls:
    mainGtU.part.0/41 function not considered for inlining
      loop depth: 0 freq:0.12 size: 8 time: 17 callee size:121 stack: 0

The split part has 200 instructions that we think can't be optimized (I am not
sure why we do not track accesses to individual block indices).

Later we indeed consider mainGtU.part before the split away part:

    Badness calculation for mainGtU/30 -> mainGtU.part.0/41
      size growth 231, time 238.000000 unspec 240.000000
      -0.000181: guessed profile. frequency 0.125000, count -1 caller count -1
      time w/o inlining 59.125000, time with inlining 56.750000
      overall growth -12 (current) -12 (original) -12 (compensated)

Later we consider the individual parts:

  Estimated badness is -0.000001, frequency 7718.74.
    Badness calculation for mainSimpleSort/32 -> mainGtU/30
      size growth 256, time 54.750000 unspec 56.750000 big_speedup
      -0.000001: guessed profile. frequency 7718.740543, count -1 caller count -1
      time w/o inlining 827687.848633, time with inlining 681031.777832
      overall growth 501 (current) 39 (original) 1521 (compensated)
      Adjusted by hints -0.000001

and inline the first one because the speedup is considered to be big, but after
inlining the function becomes heavy and the remaining two are not inlined.

There is a mis-accounting bug in the time needed for execution of mainGtU. I
fixed it yesterday for trunk, which now has a more realistic time estimate for
the sequence of ifs:

IPA function summary for mainGtU.part.0/41 inlinable
  global time:     19.766641
  self size:       243
  global size:     243
  min size:        0
  self stack:      0
  global stack:    0
    size:200.000000, time:9.771543
    size:4.000000, time:2.004863,  executed if:(not inlined)
    size:10.000000, time:1.998047,  nonconst if:(op0 changed)
    size:10.000000, time:1.998047,  nonconst if:(op1 changed)
    size:9.000000, time:1.996094,  nonconst if:(op0 changed || op2 changed)
    size:9.000000, time:1.996094,  nonconst if:(op1 changed || op2 changed)
    size:1.000000, time:0.001953,  nonconst if:(op4 changed)
  calls:

which makes it get inlined. Does it solve the performance problem for both
benchmarks?
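
As a cross-check on the two badness dumps quoted above, the relative time saved by inlining can be recomputed from their "time w/o inlining" and "time with inlining" fields. The helper below is plain arithmetic on those quoted numbers. Its name is made up, and it is not a reproduction of GCC's actual big_speedup test.

```c
#include <assert.h>

/* Hypothetical helper: fraction of estimated time saved by inlining,
   (time_without - time_with) / time_without. */
static double relative_speedup(double time_without, double time_with)
{
    assert(time_without > 0.0);
    return (time_without - time_with) / time_without;
}
```

Fed with the dump values, relative_speedup(59.125, 56.75) comes to about 4% for mainGtU/30 -> mainGtU.part.0/41, and relative_speedup(827687.848633, 681031.777832) comes to about 17.7% for mainSimpleSort/32 -> mainGtU/30, the edge that carries the big_speedup hint.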