https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85103

Jan Hubicka <hubicka at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |WAITING

--- Comment #18 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
I see, I can reproduce it when I remove __inline__ from the function and build
with -O3.  What happens is the following.  mainGtU is a quite large hand-unrolled
implementation of N^2logN sorting:

Bool mainGtU ( UInt32  i1, 
               UInt32  i2,
               UChar*  block, 
               UInt16* quadrant,
               UInt32  nblock,
               Int32*  budget )
{
   Int32  k;
   UChar  c1, c2;
   UInt16 s1, s2;

   AssertD ( i1 != i2, "mainGtU" );
   /* 1 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 2 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 3 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
   /* 4 */
   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;
...

We decide to split it after the first 5 conditionals for some reason.
The partial function is large:
IPA function summary for mainGtU.part.0/41 inlinable
  global time:     240.000000
  self size:       243
  global size:     243
  min size:       0
  self stack:      0
  global stack:    0
    size:200.000000, time:199.000000
    size:4.000000, time:2.000000,  executed if:(not inlined)
    size:10.000000, time:10.000000,  nonconst if:(op0 changed)
    size:10.000000, time:10.000000,  nonconst if:(op1 changed)
    size:9.000000, time:9.000000,  nonconst if:(op0 changed || op2 changed)
    size:9.000000, time:9.000000,  nonconst if:(op1 changed || op2 changed)
    size:1.000000, time:1.000000,  nonconst if:(op4 changed)
  calls:

While the outer function is:

IPA function summary for mainGtU/30 inlinable
  global time:     29.125000
  self size:       36
  global size:     36
  min size:       16
  self stack:      0
  global stack:    0
    size:15.000000, time:15.000000
    size:3.000000, time:2.000000,  executed if:(not inlined)
    size:3.000000, time:3.000000,  nonconst if:(op0 changed || op2 changed)
    size:3.000000, time:3.000000,  nonconst if:(op2 changed || op1 changed)
    size:2.000000, time:2.000000,  nonconst if:(op0 changed)
    size:2.000000, time:2.000000,  nonconst if:(op1 changed)
  calls:
    mainGtU.part.0/41 function not considered for inlining
      loop depth: 0 freq:0.12 size: 8 time: 17 callee size:121 stack: 0

The partial function has 200 instructions that we think can't be optimized (I
am not sure why we do not track accesses to individual block indices).
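
To make the split concrete, here is a rough sketch of what the two pieces
look like after function splitting.  It is an illustration only, not what GCC
actually emits: mainGtU_part_0 is a hypothetical stand-in for
mainGtU.part.0, the tail is reduced to a plain loop (the real code is
unrolled further and also uses quadrant and budget), and the precondition
that five bytes are readable at both indices is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned char UChar;
typedef uint32_t      UInt32;
typedef unsigned char Bool;

/* Hypothetical stand-in for mainGtU.part.0: the large tail that was split
   away.  Simplified to a loop over the remaining bytes. */
static Bool mainGtU_part_0(UInt32 i1, UInt32 i2,
                           const UChar *block, UInt32 nblock)
{
   while (i1 < nblock && i2 < nblock) {
      UChar c1 = block[i1], c2 = block[i2];
      if (c1 != c2) return (c1 > c2);
      i1++; i2++;
   }
   return 0;   /* no difference found in the bytes we looked at */
}

/* The small head that stays in mainGtU: the first five comparisons, then a
   call to the split-away part.  Written as a loop here for brevity; the
   original is unrolled by hand. */
Bool mainGtU(UInt32 i1, UInt32 i2, const UChar *block, UInt32 nblock)
{
   assert(i1 != i2);
   /* Assumption of this sketch: five bytes are readable at both indices. */
   assert(i1 + 5 <= nblock && i2 + 5 <= nblock);

   for (int step = 0; step < 5; step++) {
      UChar c1 = block[i1], c2 = block[i2];
      if (c1 != c2) return (c1 > c2);   /* cheap early exit */
      i1++; i2++;
   }
   return mainGtU_part_0(i1, i2, block, nblock);
}
```

The head is cheap to inline into callers because most comparisons exit early;
the tail is only reached when the first five bytes match.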

Later we indeed consider mainGtU.part before the split away part:
    Badness calculation for mainGtU/30 -> mainGtU.part.0/41
      size growth 231, time 238.000000 unspec 240.000000 
      -0.000181: guessed profile. frequency 0.125000, count -1 caller count -1
time w/o inlining 59.125000, time with inlining 56.750000 overall growth -12
(current) -12 (original) -12 (compensated)
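
The two time figures in this dump can be reproduced from the summaries
above.  The following is only my arithmetic reading of the numbers, not code
taken from the inliner; the function and parameter names are made up for
illustration.  It uses the caller time 29.125, the call frequency 0.125, the
call time 17, the callee time 240, and the specialized callee time 238.

```c
#include <assert.h>

/* Estimated caller time when the call is kept: the caller's own time plus
   the callee's time weighted by how often the call executes. */
double time_without_inlining(double caller_time, double freq,
                             double callee_time)
{
   assert(freq >= 0.0);
   return caller_time + freq * callee_time;
}

/* Estimated caller time after inlining: drop the cost of the call itself
   and add the callee body specialized for this call site. */
double time_with_inlining(double caller_time, double freq,
                          double call_time, double specialized_callee_time)
{
   assert(freq >= 0.0);
   return caller_time - freq * call_time + freq * specialized_callee_time;
}
```

With the values from the dump, time_without_inlining(29.125, 0.125, 240.0)
gives 59.125 and time_with_inlining(29.125, 0.125, 17.0, 238.0) gives 56.75,
matching the line above.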

Later we consider the individual calls to mainGtU:

 Estimated badness is -0.000001, frequency 7718.74.
    Badness calculation for mainSimpleSort/32 -> mainGtU/30
      size growth 256, time 54.750000 unspec 56.750000  big_speedup
      -0.000001: guessed profile. frequency 7718.740543, count -1 caller count
-1 time w/o inlining 827687.848633, time with inlining 681031.777832 overall
growth 501 (current) 39 (original) 1521 (compensated)
      Adjusted by hints -0.000001

and inline the first one because the speedup is considered to be big, but after
inlining the function becomes heavy and the remaining two are not inlined.

There is a mis-accounting bug in the time needed for execution of mainGtU. I
fixed it yesterday on trunk, which now has a more realistic time estimate for
the sequence of ifs:
IPA function summary for mainGtU.part.0/41 inlinable
  global time:     19.766641
  self size:       243
  global size:     243
  min size:       0
  self stack:      0
  global stack:    0
    size:200.000000, time:9.771543
    size:4.000000, time:2.004863,  executed if:(not inlined)
    size:10.000000, time:1.998047,  nonconst if:(op0 changed)
    size:10.000000, time:1.998047,  nonconst if:(op1 changed)
    size:9.000000, time:1.996094,  nonconst if:(op0 changed || op2 changed)
    size:9.000000, time:1.996094,  nonconst if:(op1 changed || op2 changed)
    size:1.000000, time:0.001953,  nonconst if:(op4 changed)
  calls:

which causes it to be inlined. Does it solve the performance problem for both
benchmarks?
