14 regression] Wireshark compilation hangs with -O2 -fipa-pta

rguenth at gcc dot gnu.org via Gcc-bugs Wed, 31 May 2023 03:37:20 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109983


--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Bad:

Samples: 4M of event 'cycles:u', Event count (approx.): 6574968845607           
Overhead       Samples  Command  Shared Object       Symbol                     
  59.44%       2803407  cc1      cc1                 [.] bitmap_ior_into
  23.71%       1119062  cc1      cc1                 [.] bitmap_bit_p
  12.30%        583673  cc1      cc1                 [.] bitmap_set_bit
   1.13%         53698  cc1      cc1                 [.] find
   1.09%         51770  cc1      cc1                 [.] solve_constraints
   0.55%         26265  cc1      cc1                 [.] add_graph_edge
   0.52%         24726  cc1      cc1                 [.] solve_add_graph_edge
   0.34%         16328  cc1      cc1                 [.]
find_what_var_points_to
   0.23%         11259  cc1      cc1                 [.] solution_set_expand
   0.14%          7015  cc1      cc1                 [.] ipa_pta_execute
   0.13%          6296  cc1      cc1                 [.] topo_visit
   0.06%          2708  cc1      cc1                 [.] auto_var_in_fn_p

        1157768.85 msec task-clock:u                     #    1.000 CPUs
utilized          
                 0      context-switches:u               #    0.000 /sec        
                 0      cpu-migrations:u                 #    0.000 /sec        
            712310      page-faults:u                    #  615.244 /sec        
     6492736607297      cycles:u                         #    5.608 GHz        
             (83.33%)
         629989262      stalled-cycles-frontend:u        #    0.01% frontend
cycles idle     (83.33%)
        1079567639      stalled-cycles-backend:u         #    0.02% backend
cycles idle      (83.33%)
     3618200824104      instructions:u                   #    0.56  insn per
cycle         
                                                  #    0.00  stalled cycles per
insn  (83.33%)
      976059766707      branches:u                       #  843.052 M/sec      
             (83.33%)
        2658800911      branch-misses:u                  #    0.27% of all
branches          (83.33%)

    1158.070038218 seconds time elapsed

    1157.021441000 seconds user
       0.743804000 seconds sys

Samples: 4M of event 'cache-misses:u', Event count (approx.): 30071477326       
Overhead       Samples  Command  Shared Object       Symbol                     
  67.54%       2723606  cc1      cc1                 [.] bitmap_ior_into
  16.50%        994335  cc1      cc1                 [.] bitmap_bit_p
   3.33%        121544  cc1      cc1                 [.] bitmap_set_bit


Good:

Samples: 390K of event 'cycles:u', Event count (approx.): 532564428833          
Overhead       Samples  Command  Shared Object       Symbol                     
  73.09%        280949  cc1      cc1                 [.] bitmap_set_bit
   3.70%         14919  cc1      cc1                 [.] find
   3.54%         13591  cc1      cc1                 [.]
find_what_var_points_to
   3.48%         13952  cc1      cc1                 [.] bitmap_bit_p
   2.26%          8969  cc1      cc1                 [.] bitmap_ior_into
   2.24%          9027  cc1      cc1                 [.] add_graph_edge
   2.03%          8176  cc1      cc1                 [.] solve_add_graph_edge
   1.75%          7072  cc1      cc1                 [.] solution_set_expand
   1.41%          5620  cc1      cc1                 [.] ipa_pta_execute
   1.34%          5406  cc1      cc1                 [.] solve_constraints

          98853.84 msec task-clock:u                     #    1.000 CPUs
utilized          
                 0      context-switches:u               #    0.000 /sec        
                 0      cpu-migrations:u                 #    0.000 /sec        
            558875      page-faults:u                    #    5.654 K/sec       
      533990826593      cycles:u                         #    5.402 GHz        
             (83.33%)
         387751641      stalled-cycles-frontend:u        #    0.07% frontend
cycles idle     (83.33%)
         236682949      stalled-cycles-backend:u         #    0.04% backend
cycles idle      (83.33%)
     1084066822948      instructions:u                   #    2.03  insn per
cycle         
                                                  #    0.00  stalled cycles per
insn  (83.33%)
      276015818957      branches:u                       #    2.792 G/sec      
             (83.34%)
         899265351      branch-misses:u                  #    0.33% of all
branches          (83.33%)

      98.878776190 seconds time elapsed

      98.459437000 seconds user
       0.395912000 seconds sys

Samples: 140K of event 'cache-misses:u', Event count (approx.): 4272440656      
Overhead       Samples  Command  Shared Object       Symbol                     
  17.92%          6124  cc1      cc1                 [.] ipa_pta_execute
  11.93%         28566  cc1      cc1                 [.] bitmap_set_bit
  10.48%         14489  cc1      cc1                 [.] find


But the key observation is from the stats.  In the bad case we have

Points-to Stats:
Total vars:               108033
Non-pointer vars:          59
Statically unified vars:  20147
Dynamically unified vars: 0
Iterations:               4
Number of edges:          141447426
Number of implicit edges: 215676
Points to sets created:15930

and the good case sees

Points-to Stats:
Total vars:               108033
Non-pointer vars:          59
Statically unified vars:  20147
Dynamically unified vars: 0
Iterations:               10
Number of edges:          582729
Number of implicit edges: 215676
Points to sets created:15930

so we have a _lot_ more (242x) edges in the slow case (but also only 4
solver iterations) while patched we somehow do with less edges (and
thus way less propagations aka bitmap_ior) but need 2.5x the number of
solver iterations.

The final points-to solutions are 1:1 identical (semantically, I did
not look at the ESCAPED set here).

The key is the heuristic in add_graph_edge triggers way more often:

      /* The graph solving process does not avoid "triangles", thus
         there can be multiple paths from a node to another involving
         intermediate other nodes.  That causes extra copying which is
         most difficult to avoid when the intermediate node is ESCAPED
         because there are no edges added from ESCAPED.  Avoid
         adding the direct edge FROM -> TO when we have FROM -> ESCAPED
         and TO contains ESCAPED. 
         ???  Note this is only a heuristic, it does not prevent the
         situation from occuring.  The heuristic helps PR38474 and
         PR99912 significantly.  */
      if (to < FIRST_REF_NODE
          && bitmap_bit_p (graph->succs[from], find (escaped_id))
          && bitmap_bit_p (get_varinfo (find (to))->solution, escaped_id))
        {
          stats.num_avoided_edges++;
          return false;
        }

I've added num_avoided_edges and for the fast case we see

Number of avoided edges: 870215056 (that's 6x the number of edges in the slow
case!)

For the slow case we have

Number of avoided edges: 1113769444

which is even more.  But I think that should be the only reason the
number of edges created differs.  Converging more slowly means
executing ESCAPED = *ESCAPED more often which creates edges to
escaped earlier and thus possibly triggering the above.  But it
also means the 'to' solution might not yet contain ESCAPED ...

ICK.  I can get it to

 ipa points-to                      :   0.40 ( 14%)   0.03 (  5%)   0.43 ( 12%)
 1950k (  1%)
 TOTAL                              :   2.87          0.61          3.49       
  253M

when I avoid defering copies to ESCAPED itself (that makes the first
ESCAPED = *ESCAPED do more but delays increasing other solutions
to followup iterations).  In fact it probably special-cases the
ESCAPED = *ESCAPED complex constraint which we also locally iterate
until convergence before visiting other constraints.

Note the initial graph tends to be very unconnected thus the order
is somewhat random but the way we compute the topological sorting
and process the worklist ensures ESCAPED = *ESCAPED is done as something
very last in the first iteration (if there are no existing edges to
ESCAPED - I think we do have those for calls though).

It does help to order the ESCAPED connected component first:

Points-to Stats:
Total vars:               108033
Non-pointer vars:          59
Statically unified vars:  20147
Dynamically unified vars: 0
Iterations:               3
Number of edges:          239596
Number of implicit edges: 215676
Points to sets created:15930

 ipa points-to                      :   0.28 ( 10%)   0.05 (  8%)   0.34 ( 10%)
 2191k (  1%)
 TOTAL                              :   2.85          0.60          3.47       
  253M

I'm testing this approach now.

[Bug ipa/109983] [12/13/14 regression] Wireshark compilation hangs with -O2 -fipa-pta

Reply via email to