https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109983
--- Comment #12 from Richard Biener <rguenth at gcc dot gnu.org> ---
Bad:
Samples: 4M of event 'cycles:u', Event count (approx.): 6574968845607
Overhead Samples Command Shared Object Symbol
59.44% 2803407 cc1 cc1 [.] bitmap_ior_into
23.71% 1119062 cc1 cc1 [.] bitmap_bit_p
12.30% 583673 cc1 cc1 [.] bitmap_set_bit
1.13% 53698 cc1 cc1 [.] find
1.09% 51770 cc1 cc1 [.] solve_constraints
0.55% 26265 cc1 cc1 [.] add_graph_edge
0.52% 24726 cc1 cc1 [.] solve_add_graph_edge
0.34% 16328 cc1 cc1 [.]
find_what_var_points_to
0.23% 11259 cc1 cc1 [.] solution_set_expand
0.14% 7015 cc1 cc1 [.] ipa_pta_execute
0.13% 6296 cc1 cc1 [.] topo_visit
0.06% 2708 cc1 cc1 [.] auto_var_in_fn_p
1157768.85 msec task-clock:u # 1.000 CPUs
utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
712310 page-faults:u # 615.244 /sec
6492736607297 cycles:u # 5.608 GHz
(83.33%)
629989262 stalled-cycles-frontend:u # 0.01% frontend
cycles idle (83.33%)
1079567639 stalled-cycles-backend:u # 0.02% backend
cycles idle (83.33%)
3618200824104 instructions:u # 0.56 insn per
cycle
# 0.00 stalled cycles per
insn (83.33%)
976059766707 branches:u # 843.052 M/sec
(83.33%)
2658800911 branch-misses:u # 0.27% of all
branches (83.33%)
1158.070038218 seconds time elapsed
1157.021441000 seconds user
0.743804000 seconds sys
Samples: 4M of event 'cache-misses:u', Event count (approx.): 30071477326
Overhead Samples Command Shared Object Symbol
67.54% 2723606 cc1 cc1 [.] bitmap_ior_into
16.50% 994335 cc1 cc1 [.] bitmap_bit_p
3.33% 121544 cc1 cc1 [.] bitmap_set_bit
Good:
Samples: 390K of event 'cycles:u', Event count (approx.): 532564428833
Overhead Samples Command Shared Object Symbol
73.09% 280949 cc1 cc1 [.] bitmap_set_bit
3.70% 14919 cc1 cc1 [.] find
3.54% 13591 cc1 cc1 [.]
find_what_var_points_to
3.48% 13952 cc1 cc1 [.] bitmap_bit_p
2.26% 8969 cc1 cc1 [.] bitmap_ior_into
2.24% 9027 cc1 cc1 [.] add_graph_edge
2.03% 8176 cc1 cc1 [.] solve_add_graph_edge
1.75% 7072 cc1 cc1 [.] solution_set_expand
1.41% 5620 cc1 cc1 [.] ipa_pta_execute
1.34% 5406 cc1 cc1 [.] solve_constraints
98853.84 msec task-clock:u # 1.000 CPUs
utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
558875 page-faults:u # 5.654 K/sec
533990826593 cycles:u # 5.402 GHz
(83.33%)
387751641 stalled-cycles-frontend:u # 0.07% frontend
cycles idle (83.33%)
236682949 stalled-cycles-backend:u # 0.04% backend
cycles idle (83.33%)
1084066822948 instructions:u # 2.03 insn per
cycle
# 0.00 stalled cycles per
insn (83.33%)
276015818957 branches:u # 2.792 G/sec
(83.34%)
899265351 branch-misses:u # 0.33% of all
branches (83.33%)
98.878776190 seconds time elapsed
98.459437000 seconds user
0.395912000 seconds sys
Samples: 140K of event 'cache-misses:u', Event count (approx.): 4272440656
Overhead Samples Command Shared Object Symbol
17.92% 6124 cc1 cc1 [.] ipa_pta_execute
11.93% 28566 cc1 cc1 [.] bitmap_set_bit
10.48% 14489 cc1 cc1 [.] find
But the key observation is from the stats. In the bad case we have
Points-to Stats:
Total vars: 108033
Non-pointer vars: 59
Statically unified vars: 20147
Dynamically unified vars: 0
Iterations: 4
Number of edges: 141447426
Number of implicit edges: 215676
Points to sets created:15930
and the good case sees
Points-to Stats:
Total vars: 108033
Non-pointer vars: 59
Statically unified vars: 20147
Dynamically unified vars: 0
Iterations: 10
Number of edges: 582729
Number of implicit edges: 215676
Points to sets created:15930
so we have a _lot_ more (242x) edges in the slow case (but also only 4
solver iterations) while patched we somehow do with less edges (and
thus way less propagations aka bitmap_ior) but need 2.5x the number of
solver iterations.
The final points-to solutions are 1:1 identical (semantically, I did
not look at the ESCAPED set here).
The key is the heuristic in add_graph_edge triggers way more often:
/* The graph solving process does not avoid "triangles", thus
there can be multiple paths from a node to another involving
intermediate other nodes. That causes extra copying which is
most difficult to avoid when the intermediate node is ESCAPED
because there are no edges added from ESCAPED. Avoid
adding the direct edge FROM -> TO when we have FROM -> ESCAPED
and TO contains ESCAPED.
??? Note this is only a heuristic, it does not prevent the
situation from occuring. The heuristic helps PR38474 and
PR99912 significantly. */
if (to < FIRST_REF_NODE
&& bitmap_bit_p (graph->succs[from], find (escaped_id))
&& bitmap_bit_p (get_varinfo (find (to))->solution, escaped_id))
{
stats.num_avoided_edges++;
return false;
}
I've added num_avoided_edges and for the fast case we see
Number of avoided edges: 870215056 (that's 6x the number of edges in the slow
case!)
For the slow case we have
Number of avoided edges: 1113769444
which is even more. But I think that should be the only reason the
number of edges created differs. Converging more slowly means
executing ESCAPED = *ESCAPED more often which creates edges to
escaped earlier and thus possibly triggering the above. But it
also means the 'to' solution might not yet contain ESCAPED ...
ICK. I can get it to
ipa points-to : 0.40 ( 14%) 0.03 ( 5%) 0.43 ( 12%)
1950k ( 1%)
TOTAL : 2.87 0.61 3.49
253M
when I avoid defering copies to ESCAPED itself (that makes the first
ESCAPED = *ESCAPED do more but delays increasing other solutions
to followup iterations). In fact it probably special-cases the
ESCAPED = *ESCAPED complex constraint which we also locally iterate
until convergence before visiting other constraints.
Note the initial graph tends to be very unconnected thus the order
is somewhat random but the way we compute the topological sorting
and process the worklist ensures ESCAPED = *ESCAPED is done as something
very last in the first iteration (if there are no existing edges to
ESCAPED - I think we do have those for calls though).
It does help to order the ESCAPED connected component first:
Points-to Stats:
Total vars: 108033
Non-pointer vars: 59
Statically unified vars: 20147
Dynamically unified vars: 0
Iterations: 3
Number of edges: 239596
Number of implicit edges: 215676
Points to sets created:15930
ipa points-to : 0.28 ( 10%) 0.05 ( 8%) 0.34 ( 10%)
2191k ( 1%)
TOTAL : 2.85 0.60 3.47
253M
I'm testing this approach now.