[Bug rtl-optimization/117922] [15 Regression] 1000% compilation time slow down on the testcase from pr26854

cmuellner at gcc dot gnu.org via Gcc-bugs Wed, 29 Jan 2025 00:43:10 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922


--- Comment #17 from Christoph Müllner <cmuellner at gcc dot gnu.org> ---
I reproduced the slow-down with a recent master on a 5950X:
* no-mem-fold-offset: 4m58.226s
* mem-fold-offset: 11m19.311s (+127%)
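
For anyone who wants to redo the arithmetic on their own numbers, here is a minimal sketch that converts the two wall-clock readings above into seconds and derives the relative slow-down. The helper name `to_seconds` is made up for illustration; only the two timings come from the measurement.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a `time`-style 'XmY.YYYs' reading into plain seconds."""
    return minutes * 60 + seconds


# Wall-clock readings quoted in the list above.
baseline = to_seconds(4, 58.226)    # no-mem-fold-offset
with_pass = to_seconds(11, 19.311)  # mem-fold-offset

# Relative slow-down of the second run compared to the first, in percent.
slowdown_pct = (with_pass / baseline - 1.0) * 100.0

print(f"baseline={baseline:.3f}s with_pass={with_pass:.3f}s "
      f"slowdown=+{slowdown_pct:.1f}%")
```

The result agrees with the percentage quoted above up to rounding.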

More details from -ftime-report:
* no-mem-fold-offset: df reaching defs   :   9.34 (  3%)     0  (  0%)
* mem-fold-offset: df reaching defs      : 381.40 ( 55%)     0  (  0%)

A look at the detailed time report (-ftime-report -ftime-report-details) shows:

Time variable                                  wall           GGC
[...]
 phase opt and generate             : 682.81 ( 99%)  6175M ( 97%)
 [...]
 callgraph functions expansion      : 646.99 ( 94%)  5695M ( 89%)
[...]
 fold mem offsets                   :   1.73 (  0%)   679k (  0%)
 `- CFG verifier                    :   2.10 (  0%)     0  (  0%)
 `- df use-def / def-use chains     :   2.32 (  0%)     0  (  0%)
 `- df reaching defs                : 370.68 ( 54%)     0  (  0%)
 `- verify RTL sharing              :   0.05 (  0%)     0  (  0%)
[...]
 TOTAL                              : 690.06         6365M

I read this as "fold mem offset utilizes 0% of memory", so there is no issue
with the memory footprint.

To confirm this, `time -v` was used:
* no-mem-fold-offset: Maximum resident set size (kbytes): 15563684
* mem-fold-offset: Maximum resident set size (kbytes): 15564364

I looked at the pass, and a few things could be cleaned up in the pass itself
(e.g., redundant calls). However, that won't change anything in the observed
performance.
The time-consuming part is the UD+DU DF analysis of the whole function.
Even if the pass does a "return 0" right after running nothing but the analysis,
we end up with the same run time (confirmed by measurement).

The pass operates on BB-granularity, so DF analysis of the whole function
provides more information than needed. When going through the documentation, I
came across df_set_blocks(), which I expected to reduce the problem
significantly.
So, I moved the df_analyze() call into the FOR_ALL_BB_FN() loop, right after a
call to df_set_blocks(), with the intent of having only a single block set per
iteration.
However, that triggered a few ICEs in DF, and once they were bypassed, it ended
up in practical non-termination (i.e., df_set_blocks() does not make the calls
to df_analyze() significantly cheaper).

My conclusion:
This can only be fixed by not using DF analysis and implementing a
pass-specific analysis.
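
To illustrate what "pass-specific analysis" at BB granularity could mean, here is a toy sketch of a block-local use-def scan. It is only a model under stated assumptions: the instruction representation (a destination register plus a list of source registers) and the function `local_use_def` are hypothetical, it does not use GCC's DF API or RTL, and it is not a proposed patch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy IR: an instruction is (destination register, source registers).
Insn = Tuple[str, List[str]]


def local_use_def(block: List[Insn]) -> List[Dict[str, Optional[int]]]:
    """For each instruction, map every source register to the index of its
    most recent definition inside the same block, or None if the value
    reaches the use from outside the block."""
    last_def: Dict[str, int] = {}
    result: List[Dict[str, Optional[int]]] = []
    for idx, (dest, srcs) in enumerate(block):
        # Resolve the uses first, so an insn like r1 = r1 + r2 sees the old r1.
        result.append({src: last_def.get(src) for src in srcs})
        # Only then record this instruction as the latest definition of dest.
        last_def[dest] = idx
    return result


block = [
    ("r1", ["sp"]),        # 0: r1 = sp + const
    ("r2", ["r1"]),        # 1: r2 = load [r1 + off]
    ("r1", ["r1", "r2"]),  # 2: r1 = r1 + r2
    ("r3", ["r1"]),        # 3: r3 = load [r1]
]
chains = local_use_def(block)
```

One forward walk per block is linear in the number of instructions and never looks at other blocks. The price is that any use whose definition lies outside the block (`None` here) has to be treated conservatively.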

So far, I have not found a good solution for this. But I haven't looked at all
the suggestions in detail. Can someone help me find what Paolo referenced as
"the multiple definitions DF problem that was introduced for fwprop in 2009"?
