https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117922
--- Comment #17 from Christoph Müllner <cmuellner at gcc dot gnu.org> ---
I reproduced the slow-down with a recent master on a 5950X:
* no-mem-fold-offset: 4m58.226s
* mem-fold-offset: 11m19.311s (+127%)

More details from -ftime-report:
* no-mem-fold-offset: df reaching defs :   9.34 (  3%)  0 (  0%)
* mem-fold-offset:    df reaching defs : 381.40 ( 55%)  0 (  0%)

A look at the detailed time report (-ftime-report -ftime-report-details) shows:

 Time variable                          wall            GGC
 [...]
 phase opt and generate           : 682.81 ( 99%)  6175M ( 97%)
 [...]
 callgraph functions expansion    : 646.99 ( 94%)  5695M ( 89%)
 [...]
 fold mem offsets                 :   1.73 (  0%)   679k (  0%)
 `- CFG verifier                  :   2.10 (  0%)      0 (  0%)
 `- df use-def / def-use chains   :   2.32 (  0%)      0 (  0%)
 `- df reaching defs              : 370.68 ( 54%)      0 (  0%)
 `- verify RTL sharing            :   0.05 (  0%)      0 (  0%)
 [...]
 TOTAL                            : 690.06         6365M

I read this as "fold mem offsets accounts for 0% of the GGC memory", so there
is no issue with the memory footprint. To confirm this, `time -v` was used:
* no-mem-fold-offset: Maximum resident set size (kbytes): 15563684
* mem-fold-offset:    Maximum resident set size (kbytes): 15564364

I looked at the pass, and a few things could be cleaned up in the pass itself
(e.g., redundant calls). However, that won't change the observed performance.
The time-consuming part is the UD+DU DF analysis for the whole function.
Even if the pass does nothing but the analysis and then returns 0, we end up
with the same run time (confirmed by measurement).

The pass operates at BB granularity, so DF analysis of the whole function
provides more information than needed. When going through the documentation,
I came across df_set_blocks(), which I expected to reduce the problem
significantly. So, I moved the df_analyze() call into the FOR_ALL_BB_FN() loop,
right after a call to df_set_blocks(), with the intent of having only a single
block set per iteration. However, that triggered a few ICEs in DF, and once
they were bypassed, it ended up in practical non-termination (i.e., the calls
to df_analyze() do not get significantly cheaper with df_set_blocks()).

My conclusion: This can only be fixed by not using DF analysis and implementing
a pass-specific analysis instead. So far, I have not found a good solution for
this, but I haven't looked at all the suggestions in detail.

Can someone help me find what Paolo referenced as "the multiple definitions DF
problem that was introduced for fwprop in 2009"?
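
To make the idea of a pass-specific, block-local analysis more concrete, here
is a minimal toy sketch. It is not GCC code and not a proposed patch: the Insn
struct, kLiveIn, and local_use_def() are hypothetical names, and real RTL
(multiple defs per insn, partial defs, clobbers, memory) is ignored. It only
shows that use-def links restricted to a single basic block can be collected
in one linear walk, without solving reaching definitions for the whole
function.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified instruction (NOT GCC's rtx_insn):
// it writes at most one register and reads zero or more registers.
struct Insn {
  int def;                // register written, or -1 if none
  std::vector<int> uses;  // registers read
};

// Marker for a use whose definition lies outside the current block.
constexpr int kLiveIn = -1;

// For every instruction of ONE basic block, map each use to the index of
// the in-block instruction that most recently defined that register, or
// to kLiveIn if the value comes from outside the block.
std::vector<std::vector<int>> local_use_def(const std::vector<Insn>& block) {
  std::unordered_map<int, int> last_def;  // register -> index of latest def
  std::vector<std::vector<int>> result;
  result.reserve(block.size());
  for (std::size_t i = 0; i < block.size(); ++i) {
    const Insn& insn = block[i];
    assert(insn.def >= -1);  // -1 is the only valid "no def" value
    std::vector<int> defs;
    defs.reserve(insn.uses.size());
    // Resolve uses before recording this insn's own def, so that an insn
    // like "r1 = r1 + r2" links its use of r1 to the previous def.
    for (int reg : insn.uses) {
      auto it = last_def.find(reg);
      defs.push_back(it == last_def.end() ? kLiveIn : it->second);
    }
    result.push_back(std::move(defs));
    if (insn.def >= 0)
      last_def[insn.def] = static_cast<int>(i);
  }
  return result;
}
```

In this model the cost per block is linear in the number of instructions and
operands. Uses marked kLiveIn would have to be handled conservatively, which
is the trade-off against the whole-function UD+DU chains.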