Richard Biener <rguent...@suse.de> writes: > On Mon, 19 Feb 2024, Richard Sandiford wrote: > >> Richard Biener <rguent...@suse.de> writes: >> > The following tries to address the PHI insertion compile-time hog in >> > RTL fwprop observed with the PR54052 testcase where the loop computing >> > the "unfiltered" set of variables possibly needing PHI nodes for each >> > block exhibits quadratic compile-time and memory-use. >> > >> > Instead of only pruning the set of candidate regs by LR_IN in the >> > second worklist loop do this when computing "unfiltered" already. >> > >> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. >> > >> > I'll note that in PR98863 you say in comment#39 >> > >> > "Just to give an update on this: I have a patch that reduces the >> > amount of memory consumed by fwprop so that it no longer seems >> > to be outlier. However, it involves doing more bitmap operations. >> > In this testcase we have a larger number of registers that >> > seem to be live but unused across a large region of code, >> > so bitmap ANDs with the live in sets are expensive and hit >> > the worst-case O(nblocksnregisters). I'm still trying to find >> > a way of reducing the effect of that." >> > >> > suggesting that the very AND operation I'm introducing below >> > was an actual problem. It's just not very clear what testcase >> > this was on (the PR hasn't one, it just talks about WRF with LTO >> > and then some individual TUs of it). >> >> Yeah, like you say, I think this kind of AND was exactly the problem. >> If the DEF set is much smaller than the IN set, we can spend a lot of >> compile time (and cache) iterating over the leading elements of the >> IN set. So this could be trading one hog for another. >> >> Could we use some heuristic to choose between the two? If the IN set >> is "sensible", do the AND, otherwise keep it as-is? > > Not sure how, I don't think DF caches the set sizes (we could > compute them, of course). But I just made an experiment and > using DF_LR_OUT instead of DF_LR_BB_INFO->def improves compile-time > as well. So incremental ontop of the posted: > > diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc > index 9d1cd1b0365..6fd08602c1b 100644 > --- a/gcc/rtl-ssa/blocks.cc > +++ b/gcc/rtl-ssa/blocks.cc > @@ -645,12 +645,11 @@ function_info::place_phis (build_info &bi) > if (bitmap_empty_p (&frontiers[b1])) > continue; > > - bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, > b1))->def; > + bitmap b1_def = DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1)); > bitmap_iterator bmi; > unsigned int b2; > EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi) > - if (bitmap_ior_and_into (&unfiltered[b2], b1_def, > - DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2))) > + if (bitmap_ior_into (&unfiltered[b2], b1_def) > && !bitmap_empty_p (&frontiers[b2])) > // Propagate the (potential) new phi node definitions in B2. > bitmap_set_bit (worklist, b2); > > of course that's too big (including live-through), but we could > prune by DF_LR_OUT like > > diff --git a/gcc/rtl-ssa/blocks.cc b/gcc/rtl-ssa/blocks.cc > index 9d1cd1b0365..6a4dd05908f 100644 > --- a/gcc/rtl-ssa/blocks.cc > +++ b/gcc/rtl-ssa/blocks.cc > @@ -645,12 +645,13 @@ function_info::place_phis (build_info &bi) > if (bitmap_empty_p (&frontiers[b1])) > continue; > > - bitmap b1_def = &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, > b1))->def; > + auto_bitmap b1_def; > + bitmap_and (b1_def, &DF_LR_BB_INFO (BASIC_BLOCK_FOR_FN (m_fn, > b1))->def, > + DF_LR_OUT (BASIC_BLOCK_FOR_FN (m_fn, b1))); > bitmap_iterator bmi; > unsigned int b2; > EXECUTE_IF_SET_IN_BITMAP (&frontiers[b1], 0, b2, bmi) > - if (bitmap_ior_and_into (&unfiltered[b2], b1_def, > - DF_LR_IN (BASIC_BLOCK_FOR_FN (m_fn, b2))) > + if (bitmap_ior_into (&unfiltered[b2], b1_def) > && !bitmap_empty_p (&frontiers[b2])) > // Propagate the (potential) new phi node definitions in B2. > bitmap_set_bit (worklist, b2); > > so for the testcase it seems we have a lot of local defined but > not "global" used defs. > > Would that work for you or am I missing something?
I suppose that's better than the first version when a block has a large number of dominance frontiers. But I can't remember whether that was the case in PR98863. I have a feeling that I tried the above as part of the PR, since it's the obvious way of applying liveness once per block rather than once per edge. But I think it was more just sheer weight of numbers. I wonder whether we could create a new DEF bitmap on-the-fly while creating the defs and uses, restricting it to potential PHI registers. The test for potential PHI registers is constant time, and we could build the bitmaps as tree views before converting to list views. Can give it a go if that sounds OK. Thanks, Richard