https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863
--- Comment #41 from richard.sandiford at arm dot com ---
"rguenther at suse dot de" <gcc-bugzi...@gcc.gnu.org> writes:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863
>
> --- Comment #40 from rguenther at suse dot de <rguenther at suse dot de> ---
> On Mon, 8 Feb 2021, rsandifo at gcc dot gnu.org wrote:
>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98863
>>
>> --- Comment #39 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot
>> gnu.org> ---
>> Just to give an update on this: I have a patch that reduces the
>> amount of memory consumed by fwprop so that it no longer seems
>> to be an outlier. However, it involves doing more bitmap operations.
>> In this testcase we have a large number of registers that
>> seem to be live but unused across a large region of code,
>> so bitmap ANDs with the live-in sets are expensive and hit
>> the worst-case O(nblocks×nregisters). I'm still trying to find
>> a way of reducing the effect of that.
>
> But isn't this what the RD problem does as well (yeah, DF shows
> up as quite compile-time expensive here), and thus all UD/DU chain
> users suffer from the same issue?

Sure, it certainly isn't specific to the RTL-SSA code :-) I just
think we can do better than my current WIP patch does.

> What I didn't explore further is re-doing the way RD numbers defs
> in the bitmaps with the idea that all defs just used inside a
> single BB are not necessary to be represented (the local problems
> take care of them). But that of course only helps if there are
> a significant number of such defs (shadowed by later defs of the same
> reg in the same BB) - which usually should be the case.

Yeah. And I think the problem here is that we have a large number
of non-local defs and uses. It doesn't look like there are an
excessive number of uses per def, just that the defs are live
across a large region before being used.
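[Not part of the original comment: a minimal model, not GCC's actual bitmap code, of why ANDing against every block's live-in set costs O(nblocks×nregisters) when many registers stay live-but-unused across a large region. Bitmaps are modeled as Python ints; the function name and word-counting heuristic are hypothetical.]

```python
def live_in_and_cost(live_in_sets, interesting_regs):
    """AND 'interesting_regs' against every block's live-in bitmap
    (bitmaps modeled as Python ints) and count 64-bit words touched."""
    words_touched = 0
    results = []
    for live_in in live_in_sets:
        # Each AND walks every word of the wider bitmap, even when
        # the intersection rarely removes anything.
        width = max(live_in.bit_length(), interesting_regs.bit_length())
        words_touched += width // 64 + 1
        results.append(live_in & interesting_regs)
    return results, words_touched

# Worst case from the comment: 1000 blocks, 10000 registers all live
# throughout, so every per-block AND is full-width.
nregs = 10_000
all_live = (1 << nregs) - 1
_, cost = live_in_and_cost([all_live] * 1000, all_live)
```

With sparse live-in sets the same loop touches far fewer words, which is why the cost only blows up when defs stay live (but unused) across most of the function.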
> There's extra overhead for re-numbering things of course (but my hope
> was to make the RD problem fit in the cache for a nice speedup...)

Has anyone looked into how we end up in this situation for this
testcase? E.g. did we make bad inlining decisions? Or is it just
a natural consequence of the way the source is written?

We should cope with the situation better regardless, but since
extreme cases like this tend to trigger --param limits, it would
be good to avoid getting into the situation too. :-)

FWIW, as far as compile-time goes, the outlier in a release build
seems to be do_rpo_vn.
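[Not part of the original comment: a sketch of the renumbering idea quoted above, i.e. giving global RD bitmap slots only to defs that survive their block. Any def shadowed by a later def of the same register in the same BB is handled by the local problem and needs no slot. The function name and list-of-registers encoding are hypothetical.]

```python
def nonlocal_def_positions(block_defs):
    """Given the sequence of registers defined in one basic block,
    return the positions of defs that reach the block exit, i.e. the
    last def of each register. Earlier defs of the same register are
    shadowed locally and need no global RD bitmap slot."""
    last_def = {}
    for pos, reg in enumerate(block_defs):
        last_def[reg] = pos  # later defs of 'reg' overwrite earlier ones
    return sorted(last_def.values())
```

As the quoted text notes, this only shrinks the bitmaps when blocks actually contain many shadowed defs; in this testcase the defs are mostly non-local, so the saving would be small.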