On Wed, May 12, 2021 at 1:19 AM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
> >
> > On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc <gcc@gcc.gnu.org> 
> > > wrote:
> > > >   I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > > compiling gimple-match.c.  Has anyone looked into splitting this
> > > > generated file into multiple files?
> > >
> > > There were threads about this in the past, yes.  There's the
> > > possibility to use LTO for this as well (also mentioned in those
> > > threads).  Note it's not easy to split in a meaningful way in
> > > genmatch.c
> >
> > But it will have to be handled at least somewhat soon: even without
> > huge parallelism (-j120 for example) building *-match.c takes longer
> > than building everything else in gcc/ together (wallclock time), and it
> > is a huge part of regstrap time (bigger than running all of the testsuite!)
>
> I would classify -j120 as "huge parallelism" ;)  Testing time still
> dominates my builds (with -j24) where bootstrap takes ~20 mins
> but testing another 40.

For me, it is around 1 hour bootstrapping and 1 hour testing.

> Is it building stage2 gimple-match.c that you are worried about?
> (it's built using the -O0 compiled stage1 compiler - but we at
> least should end up using -fno-checking for this build)

Yes.  On the machine I was using, it takes 15 minutes to compile
gimple-match.c, which dominates the whole bootstrap time.
Everything else was done in 1-3 minutes at most.
This is on an aarch64 machine with 24 cores (not threads).

Thanks,
Andrew Pinski

>
> Maybe you can do some experiments - like add
> -fno-inline-functions-called-once and change
> genmatch.c:3766 to split out single uses as well
> (should decrease function sizes).
>
> There's the option to make all functions external in
> gimple-match.c so splitting the file at arbitrary points
> will be possible (directly from genmatch), we'll need
> some internal header with all declarations then
> as well or alternatively some clever logic in
> genmatch to only externalize functions needed from
> multiple split files.
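
A rough sketch of the "split at arbitrary points" idea, assuming all
generated functions are external and can therefore go into any output file.
This is not genmatch code: the function names and size estimates are made
up, and the greedy balancing (largest function first, into the currently
smallest part) is just one plausible way to pick the split points.

```python
import heapq

def split_functions(sizes, parts):
    """Assign functions (name -> size estimate) to `parts` output files.

    Greedy balancing: take functions largest first and put each one into
    the part with the smallest running total.
    """
    heap = [(0, i) for i in range(parts)]  # (total size so far, part index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(parts)]
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return buckets

# Hypothetical function names and sizes (e.g. line counts), for illustration.
sizes = {
    "simplify_a": 900,
    "simplify_b": 700,
    "simplify_c": 400,
    "simplify_d": 300,
    "simplify_e": 100,
}
buckets = split_functions(sizes, 2)
totals = [sum(sizes[name] for name in bucket) for bucket in buckets]
print(buckets, totals)
```

Balancing by size only helps if compile time roughly tracks function size,
which is itself an assumption here.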
>
> That said - ideas to reduce the size of the generated
> code are welcome as well.
>
> There's also pattern ordering in match.pd that can
> make a difference because we're honoring
> first-match and thus have to re-start matching from
> outermost on conflicts (most of the time the actual
> order in match.pd is just random).  If you add -v
> to genmatch then you'll see
>
> /home/rguenther/src/gcc3/gcc/match.pd:6092:10 warning: failed to merge decision tree node
>    (cmp (op@3 @0 INTEGER_CST@1) INTEGER_CST@2)
>          ^
> /home/rguenther/src/gcc3/gcc/match.pd:4263:11 warning: with the following
>     (cmp (op @0 REAL_CST@1) REAL_CST@2)
>           ^
> /home/rguenther/src/gcc3/gcc/match.pd:5164:6 warning: because of the following which serves as ordering barrier
>  (eq @0 integer_onep)
>      ^
>
> that means that the simple (eq @0 integer_onep) should match after 4263
> but before 6092 (only the latter will actually match the same - the
> former has REAL_CST@2 but 5164 uses a predicate integer_onep).  This
> causes us to emit three switch (code){ case EQ_EXPR: } instead of one.
>
> There might be legitimate cases of such order constraints but most of them
> are spurious.  "Fixing" them will also make the matching process faster, but
> it's quite some legwork where moving a pattern can fix one occurrence but
> result in new others.
>
> For me building stage3 gimple-match.o (on a fully loaded system.. :/) is
>
> 95.05user 0.42system 1:35.51elapsed 99%CPU (0avgtext+0avgdata
> 929400maxresident)k
> 0inputs+0outputs (0major+393349minor)pagefaults 0swaps
>
> and when I use -Wno-error -flto=24 -flinker-output=nolto-rel -r
>
> 139.95user 1.79system 0:25.92elapsed 546%CPU (0avgtext+0avgdata
> 538852maxresident)k
> 0inputs+0outputs (0major+1139679minor)pagefaults 0swaps
>
> the issue of course is that we can't use this for the stage1 build
> (unless we detect working GCC LTO in the host compiler setup).  I suppose
> those measurements show the lower bound of what should be possible with
> splitting up the file (LTO splits to 128 pieces), so for me it's roughly
> a 4x speedup in wallclock time despite the overhead of LTO, which is
> quite noticeable.  -fno-checking also makes a dramatic difference for me.
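
For reference, the arithmetic behind those two timings can be checked with
a few lines of Python. The numbers are copied from the two `time` outputs
quoted above; the helper name is only illustrative.

```python
def elapsed_seconds(field):
    """Convert an 'M:SS.ss' elapsed field from time(1) output to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)

# Plain stage3 build of gimple-match.o versus the -flto=24 build.
serial = {"user": 95.05, "sys": 0.42, "elapsed": elapsed_seconds("1:35.51")}
lto = {"user": 139.95, "sys": 1.79, "elapsed": elapsed_seconds("0:25.92")}

# Wallclock gain from LTO partitioning, and the extra CPU time it costs.
speedup = serial["elapsed"] / lto["elapsed"]
cpu_ratio = (lto["user"] + lto["sys"]) / (serial["user"] + serial["sys"])

print(f"wallclock speedup: {speedup:.2f}x")
print(f"CPU time ratio:    {cpu_ratio:.2f}x")
```

That works out to a bit under 3.7x in wallclock time for roughly 1.5x the
CPU time, which matches the "about 4x despite LTO overhead" reading.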
>
> Richard.
>
> >
> > Segher
