On Wed, May 12, 2021 at 1:19 AM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Tue, May 11, 2021 at 5:01 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
> >
> > On Tue, May 04, 2021 at 10:40:38AM +0200, Richard Biener via Gcc wrote:
> > > On Mon, May 3, 2021 at 11:10 PM Andrew Pinski via Gcc <gcc@gcc.gnu.org>
> > > wrote:
> > > > I noticed my (highly, -j24) parallel build of GCC is serialized on
> > > > compiling gimple-match.c. Has anyone looked into splitting this
> > > > generated file into multiple files?
> > >
> > > There were threads about this in the past, yes. There's the
> > > possibility to use LTO for this as well (also mentioned in those
> > > threads). Note it's not easy to split in a meaningful way in
> > > genmatch.c
> >
> > But it will have to be handled at least somewhat soon: on not huge
> > parallelism (-j120 for example) building *-match.c takes longer than
> > building everything else in gcc/ together (wallclock time), and it is a
> > huge part of regstrap time (bigger than running all of the testsuite!)
>
> I would classify -j120 as "huge parallelism" ;) Testing time still
> dominates my builds (with -j24) where bootstrap takes ~20 mins
> but testing another 40.

For me, it is around 1 hour bootstrapping and 1 hour testing.

> Is it building stage2 gimple-match.c that you are worried about?
> (it's built using the -O0 compiled stage1 compiler - but we at
> least should end up using -fno-checking for this build)

Yes. On the machine I was using, it takes 15 minutes to compile
gimple-match.c, which dominates the whole bootstrap time. Everything
else finished in 1-3 minutes at most. This is on an aarch64 machine
with 24 cores (not threads).

Thanks,
Andrew Pinski

>
> Maybe you can do some experiments - like add
> -fno-inline-functions-called-once and change
> genmatch.c:3766 to split out single uses as well
> (should decrease function sizes).
>
> There's the option to make all functions external in
> gimple-match.c so splitting the file at arbitrary points
> will be possible (directly from genmatch), we'll need
> some internal header with all declarations then
> as well or alternatively some clever logic in
> genmatch to only externalize functions needed from
> multiple split files.
>
> That said - ideas to reduce the size of the generated
> code are welcome as well.
>
> There's also pattern ordering in match.pd that can
> make a difference because we're honoring
> first-match and thus have to re-start matching from
> outermost on conflicts (most of the time the actual
> order in match.pd is just random).
> If you add -v to genmatch then you'll see
>
> /home/rguenther/src/gcc3/gcc/match.pd:6092:10 warning: failed to merge decision tree node
> (cmp (op@3 @0 INTEGER_CST@1) INTEGER_CST@2)
> ^
> /home/rguenther/src/gcc3/gcc/match.pd:4263:11 warning: with the following
> (cmp (op @0 REAL_CST@1) REAL_CST@2)
> ^
> /home/rguenther/src/gcc3/gcc/match.pd:5164:6 warning: because of the following which serves as ordering barrier
> (eq @0 integer_onep)
> ^
>
> that means that the simple (eq @0 integer_onep) should match after
> 4263 but before 6092 (only the latter will actually match the same -
> the former has REAL_CST@2 but 5164 uses a predicate integer_onep).
> This causes us to emit three switch (code){ case EQ_EXPR: } instead
> of one.
>
> There might be legitimate cases of such order constraints but most of them
> are spurious. "Fixing" them will also make the matching process faster, but
> it's quite some legwork where moving a pattern can fix one occurrence but
> result in new others.
>
> For me building stage3 gimple-match.o (on a fully loaded system.. :/) is
>
> 95.05user 0.42system 1:35.51elapsed 99%CPU (0avgtext+0avgdata 929400maxresident)k
> 0inputs+0outputs (0major+393349minor)pagefaults 0swaps
>
> and when I use -Wno-error -flto=24 -flinker-output=nolto-rel -r
>
> 139.95user 1.79system 0:25.92elapsed 546%CPU (0avgtext+0avgdata 538852maxresident)k
> 0inputs+0outputs (0major+1139679minor)pagefaults 0swaps
>
> the issue of course is that we can't use this for the stage1 build
> (unless we detect working GCC LTO in the host compiler setup). I suppose
> those measures show the lower bound of what should be possible with
> splitting up the file (LTO splits to 128 pieces), so for me it's a 4x
> speedup in wallclock time despite the overhead of LTO which is quite
> noticeable. -fno-checking also makes a dramatic difference for me.
>
> Richard.
> >
> > Segher
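
As a rough cross-check of the two `time` outputs quoted above, the
wallclock and user-CPU figures can be compared directly. The sketch
below is only illustrative: the helper name parse_elapsed is
hypothetical, it assumes the elapsed field has the m:ss.cc form seen in
the quoted output, and the numbers are copied from that output rather
than measured again.

```python
# Compare the two `time` measurements quoted in the thread.
# parse_elapsed is a hypothetical helper; the figures are copied
# verbatim from the quoted output, not re-measured.

def parse_elapsed(field: str) -> float:
    """Convert an elapsed field such as '1:35.51' (m:ss.cc) to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)

# (user seconds, elapsed seconds) for the plain build and the -flto=24 build
plain_user, plain_elapsed = 95.05, parse_elapsed("1:35.51")
lto_user, lto_elapsed = 139.95, parse_elapsed("0:25.92")

# How much faster the LTO build finishes in wallclock terms
wallclock_speedup = plain_elapsed / lto_elapsed
# How much additional user CPU time the LTO build spends
cpu_overhead = lto_user / plain_user - 1.0

print(f"wallclock speedup: {wallclock_speedup:.2f}x")
print(f"extra user CPU time with LTO: {cpu_overhead:.0%}")
```

With the quoted numbers this works out to a wallclock speedup of
roughly 3.7x (in line with the "4x" estimate) at the cost of about 47%
more user CPU time, which is the LTO overhead mentioned in the quote.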