Wow, lots of comments there, Mike ;-)
We saw a 41% speed-up for SimpleText, and a 110x peak speedup for <Carbon.h> and <cstdlib>. A C++ Carbon hello world was 91x faster at peak; a C hello world was the same speed. Peak speedups were 2x for C and 142x for C++.
Cool! After some measurements (-ftime-report, tee, grep, cut and perl) on a C++ codebase (about 200 kloc, ~341 files), it seems that about 50% of compilation time is spent in the parser, which limits the possible speed-up of my approach to about 2x. That wouldn't be small, but it's still nothing compared to those numbers. Additionally, about 17% is spent in "name lookup" (I'm not sure what kinds of name lookups are actually measured there, or how much of that would be eliminated by parsing incrementally), and 13% in preprocessing (which wouldn't change at all).
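To spell out the arithmetic: if parsing is ~50% of the total and an incremental reparse were essentially free, the best case is 1 / (1 - 0.5) = 2x. Even if all of the 17% name-lookup time went away as well, that would only push the bound to roughly 1 / (1 - 0.67), i.e. about 3x.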
> A number of changes in source affect more than the tree of the
> construct itself (inline functions, function default values,
> templates etc.; see Problems below). In these cases, the initial
> implementation would just identify the dangerousness of the changes,
> abort, and initiate a complete recompilation of the file.

We validated state on a fine-grained basis. Such checking can be expensive, as odd as that sounds.
My idea was to initially just check for any change that is not obviously safe, and later in the project try to determine the most important kinds of changes and handle those intelligently. That is, a change would be considered dangerous until proven safe. But I guess I might find that almost nothing at all can be done without that fine-grained checking.

Would it be possible to calculate the dependencies from the tree of a function? If so, the parser could go through all unchanged trees in their file order, check for a Changed flag in the declarations they refer to, and recompile a tree (marking it as changed) if anything it depends on has changed. This wouldn't properly handle all cases, though... A reference to a function could be annotated with its overload set, and any function added/deleted/changed could mark its overload set as changed, but that is probably only one of a load of cases. (More research certainly needed.)
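To make that a bit more concrete, here is roughly the shape I have in mind. This is only a sketch with made-up structures and names, not actual GCC tree-walking code:

/* Sketch only -- invented data structures, not real GCC trees.  */

struct decl
{
  int changed;           /* source text of this declaration changed */
  struct decl **refs;    /* declarations this one refers to */
  int n_refs;
};

static int
refers_to_changed_decl (struct decl *d)
{
  int i;
  for (i = 0; i < d->n_refs; i++)
    if (d->refs[i]->changed)
      return 1;
  return 0;
}

/* Walk the saved top-level declarations in file order; anything that
   refers to a changed declaration must itself be reparsed, and is
   marked changed so the effect propagates to later declarations.  */
void
propagate_changes (struct decl **decls, int n_decls)
{
  int i;
  for (i = 0; i < n_decls; i++)
    if (!decls[i]->changed && refers_to_changed_decl (decls[i]))
      decls[i]->changed = 1;
}

A single pass in file order only catches references to earlier declarations; forward references (and the overload-set case above) would need either extra bookkeeping or iteration to a fixed point.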
We never progressed the project far enough to scale into the "push 1000 projects through the server" stage, so we never had to worry about page faulting and running out of ram, though there was a concern that in an already tight 32-bit vm space, the server could run out of ram and/or vm.
Hmm... well, this would only be within one compilation unit, but I guess you could still have a pathological file with 2^16 templates, all interdependent, in which case you'd have 2^32 dependency links and quickly overflow most limits. Before letting this thing loose on users, there should be code to detect such a case and restart a normal compilation. But my priority would be to get the internal stuff working rather than being nice to users; in the worst case, you'd just keep going naively, crash, and let the build script retry with a non-incremental compile.
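The detection itself could be as dumb as a counter with a cut-off; again just a sketch, with an arbitrary limit that would need tuning:

/* Sketch: give up on incrementality if the dependency graph blows up,
   rather than risk exhausting a 32-bit address space.  */

#define MAX_DEP_LINKS (1u << 24)   /* arbitrary cut-off */

static unsigned long n_dep_links;

/* Returns 0 when the caller should abandon the incremental reparse
   and fall back to a normal, full compilation.  */
int
note_dependency_link (void)
{
  if (++n_dep_links > MAX_DEP_LINKS)
    return 0;
  /* ... record the link ... */
  return 1;
}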
To work well, one would have to check every unique identifier used to ensure that it was not #defined. I wanted a language extension to make the guarantee for me so that I would not have to check them all.
That specific problem would be taken care of by working on already preprocessed data. But it is similar to the problem with overloaded functions, which would also need to be considered.
> For checking, the initial implementation should also provide for a
> way to compare the result of the incremental reparse against a
> complete recompilation.

We handled this case by comparing the .s files compiled with the server against one without.
Nice and simple! No need to compare trees then.
There are about 1000 more cases like that. I can begin listing them if you want. :-) Overload sets, RTX constant pools, debugging information, exception tables, deprecation and NODE_POISONED, IDENTIFIER_TYPENAME_P...
Which of these are only of importance when doing code generation?
> * How much space does a serialized token stream take?

gcc -E *.c | wc -c will tell you an approximate number. You then just * by a constant. I wouldn't worry about it.

> * How much time is actually spent in the parser?

Seems like if you do a good job, you should be able to get a 100-200x speedup, though you're gonna have to do code-gen to get it.

> * Future: Incremental code generation

I didn't find this part very hard to do. Harder was going to be getting perfect fidelity across all the language features and getting enough people interested in the project to get it into mainline.
How good did you manage to get the compile server in terms of fidelity? I'm happy to forgo the extra gains (although significant) for now in exchange for fidelity, i.e. start with "100%" correct code generation without incrementality and then try to add more incrementality while keeping it "100%" correct, where "100%" means wherever the rest of the compiler is at. Trying to do all these things at once is probably a contributing factor to the compile server not being in the mainline. Nevertheless, I'd like to think of it as a secondary goal of the project to produce something that *could* be used to drive incremental code generation (I guess that means the changed flags and, if we generate it, the dependency information), should someone be interested in taking on such a project.
I'd be happy to kibitz.
Happily accepting advice where needed/provided! ;-)

BTW, I found the old compile server paper here: http://per.bothner.com/papers/GccSummit03/gcc-server.pdf - which was certainly a multi-person, multi-year project. I haven't found much other information on the compile server, other than that it was worked on long ago and has since stagnated.

I can certainly see quite a few similarities between this project and the compile server, but compared to it, my project proposal is basically to make the compile server simple enough to get some gains without massive compiler surgery (i.e. by disregarding code generation and preprocessing, things become simple(r), at least in theory). My hope is that by keeping it simple, it could become good enough quickly enough to be useful, whereas the compile server seemed to aim for the sun and just about never got those wings to stop melting ;-) That said, it should be useful to look at the compile server's solutions to a number of problems, especially with regard to dependency tracking; many of the problems are common to both, after all.

Anyways, how far did the compile server go? What was left to do when development stopped? And why did development stop?