On Mar 20, 2007, at 1:07 AM, Simon Brenner wrote:
> I propose to implement incremental parsing in C++
Sounds like a multi-person, multi-year project.
We did something like this a while ago, called the compile server.
The idea was to be able to advance through unchanged portions of code
and replay the changes to compile state, so as to reduce compilation
time for code we've already seen before, either because we're
compiling mostly the same source, or because we're using a header
file more than once. It included fine-grained dependency tracking for
things like macros and declarations. It could also replay state for
fragments across translation units.
You can view it at branches/compile-server-branch.
> Basically, a diff algorithm is run on the character or token stream,
We ran it at the character level.
> producing a list of insertions, modifications, and deletions,
We only had two states: unchanged region or changed region. Unchanged
meant replay the saved state; changed meant compile just that region
as normal. The idea was that people usually only edit a small number
of regions. A region was defined as the lines between # lines in the
cpp output.
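Roughly, that classification could look like the following minimal
sketch (not the compile server's actual code; Region, classify, and
the set of previously seen region texts are hypothetical names, and a
real implementation would map each region to saved compiler state
rather than to a flag):

    #include <set>
    #include <string>
    #include <vector>

    // One region of preprocessed output: the lines between two '#'
    // line markers, plus a flag saying whether the compiler must
    // compile it as normal or may replay cached state for it.
    struct Region {
        std::string text;
        bool changed;
    };

    // Split cpp output into regions at lines starting with '#', then
    // mark a region changed unless its text is identical to a region
    // seen on the previous run.
    std::vector<Region> classify(const std::string &cpp_output,
                                 const std::set<std::string> &previous) {
        std::vector<Region> regions;
        std::string current;
        auto flush = [&]() {
            if (current.empty()) return;
            regions.push_back({current, previous.count(current) == 0});
            current.clear();
        };
        for (size_t pos = 0; pos < cpp_output.size();) {
            size_t eol = cpp_output.find('\n', pos);
            if (eol == std::string::npos) eol = cpp_output.size();
            std::string line = cpp_output.substr(pos, eol - pos);
            if (!line.empty() && line[0] == '#')
                flush();              // a '#' marker starts a new region
            current += line;
            current += '\n';
            pos = eol + 1;
        }
        flush();
        return regions;
    }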
> In this project, I wouldn't even try to attack incremental code
> generation, but just run the entire middle/back end on an updated
> tree.
We were able to do incremental code-gen as well.
> A future extension would be to use the information about which trees
> have changed to perform minimal code generation. This approach
> obviously limits the improvement in compile time, and an important
> initial part of the project would be to measure the time spent in
> lexing, parsing and everything after parsing to see what the
> potential gain would be.
We saw a 41% speedup for SimpleText, and a 110x peak speedup for
<Carbon.h> and <cstdlib>. A C++ Carbon hello world was 91x faster,
peak. A C hello world was the same speed. Peak speedups were 2x for C
and 142x for C++.
> My implementation would store the tree representation and the token
> stream from a source file in the object file
We kept everything in core.
> For starters, I would only consider top-level constructs (i.e. if
> any token in a function, type or global-scope variable has changed,
> the entire declaration/definition would be reparsed),
We handled this case by gluing regions together until both the start
and end of a region were at the toplevel.
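A minimal sketch of that gluing step, under the simplifying
assumption that "at the toplevel" can be approximated by brace depth
zero (real C++ would also have to skip braces inside strings,
character literals, and comments; glue_to_toplevel and brace_delta
are made-up names):

    #include <string>
    #include <vector>

    struct Region {          // same shape as in the earlier sketch
        std::string text;
        bool changed;
    };

    // Net brace depth of a chunk of text; a crude stand-in for real
    // toplevel detection.
    static int brace_delta(const std::string &text) {
        int d = 0;
        for (char c : text) {
            if (c == '{') ++d;
            else if (c == '}') --d;
        }
        return d;
    }

    // Glue neighboring regions together whenever a boundary falls
    // inside a construct, so every surviving region starts and ends
    // at toplevel.
    std::vector<Region> glue_to_toplevel(const std::vector<Region> &in) {
        std::vector<Region> out;
        int depth = 0;                   // brace depth at region start
        for (const Region &r : in) {
            if (depth != 0 && !out.empty()) {
                out.back().text += r.text;   // mid-construct: glue on
                out.back().changed = out.back().changed || r.changed;
            } else {
                out.push_back(r);
            }
            depth += brace_delta(r.text);
        }
        return out;
    }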
> A number of changes in source affect more than the tree of the
> construct itself (inline functions, function default values,
> templates, etc.; see Problems below). In these cases, the initial
> implementation would just identify the dangerousness of the changes,
> abort, and initiate a complete recompilation of the file.
We validated state on a fine-grained basis. Such checking can be
expensive, as odd as that sounds.
> Would it be feasible to construct a dependency map of the tree, to
> handle these cases with minimal recompilation? How large would such
> a dependency map have to be?
We never progressed the project far enough to scale into the "push
1000 projects through the server" stage, so we never had to worry
about page faulting and running out of RAM, though there was a
concern that in an already tight 32-bit VM space, the server could
run out of RAM and/or VM.

To work well, one would have to check every unique identifier used to
ensure that it was not #defined. I wanted a language extension to
make the guarantee for me, so that I would not have to check them all.
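The check itself is cheap to sketch; here is one possible shape, with
a hypothetical is_macro callback standing in for a query against the
preprocessor's macro table (this is not libcpp's real interface):

    #include <functional>
    #include <string>
    #include <vector>

    // A cached fragment is only safe to replay if no identifier it
    // uses has since been #defined: a new macro would change how the
    // saved tokens expand.
    bool safe_to_replay(
        const std::vector<std::string> &identifiers,
        const std::function<bool(const std::string &)> &is_macro) {
        for (const std::string &id : identifiers)
            if (is_macro(id))
                return false;  // identifier became a macro: recompile
        return true;
    }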
> For checking, the initial implementation should also provide for a
> way to compare the result of the incremental reparse against a
> complete recompilation.
We handled this case by comparing the .s files compiled with the
server against those compiled without it.
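That comparison is easy to sketch (same_assembly and slurp are
hypothetical helpers): compile the same source once through the
server and once cold, then demand byte-identical .s output.

    #include <fstream>
    #include <sstream>
    #include <string>

    // Read a whole file into a string.
    static std::string slurp(const char *path) {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream out;
        out << in.rdbuf();
        return out.str();
    }

    // True if the server-compiled assembly matches the cold compile.
    bool same_assembly(const char *with_server,
                       const char *without_server) {
        return slurp(with_server) == slurp(without_server);
    }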
> Some of the information that is saved and updated in the aux file or
> object file is probably the same as what is saved in a GCH file.
:-) We selected an in-core database to avoid all the associated issues.
> Would incremental update of GCH files be possible/interesting?
Nice question. In software almost anything is possible. Harder to
know if it is useful.
> Should this all be integrated into the precompiled header framework?
Nice question.
> * Changing a declaration (function arguments, default values) also
> affects all uses of the same declaration.
We did this with the fine-grained dependency tracking.
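For illustration, one possible shape for that tracking (an assumed
sketch, not the compile server's actual data structures): each
declaration records the fragments that used it, so a change to the
declaration invalidates exactly those fragments.

    #include <map>
    #include <set>
    #include <string>

    // Hypothetical fine-grained dependency map: each declaration name
    // maps to the ids of the fragments that used it.
    class DependencyMap {
        std::map<std::string, std::set<int>> users_;
    public:
        // Record that fragment 'fragment_id' used declaration 'decl'.
        void note_use(const std::string &decl, int fragment_id) {
            users_[decl].insert(fragment_id);
        }
        // The declaration changed: these fragments must be recompiled
        // rather than replayed.
        std::set<int> invalidate(const std::string &decl) const {
            auto it = users_.find(decl);
            return it == users_.end() ? std::set<int>{} : it->second;
        }
    };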
> * Adding and removing a template specialization changes all uses of
> the template after the declaration.
I don't think we handled this case.
> * If code inside an inlined function body is changed, all (inlined)
> uses of the function also change.
We did this with the fine-grained dependency tracking as well.
> * What other cases like these have not yet been considered?
There are about 1000 more cases like that. I can begin listing them
if you want. :-) Overload sets, RTX constant pools, debugging
information, exception tables, deprecation and NODE_POISONED,
IDENTIFIER_TYPENAME_P...
> * How much space does a serialized token stream take?
gcc -E *.c | wc -c will tell you an approximate number; you then just
multiply by a constant. I wouldn't worry about it.
> * How much time is actually spent in the parser?
Seems like if you do a good job, you should be able to get a 100-200x
speedup, though you're going to have to do code-gen to get it.
> * Future: Incremental code generation
I didn't find this part very hard to do. Harder was getting perfect
fidelity across all the language features, and getting enough people
interested in the project to get it into mainline.
I'd be happy to kibitz.