On Mar 20, 2007, at 1:07 AM, Simon Brenner wrote:
I propose to implement incremental parsing in C++

Sounds like a multi-person, multi-year project.

We did something like this a while ago, called the compile server. The idea was to be able to advance through unchanged portions of code and replay the changes to compile state, so as to reduce compilation time for code we've already seen before, either because we're compiling mostly the same source or because we're using a header file more than once. It included fine-grained dependency tracking for things like macros and declarations. It could also replay state for fragments across translation units.

You can view it at branches/compile-server-branch.

Basically, a diff algorithm is run on the character or token stream,

We ran it at the character level.

producing a list of insertions, modifications, and deletions,

We only had two states: unchanged region or changed region. An unchanged region replayed the saved state; a changed region was simply compiled as normal. The idea was that people usually edit only a small number of regions. A region was defined as the lines between # linemarkers in the cpp output.
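
To make the region mechanism concrete, here's a rough sketch (my own names and structure, not the compile server's actual code) that splits the preprocessed output into regions at the linemarkers and classifies each one by hashing it against the previous compile. Positional hashing stands in for the real diff; a proper diff would keep an insertion from invalidating every region after it.

    // Sketch: split cpp output into regions at linemarkers and mark each
    // region changed or unchanged against a cached hash from the previous
    // compile. Positional hashing is a stand-in for the real diff.
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Region {
        std::string text;
        bool changed;   // true => compile as normal; false => replay state
    };

    std::vector<Region> classify(
        const std::string& cpp_output,
        const std::unordered_map<std::size_t, std::size_t>& cache) {
        std::vector<Region> regions;
        std::istringstream in(cpp_output);
        std::string line, current;
        std::size_t index = 0;
        auto flush = [&] {
            if (current.empty()) return;
            std::size_t h = std::hash<std::string>{}(current);
            auto it = cache.find(index);
            bool unchanged = it != cache.end() && it->second == h;
            regions.push_back({current, !unchanged});
            current.clear();
            ++index;
        };
        while (std::getline(in, line)) {
            if (line.size() > 1 && line[0] == '#' && line[1] == ' ')
                flush();                 // a linemarker starts a new region
            current += line;
            current += '\n';
        }
        flush();
        return regions;
    }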

In this project, I wouldn't even try to attack incremental code generation, but
just run the entire middle/back end on an updated tree.

We were able to do incremental code-gen as well.

A future extension
would be to use the information about which trees have changed to perform minimal code generation. This approach obviously limits the improvement in compile time, and an important initial part of the project would be to measure the time spent in lexing, parsing and everything after parsing to see what the
potential gain would be.

We saw a 41% speed-up for SimpleText and a 110x peak speedup for <Carbon.h> and <cstdlib>. A C++ Carbon hello world was 91x faster at peak. A C hello world was the same speed. Peak speedups were 2x for C and 142x for C++.

My implementation would store the tree representation and the token stream from
a source file in the object file

We kept everything in core.

For starters, I would only consider top-level constructs (i.e. if any token in a function, type or global-scope variable has changed, the entire declaration/
definition would be reparsed),

We handled this case by gluing regions together until the start and end of each region fell at the top level.
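
Roughly like this, as a sketch; it assumes brace counting would really be done on the lexed token stream, so the naive character scan below (which miscounts braces in strings and comments) is illustration only.

    // Sketch: glue adjacent regions together until every region boundary
    // falls at brace depth 0, so a changed region never starts or ends in
    // the middle of a function, class, or other top-level construct.
    #include <string>
    #include <vector>

    struct Region {
        std::string text;
        bool changed;
    };

    // Naive scan; a real implementation would count braces on the token
    // stream so braces in strings and comments don't miscount.
    int depth_delta(const std::string& s) {
        int d = 0;
        for (char c : s) {
            if (c == '{') ++d;
            else if (c == '}') --d;
        }
        return d;
    }

    std::vector<Region> glue_to_toplevel(const std::vector<Region>& in) {
        std::vector<Region> out;
        int depth = 0;
        for (const Region& r : in) {
            if (depth > 0 && !out.empty()) {
                out.back().text += r.text;        // still inside a construct:
                out.back().changed |= r.changed;  // merge into previous region
            } else {
                out.push_back(r);
            }
            depth += depth_delta(r.text);
        }
        return out;
    }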

A number of changes in source affect more than the tree of the construct itself (inline functions, function default values, templates, etc.; see Problems below). In these cases, the initial implementation would just identify such dangerous changes, abort, and fall back to a complete recompilation of the file.

We validated state on a fine-grained basis. Such checking can be expensive, as odd as that sounds.

Would it be feasible to construct a dependency map of the tree, to handle these cases with minimal recompilation? How large would such a dependency map have
to be?

We never progressed the project far enough to reach the push-1000-projects-through-the-server stage, so we never had to worry about page faulting and running out of RAM. There was a concern, though, that in an already tight 32-bit VM space the server could run out of RAM and/or address space.

To work well, one would have to check every unique identifier used, to ensure that it was not #defined. I wanted a language extension to make that guarantee for me so that I would not have to check them all.
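
The check itself is simple in shape; a sketch, with a plain set standing in for cpplib's real macro table (the MacroTable name and this interface are my own, not GCC's):

    // Sketch: before replaying a region's saved state, verify that none of
    // the identifiers the region used has since become a macro. The set is
    // a hypothetical stand-in for cpplib's macro table.
    #include <string>
    #include <unordered_set>
    #include <vector>

    using MacroTable = std::unordered_set<std::string>;

    bool safe_to_replay(const std::vector<std::string>& idents_used,
                        const MacroTable& defined_macros) {
        for (const std::string& id : idents_used)
            if (defined_macros.count(id))
                return false;  // identifier got #defined since last compile:
        return true;           // recompile the region instead of replaying
    }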

For checking, the initial implementation should also provide for a way to compare the result of the incremental reparse against a complete recompilation.

We handled this by comparing the .s files from a compile with the server against those from a compile without it.

Some of the information that is saved and updated in the aux file or object file
is probably the same as what is saved in a GCH file.

:-)  We selected an in-core database to avoid all the associated issues.

Would incremental update of GCH files be possible/interesting?

Nice question. In software almost anything is possible. Harder to know if it is useful.

Should this all be integrated into the precompiled header framework?

Nice question.

* Changing a declaration (function arguments, default values), also affects all
uses of the same declaration.

We handled this with the fine-grained dependency tracking.
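
A sketch of the shape of that tracking (my own simplification; the real thing tracked macros as well as declarations, at a much finer grain):

    // Sketch: each region records which declarations its replayed state
    // depends on; when a declaration changes, every dependent region is
    // invalidated and recompiled rather than replayed.
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct DepMap {
        // declaration name -> regions whose saved state depends on it
        std::unordered_map<std::string, std::unordered_set<int>> users;

        void record_use(const std::string& decl, int region) {
            users[decl].insert(region);
        }

        // Regions that must be recompiled when `decl` changes.
        std::vector<int> invalidate(const std::string& decl) {
            auto it = users.find(decl);
            if (it == users.end()) return {};
            return std::vector<int>(it->second.begin(), it->second.end());
        }
    };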

* Adding and removing a template specialization changes all uses of the template
after the declaration.

I don't think we handled this case.

* If code inside an inlined function body is changed, all (inlined) uses of the
function also change.

We handled this with the fine-grained dependency tracking.

* What other cases like these have not yet been considered?

There are about 1000 more cases like that. I can begin listing them if you want. :-) Overload sets, RTX constant pools, debugging information, exception tables, deprecation and NODE_POISONED, IDENTIFIER_TYPENAME_P...

* How much space does a serialized token stream take?

gcc -E *.c | wc -c will tell you an approximate number; you then just multiply by a constant. I wouldn't worry about it.

* How much time is actually spent in the parser?

Seems like, if you do a good job, you should be able to get a 100-200x speedup, though you're going to have to do code-gen to get it.

* Future: Incremental code generation

I didn't find this part very hard to do. Harder was getting perfect fidelity across all the language features, and getting enough people interested in the project to get it into mainline.
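
One way to organize it, as a sketch only (keying on a hash of the function's tree; run_backend and the cache shape are assumptions of mine, not how we actually did it):

    // Sketch: cache per-function output keyed by a hash of the function's
    // tree, and re-emit the cached assembly when the hash is unchanged.
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    struct CodegenCache {
        std::unordered_map<std::string,
                           std::pair<std::size_t, std::string>> cache;

        // Reuse cached assembly when the function's tree hash matches;
        // otherwise run the real back end and refresh the entry.
        std::string emit(const std::string& fn_name, std::size_t tree_hash,
                         const std::function<std::string()>& run_backend) {
            auto it = cache.find(fn_name);
            if (it != cache.end() && it->second.first == tree_hash)
                return it->second.second;          // unchanged: skip code-gen
            std::string asm_text = run_backend();  // changed: regenerate
            cache[fn_name] = {tree_hash, asm_text};
            return asm_text;
        }
    };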

I'd be happy to kibitz.
