On Mar 20, 2007, at 1:07 AM, Simon Brenner wrote:
I propose to implement incremental parsing in C++

Sounds like a multi-person, multi-year project.

We did something like this a while ago, called the compile server. The idea was to be able to advance through unchanged portions of code and replay the changes to compile state, so as to reduce compilation time for code we've already seen before, either because we're compiling mostly the same source or because we're using a header file more than once. It included fine-grained dependency tracking for things like macros and declarations. It could also replay state for fragments across translation units.

You can view it at branches/compile-server-branch.

Basically, a diff algorithm is run on the character or token stream,

We ran it at the character level.

producing a list of insertions, modifications, and deletions,

We only had two states: unchanged region or changed region. An unchanged region replayed the saved state; a changed region was simply compiled as normal. The idea was that people usually edit only a small number of regions. A region was defined as the lines between # linemarkers in the cpp output.
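
To make the region mechanism concrete, here's a rough sketch (my own names and structure, not the compile server's actual code) that splits the preprocessed output into regions at the linemarkers and classifies each one by hashing it against the previous compile. Positional hashing stands in for the real diff; a proper diff would keep an insertion from invalidating every region after it.

    // Sketch: split cpp output into regions at linemarkers and mark each
    // region changed or unchanged against a cached hash from the previous
    // compile. Positional hashing is a stand-in for the real diff.
    #include <sstream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Region {
        std::string text;
        bool changed;   // true => compile as normal; false => replay state
    };

    std::vector<Region> classify(
        const std::string& cpp_output,
        const std::unordered_map<std::size_t, std::size_t>& cache) {
        std::vector<Region> regions;
        std::istringstream in(cpp_output);
        std::string line, current;
        std::size_t index = 0;
        auto flush = [&] {
            if (current.empty()) return;
            std::size_t h = std::hash<std::string>{}(current);
            auto it = cache.find(index);
            bool unchanged = it != cache.end() && it->second == h;
            regions.push_back({current, !unchanged});
            current.clear();
            ++index;
        };
        while (std::getline(in, line)) {
            if (line.size() > 1 && line[0] == '#' && line[1] == ' ')
                flush();                 // a linemarker starts a new region
            current += line;
            current += '\n';
        }
        flush();
        return regions;
    }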

In this project, I wouldn't even try to attack incremental code generation, but
just run the entire middle/back end on an updated tree.

We were able to do incremental code-gen as well.

A future extension
would be to use the information about which trees have changed to perform minimal code generation. This approach obviously limits the improvement in compile time, and an important initial part of the project would be to measure the time spent in lexing, parsing and everything after parsing to see what the
potential gain would be.

We saw a 41% speed-up for SimpleText and a 110x peak speedup for <Carbon.h> and <cstdlib>. A C++ Carbon hello world was 91x faster at peak. A C hello world was the same speed. Peak speedups were 2x for C and 142x for C++.

My implementation would store the tree representation and the token stream from
a source file in the object file

We kept everything in core.

For starters, I would only consider top-level constructs (i.e. if any token in a function, type or global-scope variable has changed, the entire declaration/
definition would be reparsed),

We handled this case by gluing regions together until the start and end of each region fell at the top level.
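
Roughly like this, as a sketch; it assumes brace counting would really be done on the lexed token stream, so the naive character scan below (which miscounts braces in strings and comments) is illustration only.

    // Sketch: glue adjacent regions together until every region boundary
    // falls at brace depth 0, so a changed region never starts or ends in
    // the middle of a function, class, or other top-level construct.
    #include <string>
    #include <vector>

    struct Region {
        std::string text;
        bool changed;
    };

    // Naive scan; a real implementation would count braces on the token
    // stream so braces in strings and comments don't miscount.
    int depth_delta(const std::string& s) {
        int d = 0;
        for (char c : s) {
            if (c == '{') ++d;
            else if (c == '}') --d;
        }
        return d;
    }

    std::vector<Region> glue_to_toplevel(const std::vector<Region>& in) {
        std::vector<Region> out;
        int depth = 0;
        for (const Region& r : in) {
            if (depth > 0 && !out.empty()) {
                out.back().text += r.text;        // still inside a construct:
                out.back().changed |= r.changed;  // merge into previous region
            } else {
                out.push_back(r);
            }
            depth += depth_delta(r.text);
        }
        return out;
    }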

A number of changes in source affect more than the tree of the construct itself (inline functions, function default values, templates, etc.; see Problems below). In these cases, the initial implementation would just identify such dangerous changes, abort, and fall back to a complete recompilation of the file.

We validated state on a fine-grained basis. Such checking can be expensive, as odd as that sounds.

Would it be feasible to construct a dependency map of the tree, to handle these cases with minimal recompilation? How large would such a dependency map have
to be?

We never progressed the project far enough to reach the push-1000-projects-through-the-server stage, so we never had to worry about page faulting and running out of RAM. There was a concern, though, that in an already tight 32-bit VM space the server could run out of RAM and/or address space.

To work well, one would have to check every unique identifier used, to ensure that it was not #defined. I wanted a language extension to make that guarantee for me so that I would not have to check them all.
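
The check itself is simple in shape; a sketch, with a plain set standing in for cpplib's real macro table (the MacroTable name and this interface are my own, not GCC's):

    // Sketch: before replaying a region's saved state, verify that none of
    // the identifiers the region used has since become a macro. The set is
    // a hypothetical stand-in for cpplib's macro table.
    #include <string>
    #include <unordered_set>
    #include <vector>

    using MacroTable = std::unordered_set<std::string>;

    bool safe_to_replay(const std::vector<std::string>& idents_used,
                        const MacroTable& defined_macros) {
        for (const std::string& id : idents_used)
            if (defined_macros.count(id))
                return false;  // identifier got #defined since last compile:
        return true;           // recompile the region instead of replaying
    }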

For checking, the initial implementation should also provide for a way to compare the result of the incremental reparse against a complete recompilation.

We handled this by comparing the .s files from a compile with the server against those from a compile without it.

Some of the information that is saved and updated in the aux file or object file
is probably the same as what is saved in a GCH file.

:-)  We selected an in-core database to avoid all the associated issues.

Would incremental update of GCH files be possible/interesting?

Nice question. In software almost anything is possible. Harder to know if it is useful.

Should this all be integrated into the precompiled header framework?

Nice question.

* Changing a declaration (function arguments, default values), also affects all
uses of the same declaration.

We handled this with the fine-grained dependency tracking.
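
A sketch of the shape of that tracking (my own simplification; the real thing tracked macros as well as declarations, at a much finer grain):

    // Sketch: each region records which declarations its replayed state
    // depends on; when a declaration changes, every dependent region is
    // invalidated and recompiled rather than replayed.
    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct DepMap {
        // declaration name -> regions whose saved state depends on it
        std::unordered_map<std::string, std::unordered_set<int>> users;

        void record_use(const std::string& decl, int region) {
            users[decl].insert(region);
        }

        // Regions that must be recompiled when `decl` changes.
        std::vector<int> invalidate(const std::string& decl) {
            auto it = users.find(decl);
            if (it == users.end()) return {};
            return std::vector<int>(it->second.begin(), it->second.end());
        }
    };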

* Adding and removing a template specialization changes all uses of the template
after the declaration.

I don't think we handled this case.

* If code inside an inlined function body is changed, all (inlined) uses of the
function also change.

We handled this with the fine-grained dependency tracking.

* What other cases like these have not yet been considered?

There are about 1000 more cases like that. I can begin listing them if you want. :-) Overload sets, RTX constant pools, debugging information, exception tables, deprecation and NODE_POISONED, IDENTIFIER_TYPENAME_P...

* How much space does a serialized token stream take?

gcc -E *.c | wc -c will tell you an approximate number; you then just multiply by a constant. I wouldn't worry about it.

* How much time is actually spent in the parser?

Seems like, if you do a good job, you should be able to get a 100-200x speedup, though you're going to have to do code-gen to get it.

* Future: Incremental code generation

I didn't find this part very hard to do. Harder was getting perfect fidelity across all the language features, and getting enough people interested in the project to get it into mainline.
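
One way to organize it, as a sketch only (keying on a hash of the function's tree; run_backend and the cache shape are assumptions of mine, not how we actually did it):

    // Sketch: cache per-function output keyed by a hash of the function's
    // tree, and re-emit the cached assembly when the hash is unchanged.
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    struct CodegenCache {
        std::unordered_map<std::string,
                           std::pair<std::size_t, std::string>> cache;

        // Reuse cached assembly when the function's tree hash matches;
        // otherwise run the real back end and refresh the entry.
        std::string emit(const std::string& fn_name, std::size_t tree_hash,
                         const std::function<std::string()>& run_backend) {
            auto it = cache.find(fn_name);
            if (it != cache.end() && it->second.first == tree_hash)
                return it->second.second;          // unchanged: skip code-gen
            std::string asm_text = run_backend();  // changed: regenerate
            cache[fn_name] = {tree_hash, asm_text};
            return asm_text;
        }
    };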

I'd be happy to kibitz.
