Wow, lots of comments there, Mike ;-)
We saw a 41% speed-up for SimpleText, and a 110x peak speedup for <Carbon.h> and <cstdlib>. A C++ Carbon hello world was 91x faster at peak; a C hello world was the same speed. Peak speedups were 2x for C and 142x for C++.
Cool! After some measurements (-ftime-report, tee, grep, cut and perl) on a C++ codebase (about 200 kloc, ~341 files), it seems that about 50% of compilation time is spent in the parser, which limits the possible speed-up of my approach to about 2x. That wouldn't be small, but it's still nothing compared to those numbers. Additionally, about 17% is spent in "name lookup" (I'm not sure what kinds of name lookups are actually measured there, or how much of that would be eliminated by parsing incrementally), and 13% in preprocessing (which wouldn't change at all).
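To spell out the arithmetic: if parsing is ~50% of the total and an incremental reparse were essentially free, the best case is 1 / (1 - 0.5) = 2x. Even if all of the 17% name-lookup time went away as well, that would only push the bound to roughly 1 / (1 - 0.67), i.e. about 3x.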
> A number of changes in source affect more than the tree of the
> construct itself (inline functions, function default values,
> templates etc.; see Problems below). In these cases, the initial
> implementation would just identify the dangerousness of the changes,
> abort, and initiate a complete recompilation of the file.

We validated state on a fine-grained basis. Such checking can be expensive, as odd as that sounds.
My idea was to initially just check for any change that is not obviously safe, and later in the project try to determine the most important kinds of changes and handle those intelligently. That is, a change would be considered dangerous until proven safe. But I guess I might find that almost nothing at all can be done without that fine-grained checking.

Would it be possible to calculate the dependencies from the tree of a function? If so, the parser could go through all unchanged trees in their file order, check for a Changed flag in the declarations they refer to, and recompile a tree (marking it as changed) if anything it depends on has changed. This wouldn't properly handle all cases, though... A reference to a function could be annotated with its overload set, and any function added/deleted/changed could mark its overload set as changed, but that is probably only one of a load of cases. (More research certainly needed.)
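To make that a bit more concrete, here is roughly the shape I have in mind. This is only a sketch with made-up structures and names, not actual GCC tree-walking code:

/* Sketch only -- invented data structures, not real GCC trees.  */

struct decl
{
  int changed;           /* source text of this declaration changed */
  struct decl **refs;    /* declarations this one refers to */
  int n_refs;
};

static int
refers_to_changed_decl (struct decl *d)
{
  int i;
  for (i = 0; i < d->n_refs; i++)
    if (d->refs[i]->changed)
      return 1;
  return 0;
}

/* Walk the saved top-level declarations in file order; anything that
   refers to a changed declaration must itself be reparsed, and is
   marked changed so the effect propagates to later declarations.  */
void
propagate_changes (struct decl **decls, int n_decls)
{
  int i;
  for (i = 0; i < n_decls; i++)
    if (!decls[i]->changed && refers_to_changed_decl (decls[i]))
      decls[i]->changed = 1;
}

A single pass in file order only catches references to earlier declarations; forward references (and the overload-set case above) would need either extra bookkeeping or iteration to a fixed point.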
We never progressed the project far enough to scale into the "push 1000 projects through the server" stage, so we never had to worry about page faulting and running out of ram, though there was a concern that in an already tight 32-bit vm space, the server could run out of ram and/or vm.
Hmm... well, this would only be within one compilation unit, but I guess you could still have a pathological file with 2^16 templates, all interdependent, in which case you'd have 2^32 dependency links and quickly overflow most limits. Before letting this thing loose on users, there should be code to detect such a case and restart a normal compilation. But my priority would be to get the internal stuff working rather than being nice to users; in the worst case, you'd just keep going naively, crash, and let the build script retry with a non-incremental compile.
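The detection itself could be as dumb as a counter with a cut-off; again just a sketch, with an arbitrary limit that would need tuning:

/* Sketch: give up on incrementality if the dependency graph blows up,
   rather than risk exhausting a 32-bit address space.  */

#define MAX_DEP_LINKS (1u << 24)   /* arbitrary cut-off */

static unsigned long n_dep_links;

/* Returns 0 when the caller should abandon the incremental reparse
   and fall back to a normal, full compilation.  */
int
note_dependency_link (void)
{
  if (++n_dep_links > MAX_DEP_LINKS)
    return 0;
  /* ... record the link ... */
  return 1;
}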
To work well, one would have to check every unique identifier used to ensure that it was not #defined. I wanted a language extension to make the guarantee for me so that I would not have to check them all.
That specific problem would be taken care of by working on already preprocessed data. But it is similar to the problem with overloaded functions, which would also need to be considered.
> For checking, the initial implementation should also provide for a
> way to compare the result of the incremental reparse against a
> complete recompilation.

We handled this case by comparing the .s files compiled with the server against one without.
Nice and simple! No need to compare trees then.
There are about 1000 more cases like that. I can begin listing them if you want. :-) Overload sets, RTX constant pools, debugging information, exception tables, deprecation and NODE_POISONED, IDENTIFIER_TYPENAME_P...
Which of these are only of importance when doing code generation?
> * How much space does a serialized token stream take?

gcc -E *.c | wc -c will tell you an approximate number. You then just * by a constant. I wouldn't worry about it.

> * How much time is actually spent in the parser?

Seems like if you do a good job, you should be able to get a 100-200x speedup, though you're gonna have to do code-gen to get it.

> * Future: Incremental code generation

I didn't find this part very hard to do. Harder was going to be getting perfect fidelity across all the language features and getting enough people interested in the project to get it into mainline.
How good did you manage to get the compile server in terms of fidelity? I'm happy to forgo the extra gains (although significant) for now in exchange for fidelity, i.e. start with "100%" correct code generation without incrementality and then try to add more incrementality while keeping it "100%" correct, where "100%" means wherever the rest of the compiler is at. Trying to do all these things at once is probably a contributing factor to the compile server not being in the mainline. Nevertheless, I'd like to think of it as a secondary goal of the project to produce something that *could* be used to drive incremental code generation (I guess that means the changed flags and, if we generate it, the dependency information), should someone be interested in taking on such a project.
I'd be happy to kibitz.
Happily accepting advice where needed/provided! ;-)

BTW, I found the old compile server paper here: http://per.bothner.com/papers/GccSummit03/gcc-server.pdf - which was certainly a multi-person, multi-year project. I haven't found much other information on the compile server, other than that it was worked on long ago and has since stagnated.

I can certainly see quite a few similarities between this project and the compile server, but compared to it, my project proposal is basically to make the compile server simple enough to get some gains without massive compiler surgery (i.e. by disregarding code generation and preprocessing, things become simple(r), at least in theory). My hope is that by keeping it simple, it could become good enough quickly enough to be useful, whereas the compile server seemed to aim for the sun and just about never got those wings to stop melting ;-) That said, it should be useful to look at the compile server's solutions to a number of problems, especially with regard to dependency tracking; many of the problems are common to both, after all.

Anyways, how far did the compile server go? What was left to do when development stopped? And why did development stop?