Hi, since it seems that we are getting ready with LTO, I think it is time to write an updated design document for the callgraph and the implementation of the whole program optimizer as outlined in the WHOPR document http://gcc.gnu.org/wiki/whopr As usual, all comments or ideas are welcome ;)
Basic organization
==================

As described in the WHOPR document, to make the link-time optimizer scale to whole programs of reasonable size, we need to:

A) Push as much work as possible into the compilation stage, including preparing "summary" information for interprocedural optimization, and write it to disk.
B) At link time, make all interprocedural optimization decisions based solely on the callgraph, the variable pool and the summaries present in object files, without actually needing to examine the function bodies.  We have to expect that all function bodies together won't fit in memory, or would take too long to load.
C) The final compilation should apply the decisions of interprocedural optimization and do local optimization; it should be possible to distribute it.

So every interprocedural optimization pass should be split into 3 phases:

1) local analysis producing a simple and small function summary based on the function body analyzed alone.
2) global execution that, based on the function summaries, callgraph and variable pool, makes the optimization decisions, notes them in the callgraph and varpool, and possibly updates the local information of later IPA passes.  The optimization summaries produced should again be small, simple and possible to save to disk.
3) local modification that takes the function body or variable initializer and applies changes based on the optimization summary produced in 2).

This is just a rough outline: in the real implementation we will probably want to do some limited interprocedural optimization at the compilation stage (especially early inlining, which enables a lot more optimization in C++), and we will probably also want to do some late interprocedural optimization on fragments of the whole program, such as IPA register allocation, that can't use a summary produced so early in the optimization pipeline.
Current implementation status and short term plans
==================================================

The IPA organized into separate analyze/execute/modify passes was part of the original cgraph design, outlined in the 2004 GCC summit paper http://www.gccsummit.org/2004/2004-GCC-Summit-Proceedings.pdf The inliner was organized this way; later passes were not, and enforcing this organization was difficult given that link-time optimization was not in sight http://gcc.gnu.org/ml/gcc/2004-11/msg00694.html so the code was partly removed from mainline.  Since the inliner is one of the more interesting passes from this point of view, reorganizing the rest of the optimizer should not be that difficult at all.

I planned to open an IPA branch and do the transition back to this organization, however it now seems better done partly at stage1 directly on mainline (most changes were prototyped on the IPA branch and worked well) and partly on the LTO branch, which now offers a way to actually test the whole framework.  This way the amount of changes on the LTO branch remains localized and we won't run into a conflict where the IPA branch would need to be based on the LTO branch to progress, and vice versa.  The plan would be to update the pass manager first and transition the main IPA passes (inliner, constant propagation, alias analysis) on mainline.  The more advanced IPA stuff, such as struct-reorg, can go later, since it is not a showstopper for the first incarnation of the whole program optimizer anyway.  On the LTO branch we can do the changes related to memory management and overall pass queue organization.

Basic datastructure organization
================================

The callgraph and variable pool nodes are now split into the generic parts (such as callgraph edges) and the per-optimization summaries.  Those should all be split into local and global parts:
- a local part containing the function summary computed by stage A,
- a global part containing the result of the computation.
Temporary storage needed by the pass itself should be allocated in an on-the-side array indexed by cgraph UIDs, or in the aux pointers.  The local and global data are available via the cgraph_local_info / cgraph_global_info accessors, since the plan is to un-embed them from the cgraph nodes once the memory consumption justifies it, and the accessor functions also perform sanity checking.  So the implementation plan here would simply be to enforce this split for all passes that were not implemented exactly this way.

Basic passmanager changes
=========================

At the moment the pass manager operates in two modes:
1) IPA mode for toplevel passes: those passes are executed in sequence once per compilation.
2) local mode for nested passes: those passes are executed in sequence (that is, all nested passes in sequence up to the next IPA pass) in topological order of the callgraph on each function.

I would like to (re)-add the following hooks:

/* For IPA pass only.  Analyze the given function body and produce summary
   information.  */
void analyze_function (struct cgraph_node *);
/* For IPA pass only.  Analyze the given variable initializer and produce
   summary information.  */
void analyze_variable (struct varpool_node *);
/* For IPA pass only.  Read summary from disk.  */
void read_function_local_info (struct cgraph_node *, whatever parameters needed by LTO implementation);
void read_variable_local_info (struct varpool_node *, whatever parameters needed by LTO implementation);
/* For IPA pass only.  Apply changes to the given function body.  */
void modify_function (struct cgraph_node *);
/* For IPA pass only.  Apply changes to the given variable body.  */
void modify_body (struct varpool_node *);
/* For IPA pass only.  Write summary to disk.
*/
void write_function_local_info (struct cgraph_node *, whatever parameters needed by LTO implementation);
void write_variable_local_info (struct varpool_node *, whatever parameters needed by LTO implementation);

For implementing stage C of the WHOPR document (i.e. being able to produce .o files with the decisions from global optimization in them), we would also need two extra hooks for reading and writing ipa_optimization_info, but I would leave this out for now.

I would propose doing this change along with killing the RTL dump letter fields, since the most annoying part of the change is actually updating all the initializers of all GCC passes by hand.  BTW, what about, instead of adding 8 NULL fields to each initializer, adding simple macros like

IPA_PASS (analyze_fun, analyze_var, write_fun, write_var, read_fun, read_var, execute, modify_fun, modify_var)
LOCAL_PASS (execute_fun)
RTL_PASS (execute_fun)

so we don't have to go over it all again?  I would be happy to do the non-macroized change however.

With these extra hooks, the pass manager queue could be organized as follows:

all_lowering_passes: executed per function as done now.
all_early_ipa_passes: queue consisting of IPA passes with only the execute hook set.  Here we will do things like early inlining, early optimization and similar passes.
all_interunit_ipa_passes: IPA passes with the analyze/execute/modify hooks.  The pass manager will execute them after early_ipa_passes and will call all the analyze hooks first.  These are followed either by the write hooks and exit, or by the execute hooks and finally the modify hooks, depending on whether we do LTO or not.
all_late_ipa_passes: if we opt for having a late small IPA optimizer, we can put passes here.  Probably not in the initial implementation.
all_passes: local optimization passes as we do now, executed in topological order.  This can be a subpass of the last pass of all_interunit_ipa_passes too.

With LTO link-time optimization the queue will start with all_interunit_ipa_passes using the read hooks, followed by the execute and modify hooks.
Memory management of function bodies
====================================

On the LTO branch I would like to use cgraph_function_body_needed to bring a function body into memory on demand, and cgraph_function_body_not_needed to schedule a function body for removal from memory (probably done lazily at the next GGC cycle).  The functions do simple reference counting: the idea is that at the last stage, passes like inlining can simply ask for a body as needed, and can do so several times, for example when inlining recursively.

Finally there is the function cgraph_function_body_modified, which locks the body in memory so that it is not re-read from disk; it is removed only after the function body is compiled to disk or declared no longer needed.  This is in particular used by the pass manager when doing the final local optimization, to keep the function currently being worked on in memory.  However, it is also intended to be used by passes that during their execute stage need to produce new functions or modify existing ones in complicated ways.  Basic locking/unlocking will thus be driven by the pass manager for local passes, while passes needing additional bodies, such as the inliner, will do their own requests.

Interaction of optimization passes and virtual clones
=====================================================

Now to the interesting part.  Since all function bodies are analyzed first by all passes, the passes, while designed to execute in sequence, actually all execute in parallel.  This means that a pass deciding to do a particular optimization at its execution stage is responsible for updating the summaries of all later passes so they are not confused by out-of-date information.  It would be nice if the passes were actually able to derive benefits from earlier analysis: for example, if constant propagation modifies the function so that some indirect call becomes direct, we want the inliner to be able to decide to inline it.  This is a must for devirtualization.
It seems that all the IPA passes we think about fall into these categories:
1) Purely analyzing passes not modifying the functions (such as alias analysis or pure/const discovery).  This is the easy category.
2) Passes doing function cloning based on IPA analysis (inlining, IPCP).
3) Passes doing simple replacement (such as discovery of constant static vars).
4) More complex stuff (such as struct-reorg, modifying the datastructures and the algorithms in functions to match the changes).

I think we can have a few generic ways to perform the changes we want:

Virtual clones
==============

Fortunately most of the basic passes we plan can be summarized as function cloning: either the passes actually do cloning, like IPCP, or they can be seen that way.  Inlining, for example, can be seen as a pass doing a function clone that ends up in the body of the other function.

The callgraph implements so-called virtual clones.  Real clones, as currently done by IPCP, are implemented as ordinary functions from the cgraph point of view: the function body is fully copied, a new declaration is created, and all call statements in the program are rewritten.  A virtual clone consists of a callgraph node pointing to the declaration of a different function body, plus a description of how to produce the function from it.  So creating a clone implies only the creation of a new cgraph node and the redirection of edges, without touching the statements themselves.

The inlining decisions are currently represented as an inline plan: when a function is decided to be inlined, a new virtual clone is built and placed in the callgraph.  All inline clones have just one caller, and if the offline function is not needed, the last inline clone replaces it.  After removal of the analyze/modify hooks from the pass manager, the inliner was split into 3 passes (pass_inline_parameters, pass_inline and pass_apply_inline) doing the analyze/execute/modify sequence.  This helps to keep memory usage down by not constructing all inlined function bodies in memory at once.
The clones seem like a good abstraction, since they are quite transparent to most IPA passes: by copying the function summary over to the clone we basically update the summary, and the execute portion of a pass will actually get the same results as if the summaries of the caller and the inlined callee were somehow merged.  This can all be done without any knowledge of what the other passes are doing.  So the plan is to turn IPCP and other passes from doing real cloning into the same virtual cloning.

Late introduction of functions
==============================

In some other cases it seems that IPA passes need a means of introducing new functions late.  This is currently done by profiling (which introduces a constructor), for functions calling constructors, or by OpenMP.  All function bodies now transit through the following stages:

NEW: the function is around but nothing is done about it
Finalized: the function body is built and passed from the front end to the callgraph
Lowered: the function body is gimplified and brought into low GIMPLE via all_lowering_passes in the pass manager
Analyzed: callgraph edges are built
Locally optimized: early optimization passes are run and the function is in SSA form
Compiled: rest_of_compilation (all passes) is run

As the compilation transits from parsing to the cgraph_optimize stages, and later as the IPA pass queue is executed, all the function bodies are brought to the appropriate stages in parallel.  New functions are irritating, since the pass would be required to bring them to the proper stage, which is a bit difficult.  This is handled by cgraph_add_new_function, which basically schedules the function for addition; at the end of each IPA pass it is brought from its current stage (whatever it is) to the same stage as the other functions.  This can be extended to IPA passes too: if an IPA pass calls cgraph_add_new_function, we can schedule all the analyze hooks so that later passes will see the function as if it had been in the callgraph forever.
The function will need to be locked in memory, and thus it would be the responsibility of the IPA pass not to build too many functions this way.  This is however consistent with the idea that IPA passes should not grow the program too much anyway.

Reanalysis of function
======================

Similarly to how a new function can be added late, we may decide to modify a function early in a complicated way and lock it in memory.  This can be done by running all the modify hooks of the earlier passes on the function in question, applying the changes, and re-calling the analyze hooks of all later passes, noting in the cgraph_node that this was done, so that at the end of compilation only the later modify hooks will be executed.  This way the later passes will see the results of the earlier pass.  All we need to ensure is that the analyze hook of a given pass can be executed multiple times on each function, which is easy to maintain.  Again this implies locking the function in memory, and it should not be used in passes that can do a lot of changes, which can be better dealt with differently, but it seems like a good solution to me for some specific situations.

The rest
========

I don't have many plans for the rest: I guess we need to handle them case by case, and probably either implement some knowledge of all the other passes into the more difficult transformations, or not do the more difficult transformations at this level of IPA at all, using the "small IPA" plan described earlier.  For all current passes with the exception of struct-reorg it is quite clear we won't need this.  For struct-reorg we can probably just apply the reanalysis idea if we don't do too large scale changes.

Late IPA passes
===============

One idea is that, because the large IPA passes have the significant pass ordering difficulties outlined above, and because the result of IPA optimization might simplify the program significantly so that more IPA is possible, it seems reasonable to think about late IPA passes.
Those would operate just on a portion of the program that fits in memory, based on some region choice done by the big IPA passes, and would do things like:
1) local optimization of the function bodies after the results of the big IPA passes were applied
2) more precise alias analysis/inlining/whatever we find works better
3) IPA register allocation and other stuff that needs information at a very low (RTL) level to operate effectively

That's about it.  I would welcome all comments, and if it seems useful I can turn this into a wiki page, adding details to the current implementation plan on the wiki.

Honza