We've started working on the driver and WPA components for whopr. These are some of our initial thoughts and implementation strategy. I have linked these to the WHOPR page as well. I'm hoping we can discuss these at the Summit BoF, so I'm posting them now to start the discussion.
Robert, Ollie, Rafael, I hope I haven't mangled the originals too much. Feel free to edit the wiki pages to fix anything I missed.

I am pasting a text version to this message to simplify replies. The originals are at:

Driver: http://gcc.gnu.org/wiki/whopr/driver
WPA: http://gcc.gnu.org/wiki/whopr/wpa

Thanks. Diego.

====================================================================

= WPA Implementation =

This document outlines two approaches for implementing WPA and discusses their pros and cons. For a full description of WPA, see the WHOPR design document.

== Cherry-Picking ==

Under this proposal, the WPA phase leaves its input files unmodified. Its output is one optimization plan per input file. LTRANS reads each plan and its associated object file. Then, following the plan's instructions, it cherry-picks specific inlinable functions from other object files.

This approach is roughly equivalent to the 1-to-1 mapping approach described in the WHOPR design document.

=== Implementation Plan: ===

 1. Disable deserialization of function bodies during WPA.
 1. Disable non-IPA_PASS optimizations during WPA.
 1. Add serialization/deserialization of inlining decisions.
 1. Modify LTRANS to cherry-pick function bodies from non-primary files. Until we are able to disentangle type/object dependencies, this will likely require reading in all DECL's from those files. Flag non-primary functions and DECL's to prevent duplicate assembly output.
 1. Add LTRANS driver (so a single gcc invocation runs WPA followed by LTRANS).

=== Pros: ===

 1. No direct-to-ELF serialization! That's one less feature to implement.
 1. No need to index/repackage DECL's. We just load everything from the cherry-picked files.
 1. Probably easier to implement than the repackaging scheme.

=== Cons: ===

 1. We'll probably need to implement repackaging later. Several parallel build tools, like distcc, are stateless on the remote side and don't have access to locally-mounted network filesystems.
    The cherry-picking approach will require transmission of multiple object files per LTRANS process invocation. For example, if a.o uses inlined functions from b.o, c.o, and d.o, all four files must be transmitted to re-compile a.o.
 1. If we pursue repackaging later, LTRANS cherry-picking is throw-away code.

== Repackaging ==

Under this proposal, WPA repackages its input files. Each output file consists of the contents of a primary input file plus additional DECL's and functions required for inlining. ELF data is output directly so that functions don't need to be deserialized. LTRANS reads each output file without reference to other files.

Initially, only inlining will be supported. Because inlining decisions can also be made at the LTRANS phase, IPA serialization may be deferred to phase 2.

This is roughly equivalent to the many-to-1/many-to-many/1-to-many mappings approach described in the WHOPR design document.

=== Implementation Plan: ===

 1. Disable deserialization of function bodies during WPA.
 1. Disable non-IPA_PASS optimizations during WPA.
 1. Add support for outputting ELF directly.
 1. Add support for identifying and serializing subsets of DECL's based on the collection of functions being output. This probably means adding a DECL index to each serialized function body.
 1. Add LTRANS driver (so a single gcc invocation runs WPA followed by LTRANS).

=== Pros: ===

 1. Closer to the approach we'll probably use in production. Will more easily integrate into parallel build tools while limiting excess network transmission.
 1. Initially, we don't need to implement IPA serialization. Repackaging implicitly allows LTRANS to perform inlining decisions that would not otherwise be available.

=== Cons: ===

 1. Requires implementing direct-to-ELF serialization.
 1. Requires (at least partial) re-serialization of DECL's and per-function DECL indexes.
 1. Probably harder to implement than cherry-picking.
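To make the transmission cost listed under the cherry-picking cons concrete, here is a minimal sketch in C of how the set of files for one LTRANS invocation could be derived from an optimization plan. The struct plan_edge type and the files_to_transmit function are hypothetical names invented for this illustration; the proposal does not define a plan format yet.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical plan entry (not a real WHOPR structure): LTRANS for
   `primary` inlines a function whose body lives in `donor`. */
struct plan_edge {
    const char *primary;
    const char *donor;
};

/* Collect the distinct object files needed to recompile `primary`:
   the primary file itself plus every donor the plan names for it.
   Stores at most `cap` names in `out` and returns how many it stored. */
size_t files_to_transmit(const struct plan_edge *plan, size_t n_edges,
                         const char *primary,
                         const char **out, size_t cap)
{
    size_t n_out = 0;

    assert(primary != NULL && (out != NULL || cap == 0));
    if (cap == 0)
        return 0;

    out[n_out++] = primary;
    for (size_t i = 0; i < n_edges; i++) {
        int seen = 0;

        if (strcmp(plan[i].primary, primary) != 0)
            continue;                      /* edge belongs to another file */
        for (size_t j = 0; j < n_out; j++)
            if (strcmp(out[j], plan[i].donor) == 0) {
                seen = 1;                  /* donor already listed */
                break;
            }
        if (!seen && n_out < cap)
            out[n_out++] = plan[i].donor;
    }
    return n_out;
}
```

With edges a.o->b.o, a.o->c.o and a.o->d.o, the function reports four files for a.o, matching the example in the cons list. Under repackaging the same invocation would need only the single repackaged file.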
====================================================================

= WHOPR Driver Design =

This document proposes a driver design for WHOPR based on the linker. Although this document focuses on gold, a similar approach can also be implemented in GNU ld.

== Design Philosophy ==

 * The implementation provides complete transparency. Developers should be able to take advantage of LTO without having to modify existing build systems and/or Makefiles; all that's needed is to add an LTO option (-flto).
 * Transparency is achieved through tight integration with the linker. Ideally, the linker communicates with LTO via a shared library (plugin), eliminating any dependencies between the source bases of linker and LTO, but other callback methods are also possible.
 * For scalability, we expect that after IPA multiple backend invocations may/will follow. The system should be flexible enough to accommodate existing parallel build infrastructures.
 * Debuggability - debugging IPA and post-IPA problems can be complicated. The design offers ways to simplify the overall strategy.

=== Why in the Linker? ===

As of this writing, the pre-ld driver collect2 performs the LTO file identification. However, this is sub-optimal. The benefits of driving LTO from the linker are:

 * The linker performs full symbol resolution. Therefore, it will only bring in objects that are necessary. This can greatly reduce build and library extraction times.
 * Several build systems use ld -r to build components and/or shared libraries.
 * The linker properly handles archives.
 * The linker knows which functions and globals are externally referenced. [[http://llvm.org/docs/LinkTimeOptimization.html|LLVM's IPA]] page provides an extended example of why integration in the linker is necessary to perform precise dead function elimination. The same chain of arguments holds for globals. LTO needs to know about externally referenced symbols.
 * Less work - currently, collect2 needs to fork/exec 'nm' on every input file to determine whether it contains IR, which is not optimal. In the new scheme, the linker will search for a particular section (note: for ELF files, the linker traverses the section table in all cases to find the symbol table).

== Process Structure ==

The WHOPR design document outlines three drivers: LGEN (front-end driver), WPA (actual IPA), and LTRANS (backend / code generation). This section describes how they call each other.

=== Front End: LGEN ===

LGEN is the independent FE driver, which produces files containing IR and which can be invoked via any parallel build infrastructure. Generation of IR is controlled by option {{{-flto}}}.

'''TODO''': Right now, LGEN puts a specifically named symbol in the file to mark it as containing IR. This will change and a specifically named section will be added instead.

=== Link: Collect2, gold, plugin ===

The link is either started with the gcc/g++ drivers (which call collect2, which calls ld), or by calling ld (gold) directly. In the gcc/g++ drivers ''and'' in collect2, files are still treated as regular ELF files; nothing needs to be changed. This approach changes the currently implemented strategy on the LTO branch.

collect2 fork/exec's the linker. The linker, upon start, examines a configuration file at a known location relative to its own location. If this file exists, it extracts the location of linker plugins (shared libraries) and loads those. A fixed set of function interfaces needs to be implemented in the plugin; these functions are described below. One of many possible plugins is a plugin that controls LTO.

Another way to locate a plugin would be via the command line. This would make it easier for two different compilers (and therefore two different plugins) to use the same linker.

The linker performs regular symbol resolution.
For each object file it touches, it calls a specific function in the plugin (int ldplugin_claim_file(const char *fname, size_t offset)). This function returns 1 if it intends to claim a file (e.g. it contains IR), and 0 if it doesn't. The offset is used in the case of an archive file. This way the plugin doesn't need to understand archives.

The linker marks each claimed file in its internal data structures and continues with regular symbol resolution, until all references have been resolved. The linker also creates a list of all externally referenced symbols and passes these to the plugin via the function ldplugin_add_external_symbol(const char *mangled_name).

'''TODO''': Would it be better to pass an abstract object to ldplugin_add_external_symbol? What should we pass to it if there are two symbols in IL files with the same name? One strong and one weak, for example.

At this point, the linker calls the main entry point to the plugin, ldplugin_main(int argc, char *argv[]), passing its own arguments. It's the plugin's responsibility to extract its related {{{-Wx,...}}} values.

'''TODO''': Linker needs to understand these options. There will be a single option 'letter' for all plugins, so plugins should be made resilient against options they don't understand.

'''TODO''': How do we handle symbols defined in more than one file? Should ldplugin_add_external_symbol take an abstract pointer/index into the linker symbol table?

'''TODO''': What is passed to ldplugin_claim_file if the file is in a .a file?

'''TODO''': Are we assuming that the files with IL contain a normal symbol table? Should we make it possible for the plugin to call back into the linker to add symbols? This should make it possible to support a "full custom" file format for the IL files.

=== Plugin ===

The plugin parses the options passed to it. It already has a list of all input object files containing IR, as well as a list of the external references.
Note, we could also pass in the list of all other regular object files to it. Some of these files might be located in an archive.

The plugin performs these actions:

 * It creates and manages a temporary directory for all intermediate files.
 * It manages the DEBUG facility. For example, to debug post-WPA problems, one needs the various outputs of WPA. In other words, intermediate files need to be kept. DEBUG should allow naming temporary directories and controlling other DEBUG-related behavior (e.g. dumping options).
 * It extracts IR object files from archives and places them in the tmp directory. This may be done by fork/exec'ing 'ar x ...', or by directly calling linker helper functions. To avoid name collisions, every generated and/or copied object file gets a running serial number. This way, when two files or archives from different directories participate in a link, no further name collision will occur.
 * The plugin creates a REDO script which contains the exact command lines for the original link and WPA, as well as the environment as it was during the original build. The WPA command line contains all the options and the extracted IR files. REDO will also build an ld command line where archives are replaced with the extracted object files. This REDO script allows restarting WPA, and restarting the final link (with some magic). The REDO script is essential for automatic triaging.
 * If automatic triaging is used to identify performance regressions, a subtle corner case may arise related to code layout. This will be addressed later.
 * The plugin constructs the command line for WPA (options + IR files) and fork/exec's it.
 * The plugin "collects" the resulting real object files and feeds them back to the linker.

=== Inter-Procedural Optimization: WPA ===

WPA parses its command line and performs the inter-procedural analysis. It will generate 1..N post-IPA IR files for LTRANS. Depending on the model, the post-IPA IR files don't need to have a symbol table.
Single post-IPA files or groups of such files will be passed to LTRANS invocations. These invocations are independent and can be parallelized.

WPA will create a list containing these file groups. For each group a list of specific command-line options to LTRANS can be specified, as well as its designated output file name, e.g.:

    0.o base.a.threads.o -O3
    1.o base.a.walltime.o inline-candidate-1.o inline-candidate-2.o
    2.o myapp.o 2.o

WPA calls the parallel "LTRANS magic", which, by default, is a script in a default location; let's call it ltrans_ctrl. Command-line options should allow specifying alternative scripts. The location of the tmp directory, the name of the control file, as well as all original command-line options to WPA are passed to ltrans_ctrl.

It is ltrans_ctrl's role to support various existing build systems:

'''local build - parallel make'''

For local builds on multi-core machines, parallel make can be used efficiently, as it already does process management. For this scenario, ltrans_ctrl may call a script ltrans_parallel_make, which

 * identifies the current platform (uname -a), finds and identifies 'make'
 * generates a Makefile
 * invokes make -s -j ''x'' -f Makefile

To customize LTO for a specific installation, ltrans_parallel_make can use the output of getconf _NPROCESSORS_ONLN as the default parallelism ''x'', with an environment variable to allow overriding it.

The generated Makefile might look like this:

    goal: 0.o 1.o 2.o

    0.o: base.a.threads.o
            ltrans -O3 -o 0.o base.a.threads.o

    1.o: base.a.walltime.o inline-candidate-1.o inline-candidate-2.o
            ltrans -o 1.o base.a.walltime.o inline-candidate-1.o inline-candidate-2.o

    ...

This mechanism works for regular make and gmake, for which only the parameters need to change. One issue is that all generated files must be visible on all build machines for the dependency mechanism to work.
This can usually be achieved by making sure the build happens on NFS, or by introducing pseudo targets and remote copy operations in the Makefile.

'''distributed build - distcc'''

TBD - but should be similar. The related files will be copied to the build server, ltrans will be invoked there, and the resulting object file will be copied back. As a matter of fact, if there were an LTRANS wrapper script for that, the Makefile infrastructure could be reused. The wrapper script would have to:

 * for a given target, select a build server
 * generate a unique temporary name and directory on the build server
 * copy the involved files to this location
 * invoke ltrans via secure shell with the proper parameters
 * scp back the resulting real .o
 * srm the tmp directory

== Final Link - ld ==

After all real object files have been generated, these files, along with the rest of the originally passed real object files, need to be passed to the linker. There are a few ways to do this:

 * Call a plugin / linker interface which allows explicitly adding files to the linker's internal data structures. '''TODO''': Unclear about the consequences for linker file/code generation.
 * Restart the linker with a new command line, where all original real objects and the newly generated objects are passed in. There are subtle problems possible in terms of symbol resolution. Then again, these problems are always there, unless a 1x1 mapping from pre- to post-IPA object files exists.
 * WPA could call the linker, since it has all the proper command-line options. The plugin could do it too, but only with difficulty, as WPA decides on the actual number and names of the final real .o files. The plugin could just pick up any object files it finds in the tmp directories, but this may introduce problems when something goes wrong or when debugging.
 * What about adding individual symbols via an API call? The linker will still be running during WPA. The plugin can collect the symbols and pass them back to the linker.
   With this it shouldn't be necessary to restart the linker.

Final strategy to be determined.

== Cleanup ==

The plugin cleans up all temporary directories, unless directed not to.

== Plugin Interfaces ==

The plugin function entry points have C linkage.

From linker to plugin:

    // pass an object file name to the plugin.
    // return 1: plugin can make use of file
    // return 0: no use for plugin
    int ldplugin_claim_file(const char *fname, size_t offset);
    int ldplugin_claim_archive_file(const char *archive, const char *fname);

    // pass external reference to plugin
    void ldplugin_add_external_symbol(const char *mangled_name);

    // call plugin's main entry point
    // return 0 on success or an error code
    int ldplugin_main(int argc, char *argv[]);

    // finalize plugin's job, clean up
    void ldplugin_cleanup(void);

Linker provided interfaces (from plugin to linker):

    // query symbol attribute (pre-emptive, size, etc)
    int ld_query_symbol_attribute(const char *symbol_name, enum ld_query query);

    // after WPA, pass a real object file back to linker
    void ld_add_object_file(const char *orig_object_fname, const char *post_wpa_fname);

=== Issues ===

 * Question: How do things work in the linker if 10 files are on the original link line, but only 3 files come back? Can the linker be made to ignore the other files?
 * Question: The symbol attribute query needs to be refined - what are we going to query exactly?
 * Command-line options: A FE file might be compiled with a special option, such as an optimization level. Question: How is this information stored in the object file? How is the scenario handled where 2 IPA files are compiled at different optimization levels? What does WPA do? There are many ways to do things - we need to decide on one.
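As a starting point for discussing the ldplugin_main entry point listed under Plugin Interfaces, here is a minimal sketch in C of how a plugin might pick its own options out of the linker's argument list and stay resilient against ones it doesn't understand. Everything beyond the ldplugin_main signature is an assumption made for illustration: the literal "-Wx," prefix stands in for the option letter that has not been chosen yet, and the option names "keep-temps" and "tmpdir=" as well as struct lto_plugin_opts are invented. The sketch also treats each argument as carrying a single value and does not split on further commas.

```c
#include <assert.h>
#include <string.h>

/* Assumption: "-Wx," is a placeholder for the still-undecided plugin
   option prefix; "keep-temps" and "tmpdir=" are made-up option names. */
#define PLUGIN_OPT_PREFIX "-Wx,"

struct lto_plugin_opts {
    int keep_temps;       /* keep intermediate files for debugging */
    const char *tmpdir;   /* NULL means "use a default location" */
    int ignored;          /* prefixed options this plugin did not know */
};

static struct lto_plugin_opts plugin_opts;

/* Interpret one value that followed the prefix. Unknown values are only
   counted, so options meant for another plugin do no harm. */
static void parse_plugin_option(const char *value, struct lto_plugin_opts *o)
{
    if (strcmp(value, "keep-temps") == 0)
        o->keep_temps = 1;
    else if (strncmp(value, "tmpdir=", 7) == 0)
        o->tmpdir = value + 7;
    else
        o->ignored++;
}

/* Entry point with the signature from the proposal. The linker passes
   its own argv; arguments without the prefix are left alone. */
int ldplugin_main(int argc, char *argv[])
{
    size_t prefix_len = strlen(PLUGIN_OPT_PREFIX);

    assert(argv != NULL || argc == 0);
    memset(&plugin_opts, 0, sizeof plugin_opts);
    for (int i = 1; i < argc; i++)
        if (strncmp(argv[i], PLUGIN_OPT_PREFIX, prefix_len) == 0)
            parse_plugin_option(argv[i] + prefix_len, &plugin_opts);

    /* A real plugin would now build the WPA command line and run it. */
    return 0;
}
```

Counting rather than rejecting unknown options is one way to meet the resilience requirement from the TODO in the link section; whether the linker should pre-filter the arguments instead is still open.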