[whopr] Design/implementation alternatives for the driver and WPA

Diego Novillo Tue, 03 Jun 2008 09:46:14 -0700

We've started working on the driver and WPA components for whopr.
These are some of our initial thoughts and implementation strategy.  I
have linked these to the WHOPR page as well.  I'm hoping we can
discuss these at the Summit BoF, so I'm posting them now to start the
discussion.


Robert, Ollie, Rafael, I hope I haven't mangled the originals too
much.  Feel free to edit the wiki pages to fix anything I missed.  I
am pasting a text version to this message to simplify replies.  The
originals are at:

Driver: http://gcc.gnu.org/wiki/whopr/driver
WPA: http://gcc.gnu.org/wiki/whopr/wpa

Thanks.  Diego.

====================================================================

= WPA Implementation =
This document outlines two approaches for implementing WPA and
discusses their pros and cons.  For a full description of WPA, see the
WHOPR design document.

== Cherry-Picking ==
Under this proposal, the WPA phase leaves its input files unmodified.
Its output is one optimization plan per input file.  LTRANS reads each
plan and its associated object file.  Then, following the plan's
instructions, it cherry-picks specific inlinable functions from other
object files.  This approach is roughly equivalent to the 1-to-1
mapping approach described in the WHOPR design document.

=== Implementation Plan: ===
 1. Disable deserialization of function bodies during WPA.
 1. Disable non-IPA_PASS optimizations during WPA.
 1. Add serialization/deserialization of inlining decisions.
 1. Modify LTRANS to cherry-pick function bodies from non-primary
files.  Until we are able to disentangle type/object dependencies,
this will likely require reading in all DECL's from those files.  Flag
non-primary functions and DECL's to prevent duplicate assembly output.
 1. Add LTRANS driver (so a single gcc invocation runs WPA followed by LTRANS).

=== Pros: ===
 1. No direct-to-ELF serialization!  That's one less feature to implement.
 1. No need to index/repackage DECL's.  We just load everything from
the cherry-picked files.
 1. Probably easier to implement than the repackaging scheme.

=== Cons: ===
 1. We'll probably need to implement repackaging later.  Several
parallel build tools, like distcc, are stateless on the remote side
and don't have access to locally-mounted network filesystems.  The
cherry-picking approach will require transmission of multiple object
files per LTRANS process invocation.  For example, if a.o uses inlined
functions from b.o, c.o, and d.o, all four files must be transmitted
to re-compile a.o.
 1. If we pursue repackaging later, LTRANS cherry-picking is throw-away code.
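
To make the transmission cost in the first con concrete, here is a minimal C sketch that computes which files a single LTRANS invocation would need under cherry-picking. The `struct pick` plan entry and the `files_to_transmit` helper are hypothetical; the actual plan format has not been designed yet.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical plan entry: one function to cherry-pick and the object
   file that holds its body.  The real plan format is still undefined. */
struct pick {
    const char *function;
    const char *object_file;
};

/* Collect the distinct files one LTRANS run needs: the primary file
   plus every file named by the plan.  Returns the number of entries
   stored in 'files', or 0 if 'max' is too small to hold them all. */
size_t files_to_transmit(const char *primary,
                         const struct pick *plan, size_t n_picks,
                         const char **files, size_t max)
{
    size_t count = 0, i, j;

    if (max == 0)
        return 0;
    files[count++] = primary;

    for (i = 0; i < n_picks; i++) {
        for (j = 0; j < count; j++)
            if (strcmp(files[j], plan[i].object_file) == 0)
                break;
        if (j < count)
            continue;               /* file is already listed */
        if (count == max)
            return 0;
        files[count++] = plan[i].object_file;
    }
    return count;
}
```

For a plan that picks functions for a.o from b.o, c.o, and d.o, the helper reports four files, which matches the example in the first con.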

== Repackaging ==
Under this proposal, WPA repackages its input files.  Each output file
consists of the contents of a primary input file plus additional
DECL's and functions required for inlining.  ELF data is output
directly so that functions don't need to be deserialized.  LTRANS
reads each output file without reference to other files.  Initially,
only inlining will be supported.  Because inlining decisions can also
be made at the LTRANS phase, IPA serialization may be deferred to
phase 2.  This is roughly equivalent to the
many-to-1/many-to-many/1-to-many mappings approach described in the
WHOPR design document.

=== Implementation Plan: ===
 1. Disable deserialization of function bodies during WPA.
 1. Disable non-IPA_PASS optimizations during WPA.
 1. Add support for outputting ELF directly.
 1. Add support for identifying and serializing subsets of DECL's
based on the collection of functions being output.  This probably
means adding a DECL index to each serialized function body.
 1. Add LTRANS driver (so a single gcc invocation runs WPA followed by LTRANS).

=== Pros: ===
 1. Closer to the approach we'll probably use in production.  Will
more easily integrate into parallel build tools while limiting excess
network transmission.
 1. Initially, we don't need to implement IPA serialization.
Repackaging implicitly allows LTRANS to perform inlining decisions
that would not otherwise be available.

=== Cons: ===
 1. Requires implementing direct-to-ELF serialization.
 1. Requires (at least partial) re-serialization of DECL's and
per-function DECL indexes.
 1. Probably harder to implement than cherry-picking.


====================================================================



= WHOPR Driver Design =

This document proposes a driver design for WHOPR based on
the linker.  Although this document focuses on gold, a similar
approach can also be implemented in GNU ld.

== Design Philosophy ==
 * The implementation provides complete transparency. Developers
should be able to take advantage of LTO without having to modify
existing build systems and/or Makefiles; all that's needed is to add
an LTO option (-flto).

 * Transparency is achieved through tight integration with the linker.
 Ideally, the linker communicates with LTO via a shared library
(plugin), eliminating any dependencies between the source bases of
linker and LTO, but other callback methods are also possible.

 * For scalability, we expect that after IPA multiple backend
invocations may/will follow. The system should be flexible enough to
accommodate existing parallel build infrastructures.

 * Debuggability - debugging IPA and post-IPA problems can be
complicated. The design offers ways to simplify the overall strategy.

=== Why in the Linker? ===
As of this writing, the pre-ld driver collect2 performs the LTO file
identification. However, this is sub-optimal. The benefits of driving
LTO from the linker are:

 * The linker performs full symbol resolution. Therefore, it will only
bring in objects that are necessary.  This can greatly reduce build
and library extraction times.

 * Several build systems use ld -r to build components and/or shared libraries.

 * The linker properly handles archives.

 * The linker knows which functions and globals are externally
referenced. [[http://llvm.org/docs/LinkTimeOptimization.html|LLVM's
IPA]] page provides an extended example on why the integration in the
linker is necessary to perform precise dead function elimination. The
same chain of arguments holds for globals. LTO needs to know about
externally referenced symbols.

 * Less work - currently, collect2 needs to fork/exec 'nm' on every
input file to determine whether it contains IR, which is not optimal.
In the new scheme, the linker will search for a particular section
(note: for ELF files, the linker traverses the section table in all
cases to find the symbol table).

== Process Structure ==
The WHOPR design document outlines three drivers, LGEN (front-end
driver), WPA (actual IPA), and LTRANS (backend / code generation).
This section describes how they call each other.

=== Front End: LGEN ===
LGEN is the independent FE driver, which produces files containing IR
and which can be invoked via any parallel build infrastructure.
Generation of IR is controlled by option {{{-flto}}}.

'''TODO''': Right now, LGEN puts a specifically named symbol in the
file to mark it as containing IR. This will change and a specifically
named section will be added instead.

=== Link: Collect2, gold, plugin ===
The link is either started with the gcc/g++ drivers (which call
collect2, which calls ld), or by calling ld (gold) directly. In the
gcc/g++ drivers ''and'' in collect2, files are still treated as
regular ELF files, nothing needs to be changed. This approach changes
the currently implemented strategy on the LTO branch. collect2
fork/exec's the linker.

The linker, upon start, examines a configuration file at a known
location relative to its own location. If this file exists, it
extracts the location of linker plugins (shared libraries) and loads
those.  A fixed set of function interfaces needs to be implemented in
the plugin; these functions are described below. One of many possible
plugins is a plugin that controls LTO.

Another way to locate a plugin would be via command-line.  This would
make it easier for two different compilers (and therefore two
different plugins) to use the same linker.

The linker performs regular symbol resolution. For each object file it
touches, it calls a specific function in the plugin (int
ldplugin_claim_file(const char *fname, size_t offset)). This
function returns 1 if it intends to claim a file (e.g. it contains
IR), and 0 if it doesn't.   The offset is used in the case of an
archive file. This way the plugin doesn't need to understand archives.
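
As an illustration of how the offset lets the plugin ignore archive structure, here is a minimal C sketch of ldplugin_claim_file. It assumes a hypothetical magic string (IR_MAGIC) at the start of each IR object; the real scheme is expected to look for a specifically named symbol or section instead, so treat the detection logic as a stand-in.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical marker, used only for this sketch.  The actual design
   identifies IR files by a named symbol (later: a named section). */
#define IR_MAGIC "WHOPR-IR"

/* Return 1 if the object starting at 'offset' inside 'fname' carries
   the marker, 0 otherwise.  For a plain object file the offset is 0;
   for an archive member it is the member's position in the archive. */
int ldplugin_claim_file(const char *fname, size_t offset)
{
    char buf[sizeof(IR_MAGIC) - 1];
    int claimed = 0;
    FILE *fp = fopen(fname, "rb");

    if (fp == NULL)
        return 0;
    if (fseek(fp, (long)offset, SEEK_SET) == 0
        && fread(buf, 1, sizeof buf, fp) == sizeof buf
        && memcmp(buf, IR_MAGIC, sizeof buf) == 0)
        claimed = 1;
    fclose(fp);
    return claimed;
}
```

Because the linker supplies the member offset, the same code path serves both standalone objects and archive members.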

The linker marks each claimed file in its internal data structures and
continues with regular symbol resolution, until all references have
been resolved.

The linker also creates a list of all externally referenced symbols
and passes these to the plugin via the function
ldplugin_add_external_symbol(const char *mangled_name).
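
On the plugin side, this callback could simply record the names for later use by WPA. The following C sketch is one possible shape, not a settled design: the growable array and the is_externally_referenced helper are assumptions made for illustration, and it keys symbols by name only, so it does not address the duplicate-name questions raised in the TODO items.

```c
#include <stdlib.h>
#include <string.h>

static char **ext_syms;             /* names reported by the linker */
static size_t ext_count, ext_cap;

/* Record one externally referenced symbol (the name is copied). */
void ldplugin_add_external_symbol(const char *mangled_name)
{
    size_t len = strlen(mangled_name) + 1;
    char *copy;

    if (ext_count == ext_cap) {
        size_t new_cap = ext_cap ? ext_cap * 2 : 16;
        char **grown = realloc(ext_syms, new_cap * sizeof *grown);
        if (grown == NULL)
            abort();
        ext_syms = grown;
        ext_cap = new_cap;
    }
    copy = malloc(len);
    if (copy == NULL)
        abort();
    memcpy(copy, mangled_name, len);
    ext_syms[ext_count++] = copy;
}

/* Hypothetical helper: would let WPA keep a symbol alive when the
   linker has reported a reference from outside the IR files. */
int is_externally_referenced(const char *mangled_name)
{
    size_t i;

    for (i = 0; i < ext_count; i++)
        if (strcmp(ext_syms[i], mangled_name) == 0)
            return 1;
    return 0;
}
```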

'''TODO''': Would it be better to pass an abstract object to
ldplugin_add_external_symbol? What should we pass to it if there are
two symbols in IL files with the same name?  One strong and one weak
for example.

At this point, the linker calls the main entry point of the plugin
(ldplugin_main(int argc, char *argv[])), passing its own arguments.
It's the plugin's responsibility to extract its related {{{-Wx,...}}}
values.
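
A rough C sketch of that extraction step follows. Since the option letter has not been chosen, the prefix is passed in as a parameter; extract_plugin_options is a hypothetical helper name, and the sketch assumes the values are comma-separated in the same way as existing -Wl,... style options.

```c
#include <stddef.h>
#include <string.h>

/* Scan argv for arguments beginning with 'prefix' and split the rest
   of each such argument at commas.  The argv strings are modified in
   place (commas become terminators).  Stores at most 'max' pointers
   in 'out' and returns how many were stored. */
size_t extract_plugin_options(int argc, char *argv[], const char *prefix,
                              char *out[], size_t max)
{
    size_t n = 0, plen = strlen(prefix);
    int i;

    for (i = 1; i < argc; i++) {
        char *p;

        if (strncmp(argv[i], prefix, plen) != 0)
            continue;               /* not a plugin option; skip it */
        p = argv[i] + plen;
        while (*p != '\0' && n < max) {
            char *comma = strchr(p, ',');

            out[n++] = p;
            if (comma == NULL)
                break;
            *comma = '\0';
            p = comma + 1;
        }
    }
    return n;
}
```

Arguments that do not match the prefix are left untouched, which fits the requirement that plugins tolerate options they do not understand.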


'''TODO''': Linker needs to understand these options. There will be a
single option 'letter' for all plugins, so plugins should be made
resilient against options they don't understand.

'''TODO''': How do we handle symbols defined in more than one file?
Should ldplugin_add_external_symbol take an abstract pointer/index into
the linker symbol table?

'''TODO''': What is passed to ldplugin_claim_file if the file is in a
.a file?

'''TODO''': Are we assuming that the files with IL contain a
normal symbol table? Should we make it possible for the plugin to call
back into the linker to add symbols? This should make it possible to
support a "full custom" file format for the IL files.

=== Plugin ===
The plugin munches the options passed to it. It already has a list of
all input object files containing IR, as well as a list of the
external references. Note, we could also pass in the list of all other
regular object files to it. Some of these files might be located in an
archive.

The plugin performs these actions:

 * It creates and manages a temporary directory for all intermediate files.

 * It manages the DEBUG facility. For example, to debug post-WPA
 problems, one needs the various outputs of WPA. In other words,
 intermediate files need to be kept. DEBUG should allow naming
 temporary directories, and control other DEBUG related behavior (e.g.
 dumping options).

 * It extracts IR object files from archives and places them in the
 tmp directory. This may be done via fork/'exec'ing 'ar x ...', or
 directly calling linker helper functions. To avoid name collision,
 every generated and/or copied object file gets a running serial
 number. This way, when two files or archives from different
 directories participate in a link, no further name collision will
 occur.

 * The plugin creates a REDO script which contains the exact command
 lines for the original link and WPA, as well as the environment as it
 was during the original build. The WPA command line contains all the
 options and the extracted IR files. REDO will also build an ld
 command line where archives are replaced with the extracted object
 files. This REDO script allows restarting WPA, and restarting the
 final link (with some magic). The redo script is essential for
 automatic triaging.

 * If automatic triaging is used to identify performance regressions,
 a subtle corner case may arise related to code layout. This will be
 addressed later.

 * The plugin constructs the command line for WPA (options + IR files)
and fork/exec's it.

 * The plugin "collects" resulting real object files and feeds them
back to the linker.
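
The running serial number used when copying or extracting object files into the tmp directory can be sketched in C as follows. The "<tmpdir>/<serial>.<basename>" layout and the make_tmp_name helper are assumptions for illustration only; the document does not fix a naming scheme.

```c
#include <stdio.h>
#include <string.h>

static unsigned next_serial;    /* running serial number for this link */

/* Build a collision-free name for 'path' inside 'tmpdir' by prefixing
   the base name with the next serial number.  Returns 0 on success,
   or -1 if the buffer is too small (no serial number is consumed). */
int make_tmp_name(char *buf, size_t size,
                  const char *tmpdir, const char *path)
{
    const char *base = strrchr(path, '/');
    int n;

    base = (base != NULL) ? base + 1 : path;
    n = snprintf(buf, size, "%s/%u.%s", tmpdir, next_serial, base);
    if (n < 0 || (size_t)n >= size)
        return -1;
    next_serial++;
    return 0;
}
```

With this scheme, two members both called util.o coming from different directories or archives end up under distinct names in the tmp directory.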

=== Inter-Procedural Optimization: WPA ===

WPA parses its command line and does its thing. It will generate 1..N
post-IPA IR files for LTRANS. Depending on the model, the post-IPA IR
files don't need to have a symbol table. Single post-IPA files or groups of
such files will be passed to LTRANS invocations. These invocations are
independent and can be parallelized. WPA will create a list containing
these file groups. For each group a list of specific command-line
options to LTRANS can be specified, as well as its designated output
file name, e.g.:

0.o base.a.threads.o  -O3
1.o base.a.walltime.o inline-candidate-1.o inline-candidate-2.o
2.o myapp.o 2.o
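
A consumer such as ltrans_ctrl would have to read this list back. Below is a minimal C sketch of parsing one line, under the assumption (inferred from the example, not specified anywhere) that the first token is the output file, tokens starting with '-' are options, and everything else is an input file. The struct and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ITEMS 16

/* One LTRANS invocation as described by a line of the control file. */
struct ltrans_group {
    const char *output;
    const char *inputs[MAX_ITEMS];
    size_t n_inputs;
    const char *options[MAX_ITEMS];
    size_t n_options;
};

/* Parse 'line' in place (it is split with strtok).  Returns 0 on
   success, -1 if the line has no output, no inputs, or too many items. */
int parse_group(char *line, struct ltrans_group *g)
{
    char *tok;

    g->output = NULL;
    g->n_inputs = 0;
    g->n_options = 0;

    for (tok = strtok(line, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        if (g->output == NULL) {
            g->output = tok;
        } else if (tok[0] == '-') {
            if (g->n_options == MAX_ITEMS)
                return -1;
            g->options[g->n_options++] = tok;
        } else {
            if (g->n_inputs == MAX_ITEMS)
                return -1;
            g->inputs[g->n_inputs++] = tok;
        }
    }
    return (g->output != NULL && g->n_inputs > 0) ? 0 : -1;
}
```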

WPA calls the parallel "LTRANS magic", which, by default, is a script
in a default location, let's call it ltrans_ctrl. Command line options
should allow specifying alternative scripts. The location of the tmp
directory, the name of the control file, as well as all original
command line options to WPA are passed to ltrans_ctrl. It is
ltrans_ctrl's role to support various existing build systems:

'''local build - parallel make'''

For local builds on multi-core machines, parallel make can be used
efficiently, as it already does process management. For this scenario,
  ltrans_ctrl may call a script ltrans_parallel_make, which

 * identifies the current platform (uname -a), finds and identifies 'make'

 * generates a Makefile

 * invokes make -s -j ''x'' -f Makefile

To customize LTO for a specific installation, ltrans_parallel_make can
be customized using the output from getconf _NPROCESSORS_ONLN to
specify parallelism ''x'' as a default, and to use an environment
variable to allow overriding. The generated Makefile might look like
this:

goal: 0.o 1.o 2.o

0.o: base.a.threads.o
	ltrans -O3 -o 0.o base.a.threads.o

1.o: base.a.walltime.o inline-candidate-1.o inline-candidate-2.o
	ltrans -o 1.o base.a.walltime.o inline-candidate-1.o inline-candidate-2.o

...

This mechanism works for regular make and gmake, for which only the
parameters need to change. One issue is that all generated files
must be visible on all build machines for the dependency mechanism to
work. This can usually be achieved by making sure the build happens on
NFS, or by introducing pseudo targets and remote copy operations in
the Makefile.
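
Generating one rule of such a Makefile is straightforward. The C sketch below writes a single rule for one file group; emit_rule is a hypothetical helper, and the "ltrans [options] -o output inputs" command shape is taken from the example Makefile above rather than from any finalized command-line interface.

```c
#include <stdio.h>

/* Write one rule:
       <output>: <inputs...>
       <TAB>ltrans [options] -o <output> <inputs...>
   'options' may be NULL or empty when the group has none. */
void emit_rule(FILE *mk, const char *output, const char *options,
               const char *const inputs[], size_t n_inputs)
{
    size_t i;

    /* Dependency line. */
    fprintf(mk, "%s:", output);
    for (i = 0; i < n_inputs; i++)
        fprintf(mk, " %s", inputs[i]);
    fputc('\n', mk);

    /* Recipe line; make requires it to begin with a tab character. */
    fputs("\tltrans", mk);
    if (options != NULL && options[0] != '\0')
        fprintf(mk, " %s", options);
    fprintf(mk, " -o %s", output);
    for (i = 0; i < n_inputs; i++)
        fprintf(mk, " %s", inputs[i]);
    fputs("\n\n", mk);
}
```

The "goal" line listing all outputs would be written once before the per-group rules.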


'''distributed build - distcc'''
TBD - but should be similar. The related files will be copied to the
build server, ltrans will be invoked there, and the resulting object
file will be copied back. As a matter of fact, if there was an LTRANS
wrapper
script for that, the Makefile infrastructure could be reused. The
wrapper script would have to:

 * for a given target, select a build server.

 * generate unique temporary name and directory on build server

 * copy involved files to this location

 * secure shell invoke ltrans with proper parameters

 * scp back the resulting real .o

 * srm tmp directory.

== Final Link - ld ==
After all real object files have been generated, these files, along
with the rest of the originally passed real object files, need to be
passed to the linker. There are a few ways to do this:

 * Call a plugin / linker interface which allows to explicitly add
 files to the linker's internal data structures. '''TODO''': Unclear
 about the consequences for linker file/code generation.

 * Restart the linker with a new command line, where all original real
 objects and the newly generated objects are passed in. There are
 subtle problems possible in terms of symbol resolution. Well - these
 problems are always there, unless a 1x1 mapping from pre- to post-IPA
 object files exists.

 * WPA could call the linker, it has all proper command line options,
 the plugin could do it, but only with difficulties, as WPA decides on
 the actual number and names of the final real .o files. The plugin
 could just pick up any object files it finds in the tmp directories,
 but this may introduce problems - in case of actual problems or
 debugging.

 * What about adding individual symbols via an API call? The linker
 will still be running during WPA. The plugin can collect the symbols
 and pass them back to the linker. With this it shouldn't be necessary
 to restart the linker. Final strategy to be determined.

== Cleanup ==
The plugin cleans up all temporary directories, unless directed not to.

== Plugin Interfaces ==
The plugin function entry points have C linkage. From linker to plugin:

// pass an object file name to the plugin.
// return 1: plugin can make use of file
// return 0: no use for plugin
int ldplugin_claim_file(const char *fname, size_t offset);

int ldplugin_claim_archive_file(const char *archive, const char *fname);

// pass external reference to plugin
void ldplugin_add_external_symbol(const char *mangled_name);


// call plugin's main entry point
// return 0 on success or an error code
int ldplugin_main(int argc, char *argv[]);


// finalize plugin's job, clean up
void ldplugin_cleanup(void);

Linker provided interfaces (from plugin to linker):

// query symbol attribute (pre-emptive, size, etc)
int ld_query_symbol_attribute(const char *symbol_name, enum ld_query query);

// after WPA, pass a real object file back to linker
void ld_add_object_file(const char *orig_object_fname, const char
*post_wpa_fname);

=== Issues ===
 * Question: How do things work in the linker if 10 files are on the
 original link line, but only 3 files come back? Can the linker be
 made to ignore the other files?

 * Question: The symbol attribute query needs to be refined - what are
 we going to query exactly?

 * Command line options: A FE file might be compiled with a special
 option, such as optimization level. Question: How is this information
 stored in the object file? How is the scenario handled where 2 IPA
 files are being compiled at different optimization levels? What does
 WPA do? There are many ways to do things - we need to decide on one.
