Hi,
> Diego and Honza,
> 
> Diego asked on IRC whether we were planning to be able to serialize out
> all of the combined declarations.
> My response was that we could certainly use the machinery that we
> currently have to do this.
> 
> However, I believe that Diego's motivation is somewhat flawed, as are,
> for that matter, the plans in the WHOPR document.
> 
> I think that it is clear that for LTO to be successful, we are going to
> have to distribute the computation among either multiple processors in a
> single box or multiple machines scattered throughout a network.  However,
> there are a lot of ways to skin this cat, and the WHOPR way appears
> quite complex.  I know that the Google people are not scared of writing a
> lot of code, but in truth we can get what is likely to be a good
> solution in less than 1000 lines of code, and I believe that that is
> what we should shoot for first.
> 
> Honza and I are a few days away from being able to read in the cgraph
> from each of the .o files, analyze the cgraph, and then load the
> functions from those same .o files, AS NECESSARY, for computation.
> 
> It seems that there are two small pieces of code that need to be written
> to make a distributed LTO:
> 
> 1) A driver that knows what computational resources are available.
> 
> Say you have N processors.  All that this driver has to do is spawn N
> copies of the command line to the N different processors.  Each command
> line differs in only one option: the index of that processor in the
> group of resources, i.e. "-flto-index n" for n = 0 ... N-1.  If you are
> using an MP, this is done with threads; if on a network, it is done with
> ssh commands.
> 
> It will also need to make sure the .o file produced has the n encoded in
> it somehow so that things can be put back together later by the linker.
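Just to put a size on it, such a driver could be little more than the
sketch below.  The -flto-index option, the host list and the way the
command is built are of course made up for the example; nothing like this
exists yet.

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical list of available compile servers.  */
static const char *hosts[] = { "node0", "node1", "node2", "node3" };
#define N (sizeof (hosts) / sizeof (hosts[0]))

int
main (int argc, char **argv)
{
  char cmd[4096];
  size_t n;
  int i;

  for (n = 0; n < N; n++)
    {
      /* Every processor gets the same command line; only -flto-index and
         the name of the .o file differ, so the linker can later put the
         pieces back together.  */
      snprintf (cmd, sizeof (cmd),
                "gcc -flto-index %zu -o lto-part-%zu.o", n, n);
      for (i = 1; i < argc; i++)
        {
          strncat (cmd, " ", sizeof (cmd) - strlen (cmd) - 1);
          strncat (cmd, argv[i], sizeof (cmd) - strlen (cmd) - 1);
        }

      if (fork () == 0)
        {
          /* On a network the command is shipped with ssh; on a single MP
             box this would be a thread or a plain local fork instead.  */
          execlp ("ssh", "ssh", hosts[n], cmd, (char *) NULL);
          _exit (127);
        }
    }

  /* Wait for all N compilations before handing things to the linker.  */
  while (wait (NULL) > 0)
    ;
  return 0;
}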

I might be somewhat biased, since my mental model here (described a bit in
the original cgraph document) has always been to simply produce a
separate LTO .o file for each function (at least in an initial
implementation) and compile it with the usual distribution tools (make -j
or icecream, perhaps).  I can see that for reasonable performance it would
make sense to split it more coarsely and somehow balance the number of
object files shipped against the fact that compilation time is quite
unpredictable.

The problem is that the .o files will have quite a large amount of
redundancy: symbols and types that overlap, as well as summaries from IPA
analysis, such as alias analysis for the types in question, functions to
inline and such.  There is also another problem ....
> 
> 2) The cgraph analyzer, which is the part that figures out what to
> compile next, needs to do two additional tasks:
> 
> a) Sort the functions by estimated size (after inlining).  This will
> be the order that the functions will be compiled in.  This sort needs to
> be stable and deterministic (both easy constraints).

... at the moment the cgraph code sorts things to be compiled in
topological order.  This order allows low-level optimizers to propagate
info from caller to callee.  At the moment only a few simple bits are
propagated (such as stack frame requirements), but I believe this is a
reasonable way to get simple interprocedural register allocation.

This pass is especially irritating because of two properties:
1) it is a deeply RTL-level pass
2) it is important, so I don't think we should leave it out completely
(one can come up with other cases of IPA optimizations that don't fit the
framework)

so I see a chance in trying to make clusters of the callgraph of
reasonable sizes based on estimated call frequencies (a similar thing is
used for program layout: you want functions that call each other often
near each other).  This partitioned program will get this low-level IPA
propagation (and perhaps some IPA optimizations we won't fit into the big
IPA framework) locally.
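
Very roughly, I imagine the clustering as a greedy merge over the hottest
call edges, something like the sketch below (the data structures and the
size limit are invented for the example; this is not existing cgraph
code):

#include <stdlib.h>

#define MAX_FUNCS 1024

struct edge { int caller, callee, freq; };

static int parent[MAX_FUNCS];
static int cluster_size[MAX_FUNCS];

/* Union-find with path halving; find (f) names the cluster of f.  */
static int
find (int x)
{
  while (parent[x] != x)
    x = parent[x] = parent[parent[x]];
  return x;
}

/* Sort call edges by estimated frequency, hottest first.  */
static int
edge_freq_cmp (const void *a, const void *b)
{
  const struct edge *ea = a, *eb = b;

  if (ea->freq != eb->freq)
    return ea->freq > eb->freq ? -1 : 1;
  return 0;
}

/* Partition N_FUNCS functions, with estimated sizes in SIZE[], into
   clusters of at most MAX_CLUSTER_SIZE, merging across the hottest call
   edges first (the same idea the function-layout heuristics use to keep
   hot caller/callee pairs together).  */
void
cluster_callgraph (struct edge *e, int n_edges,
                   const int *size, int n_funcs, int max_cluster_size)
{
  int i;

  for (i = 0; i < n_funcs; i++)
    {
      parent[i] = i;
      cluster_size[i] = size[i];
    }

  qsort (e, n_edges, sizeof (*e), edge_freq_cmp);

  for (i = 0; i < n_edges; i++)
    {
      int a = find (e[i].caller), b = find (e[i].callee);

      if (a != b && cluster_size[a] + cluster_size[b] <= max_cluster_size)
        {
          parent[b] = a;
          cluster_size[a] += cluster_size[b];
        }
    }
  /* Each resulting cluster becomes one unit of work for a node, and the
     low-level IPA propagation happens within it.  */
}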

This, naturally, has a number of irritating implications, such as that
more nodes or better balancing would imply worse code.  An alternative
would be to insist on the topological ordering and communicate the
information back from the compilation nodes to the compilation manager,
making nodes wait if no function with all of its callees compiled is
available, but that makes the whole thing considerably more complicated.
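
On the manager side, that alternative is basically just the bookkeeping
in the sketch below (the interfaces are made up, recursive cycles would
have to be collapsed into one unit first, and a real manager would also
have to ship the late IPA summaries around):

#define MAX_FUNCS 1024

static int n_funcs;
static int pending_callees[MAX_FUNCS];  /* callees not yet compiled */
static int n_callers[MAX_FUNCS];
static int callers[MAX_FUNCS][8];       /* callers of each function,
                                           bounded for the sketch */
static int ready[MAX_FUNCS], n_ready;

/* Seed the ready queue with the leaves of the callgraph.  */
void
init_schedule (void)
{
  int f;

  for (f = 0; f < n_funcs; f++)
    if (pending_callees[f] == 0)
      ready[n_ready++] = f;
}

/* Called when a node asks for work.  Returns a function index, or -1
   meaning "nothing has all of its callees compiled yet; wait".  */
int
next_function_to_compile (void)
{
  return n_ready ? ready[--n_ready] : -1;
}

/* Called when a node reports function F compiled, together with the
   summary its callers need (clobbered registers, frame size, ...).  */
void
function_compiled (int f)
{
  int i;

  for (i = 0; i < n_callers[f]; i++)
    {
      int caller = callers[f][i];

      if (--pending_callees[caller] == 0)
        ready[n_ready++] = caller;
    }
}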

I wonder whether anyone knows what others are doing here.
>   
> b) Each of the spawned compilers actually only compiles function X if
> its index in the sort is x and x mod N == the value of
> -flto-index.  The sort is important because it allows for load balancing
> with very little effort.

If we decide to completely give up propagating the results of late
optimizations to other functions (i.e. IPA register allocation, stack
alignment, GOT pointer loading code removal and such), this seems to make
sense.  I see value in not needing to actually produce new .o files.
Determinism is probably not a big issue; we will require the same compiler
version etc. anyway.
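
For what it is worth, the selection rule from point 2b is tiny; a sketch
(the function record and the compile_function hook are invented for the
example):

#include <stdlib.h>
#include <string.h>

struct fn
{
  const char *name;             /* assembler name, used as a tie-break */
  long estimated_size;          /* estimated size after inlining */
};

/* Hypothetical hook that actually compiles one function.  */
extern void compile_function (struct fn *);

/* Largest functions first, so the expensive ones start early; the name
   tie-break makes this a total order, hence deterministic on every node
   even though qsort itself is not stable.  */
static int
fn_cmp (const void *a, const void *b)
{
  const struct fn *fa = a, *fb = b;

  if (fa->estimated_size != fb->estimated_size)
    return fa->estimated_size > fb->estimated_size ? -1 : 1;
  return strcmp (fa->name, fb->name);
}

/* Node LTO_INDEX of N_NODES compiles function x iff x % N_NODES ==
   LTO_INDEX, where x is its position in the shared sort.  */
void
compile_my_share (struct fn *fns, size_t n_fns,
                  size_t lto_index, size_t n_nodes)
{
  size_t x;

  qsort (fns, n_fns, sizeof (*fns), fn_cmp);
  for (x = 0; x < n_fns; x++)
    if (x % n_nodes == lto_index)
      compile_function (&fns[x]);
}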

Load balancing might be an issue here: at least my experience from GCC
bootstrap is that all the time goes to insn-attrtab, which is not the
majority of the source but is huge and drives our optimizers crazy.  Large
projects might always have some such dominating component that will make
some nodes run a lot longer than others.  Also, expecting that all nodes
have the same CPU power is not realistic IMO.

It seems to me that this might be IO bound too.  Compile 1GB of ooffice
source on 100 machines and you will distribute 100GB across your network.

With the other way, you distribute the compilation, get 1GB of
object files (with a lot of header IO, but headers are hopefully just a
minority, of course), collect them on the linker machine, which does the
big thing once and produces separate object files, ship those to the
nodes, and collect the real objects (with assembly) and really link.

So every piece of code travels once as source, twice as an LTO object and
once as a final object, roughly four times the source size in total rather
than a hundred times.  But I am by no means a parallelization expert.

The last thing to consider is how to handle static declarations.  We need
to give them unique names and make them exported from the .o file but not
exported from the linked library/binary.  This is not a big deal, at
least for ELF.
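
For example, on ELF a static could end up roughly like this (the
uniquified naming scheme is invented here): the symbol becomes global in
the .o, so other partitions can refer to it, while hidden visibility
keeps it out of the exported interface of the final library/binary:

/* Before, in foo.c (a made-up example):

     static int helper (int x) { return x + 1; }

   After renaming for LTO, the definition becomes a global symbol under a
   unique name, but stays invisible outside the linked module.  */

__attribute__ ((visibility ("hidden")))
int
helper_foo_c_0 (int x)
{
  return x + 1;
}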
> 
> What this scheme means is that all of the type merging and IPA pass
> analysis will be replicated on all of the processors in the group.
> However, this will not increase the wall time, only the CPU time (what's
> time to a pig?).  It may, and most likely will, decrease the wall time,
> which is the resource that we are actually most interested in.
> 
> I really do believe that this can be made to work very quickly, and
> unless/until it is shown not to be good enough, it is going to be the
> fastest way to get the system up and working on more than one processor.
> Writing out enough state to get things restarted on some other machine
> is going to be a lot of work.
> 
> Of course all of this depends on our compiler being deterministic.  The
> hashing police may have to be given special emergency powers to enforce
> this, but the truth is we are pretty good about this now.


Honza
> 
> 
> Comments?
> 
> Kenny
