Re: Optimize type streaming

Richard Biener Wed, 09 Jul 2014 03:49:35 -0700

On Wed, Jul 9, 2014 at 10:58 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hello,
> perhaps I could write a bit more on my longer-term plans.  At the moment 30%
> of Firefox WPA is taken by streaming trees and roughly another 30% is taken
> by the inliner.  It is a bit annoying but relatively easy to optimize the
> inliner, but trees represent a bigger problem.
>
> According to the stats the average tree is streamed in 20 times, and according
> to perf we spend about 1/4 of the time unpacking the sections; after that the
> actual read of fields & SCC unification dominates.


This part can probably be sped up quite a bit by doing the SCC unification
before materializing the SCC, that is, doing the "on-disk" format compare idea.
The issue here is that for bigger SCCs that have hash collisions in their
entries we need to do the edge walk - but eventually having two paths, one
for SCCs with all distinct entry hashes doing the on-disk format compare
and one doing the compare after materialization is possible (at least we
could do some statistics).  I hope to get to this after the Cauldron (but I
also have a half-way-done re-org on how we do the compression to save
memory that I want to finish at some point...)
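
A rough sketch of the two-path idea described above, in Python for brevity.
Everything here is an assumption made for illustration: an SCC is modelled as
a list of (entry_hash, raw_bytes) pairs, the raw bytes are treated as
position-independent, and compare_sccs / toy_materialize are made-up names,
not anything in the LTO streamer.  The returned path label is there so that
statistics on how often each path is taken could be collected.

```python
from collections import Counter


def compare_sccs(a, b, materialize):
    """Compare two SCCs; return (equal, path_taken).

    Each SCC is a list of (entry_hash, raw_bytes) pairs (hypothetical model).
    """
    if len(a) != len(b):
        return False, "size"
    hashes = [h for h, _ in a]
    if len(set(hashes)) == len(hashes):
        # All entry hashes are distinct: the hash gives a canonical order,
        # so the serialized entries can be compared without materializing.
        return sorted(a) == sorted(b), "on-disk"
    # Hash collisions: fall back to comparing after materialization.
    return materialize(a) == materialize(b), "materialized"


def toy_materialize(scc):
    # Stand-in for real tree materialization plus the edge walk.
    return Counter(payload for _, payload in scc)
```

With this model, two SCCs whose entries arrive in a different order but have
distinct hashes are matched on the "on-disk" path, and only SCCs with
colliding entry hashes pay for materialization.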

>  At low level,
> tree streaming is already pretty well optimized.
>
> I started to look into the following:
>
> 1) putting types&decls on diet
>
>    I started to move individual fields into more fitting locations, getting
>    rid of one field for many different reasons.  I am trying to do this
>    incrementally, keeping a flow of about one field per week.  Currently I am
>    stuck at:
>
>    https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01969.html
>
>    (moving DECL_ARGUMENTS). The plan is to
>    - get rid of decl_with_vis.  I removed all fields except for the symbol
>      table pointer (that will stay) and some flags I plan to handle soon:
>      comdat, weak and visibility (the last one is harder because the C++ FE
>      uses it on type declarations, but it is almost done).
>      The rest of the flags (a few variable/function-specific items that have
>      nothing to do with visibility) can go into decl_common, where there is
>      enough space.
>    - get rid of decl_non_common
>      Here I need to move arguments and results. Have patches for both.
>    - I plan to do the same on the type side - decompose TYPE_NON_COMMON in
>      favour of an explicit type hierarchy.
>    - experiment with getting rid of the RTL pointer
>
>      I plan to test moving DECL_RTL into on-the-side tables (one global for
>      global RTLs and one local for per-function RTLs). This should get us
>      closer to moving RTL into per-function storage again and make RTL easier
>      to reclaim.
>    - Once done with these I can recast the inheritance to have DATA_TYPE and
>      DATA_DECL as the common base of types/decls that do have data associated
>      with them.  Those can carry mode, sizes and alias info that is not needed
>      for functions, labels, type declarations etc.
>
>      I also wonder if we need canonical types for FUNCTION_TYPE/METHOD_TYPE
>      and other things that are not associated with readable data.
>
>      This has a bit of a multiple inheritance issue (which I do not want to
>      introduce), since we have decls with symbol table and decls with data.  I
>      think a simple union for that single symtab pointer will do.  In fact I
>      already tested restricting DECL_SIZE&friends to decls with data, but there
>      is a lot of frontend updating to do, as these fields are overridden for
>      many of the FE declarations.  (This is the reason why I added FE machinery
>      to allow custom memory storage for newly added decls in the patch above.)
>
>    Naturally this is good from a maintenance point of view: it has the
>    potential to reduce memory footprint and streaming size, improve
>    mergeability of trees (if the definition and external declarations look the
>    same in tree decls, we will merge more type variants, because currently we
>    keep class types in two copies, one for the unit defining them and another
>    for units using them) and also avoid streaming of stale pointers.  But it
>    is slow progress and the direct benefits are limited.

Splitting variant types and main variants up also would be a big saver but
interesting from an inheritance perspective.

Note that trying to get somewhere with debug info and LTO would also make
a lot of things no longer necessary to stream.  To summarize a recent IRC
discussion a working plan is to

 - emit debug info for types and decls at the compile-stage, producing one
   debug object per source TU (or emit the debug info into the (slim) LTO
   objects and play linker tricks later); add extra hidden symbols to be able
   to refer to decls and record those symbols on-the-side (and stream them to
   the LTO data)

 - at WPA / LTRANS phase materialize the debug symbol references in
   decl-lang-specific

 - at LTRANS phase emit debug info for the actual functions referring to the
   already output declarations via abstract origins.  For that to work we need
   to DW_tag_import their containers.  An LTRANS file creates
   a single DW_tag_compilation_unit (all other units are partially
   "inlined" into it).

 - we link the LTRANS objects and the per-TU debug objects in the final link.

that should get us to the point where most of the langhook issues in
dwarf2out.c for LTO are gone (most of, not all I fear ...).  And it should
enable us to avoid streaming type details (in theory we only need to preserve
the type structure, a bit more in detail than we do for canonical type merging).

Note that all this applies to non-LTO as well - if we emit the debug info
before gimplification we can finally introduce sth like a "gimple" type.

> 2) put BINFOs on diet
>
>    BINFOs are currently added to every class type.  We can drop them in case
>    they do not hold useful information for either devirtualization or debug
>    info.  This is now quite well defined.  The main offender is ipa-prop, which
>    still uses get_binfo_at_offset and walks binfos it should not.  I am working
>    on it.
>
> 3) ODR type merging
>
>    I have patches for this, but want to go a bit carefully - I need to discuss
>    the anonymous types with Jason and get the code for checking ODR violations
>    working well.
>
>    Basically for ODR types I can merge variant lists, which results in leaner
>    debug info and a bit less streaming WPA->ltrans.
>    It is also important for type propagation, and I have a prototype to handle
>    canonical types of ODR and anonymous types specially.
>
>    This actually increases LTO stream sizes (uncompressed) by about 6% to
>    stream explicit mangled names.  My 4.10 with the patch is still faster than
>    4.9, but I definitely would be happier if there was an easier way around.

I think it would be better to do the debug stuff first to see what we
really need.

> 4) Reduce size of LTO streams
>
>    This is what I was shooting for with the variant streaming (in addition to
>    having a sanity checker for 3, as bugs in these may turn types into a crazy
>    soup quite easily).
>    Types and decls are the most common things to stream, and 50% of types are
>    variants, so not streaming duplicated data in variants has a chance to save
>    about 30-40% of type storage.
>    Decls inherit some stuff from types (99% of the time), like DECL_SIZE and
>    friends.

Yeah, I can see that.  Though I'd rather organize the streaming in a way
the tree hierarchy ideally would look like.  That is, stream a new
LTO_type_variant kind, output a reference to the main variant and what
is changed, for example as n_changes, { change1, change2, ... }, e.g.
1, QUAL_CONST for a const qualified variant.

Rather than adding all those noisy checks all over the place.

The above is also easier to compare in the future on-disk compare stage.
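
A minimal sketch of what such a variant record could look like, again in
Python and again hypothetical: the tuple layout, the numeric values of
QUAL_CONST / QUAL_VOLATILE, the dict-based "types" and the function names are
assumptions for illustration, and changes are modelled as qualifier bits only.

```python
QUAL_CONST = 1      # illustrative bit values, not taken from GCC headers
QUAL_VOLATILE = 2


def encode_variant(main_ref, changes):
    # Record layout: kind, reference to main variant, n_changes, changes.
    return ("LTO_type_variant", main_ref, len(changes), tuple(changes))


def decode_variant(record, table):
    kind, main_ref, n_changes, changes = record
    if kind != "LTO_type_variant" or n_changes != len(changes):
        raise ValueError("malformed variant record")
    ty = dict(table[main_ref])  # direct copy of the main variant
    for change in changes:
        ty["quals"] |= change   # apply only what differs
    ty["main_variant"] = main_ref
    return ty
```

For a const-qualified variant of the type at index 0, the record is
("LTO_type_variant", 0, 1, (QUAL_CONST,)), i.e. the "1, QUAL_CONST" form
mentioned above, and two such records can be compared without rebuilding
either type.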

>    In my tests I went from a compression ratio of over 3 to 2.1 while keeping
>    about the same gzipped data - so this speeds up unpacking & rebuilding
>    trees, since direct copies are faster than LTO streamer table lookups.
>
> 5) Avoid merging of unmergeable things
>
>    This is the patch that drops hashtable to 1 for things where we know we do
>    not want to merge.
>    This is needed for correctness of ODR types and it also improves the
>    compression ratio of the streams, as SCC hashes are hard to gzip.

Yeah, that's indeed a good improvement we should get back to (finally
compute mergeability and indexability in a sane and consistent way)
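
The compression effect mentioned in item 5 is easy to reproduce in a toy
setting.  The sketch below assumes that "drops ... to 1" means emitting the
constant value 1 instead of a real SCC hash for unmergeable things; that
reading, the 32-bit hash width and the function name stream_hashes are
assumptions, and zlib merely stands in for whatever compression the streamer
uses.

```python
import random
import struct
import zlib


def stream_hashes(n_sccs, mergeable, seed=0):
    """Serialize one 32-bit hash per SCC (toy model of the hash stream)."""
    rng = random.Random(seed)
    out = bytearray()
    for _ in range(n_sccs):
        # Mergeable SCCs carry a real (random-looking) hash; unmergeable
        # ones carry the constant 1.
        h = rng.getrandbits(32) if mergeable else 1
        out += struct.pack("<I", h)
    return bytes(out)


mergeable_bytes = stream_hashes(1000, mergeable=True)
unmergeable_bytes = stream_hashes(1000, mergeable=False)
compressed_random = len(zlib.compress(mergeable_bytes))
compressed_constant = len(zlib.compress(unmergeable_bytes))
```

Both inputs are 4000 bytes, but the constant stream compresses to a tiny
fraction of the random-looking one, which is the "hard to gzip" point.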

> 6) Put variable initializers into named sections (as function bodies)
>
>    This is supposed to help vtables, but I am always too lazy to dive into the
>    details of our ugly low level section API.

That would be indeed nice.

> 7) Improve streaming of locations, as discussed several times.  Again I am a
>    bit discouraged by the need to make an extra section etc.  Location lookup
>    still shows high in the profile.

Likewise nice.

Any chance somebody would start on that debug info thing?  I guess we can
sit down at the Cauldron for that.

Richard.

> So some of my immediate plans.
> Honza
