On Wed, Jul 9, 2014 at 10:58 AM, Jan Hubicka <hubi...@ucw.cz> wrote: > Hello, > perhaps I could write bit more on my longer term plans. At the moment 30% of > firefox WPA is taken > by straming trees and another roughly 30% is taken by inliner. It is bit > anoying but relatively > easy to optimize inliner, but trees represent bigger problem. > > According to the stats average tree is streamed in 20times and according to > perf we spend about 1/4th > by unpacking the sections and then actual read of fields & SCC unification > dominates.
This part can probably be speed up quite a bit by doing the SCC unification before materializing the SCC, that is, doing the "on-disk" format compare idea. The issue here is that for bigger SCCs that have hash collisions in their entries we need to do the edge walk - but eventually having two paths, one for SCCs with all distinct entry hashes doing the on-disk format compare and one doing the compare after materialization is possible (at least we could do some statistics). I hope to get to this after the Cauldron (but I also have a half-way-done re-org on how we do the compression to save memory that I want to finish at some point...) > At low level, > tree streaming is already pretty well optimized. > > I started to look into the following: > > 1) putting types&decls on diet > > I started to move individual fields into more fitting locations, getting > rid of > one field for many different reasons. I am trying to do this incrementally > and keeping about one field per week flow. Currenlty I am stuck at: > > https://gcc.gnu.org/ml/gcc-patches/2014-06/msg01969.html > > (moving DECL_ARGUMENTS). The plan is to > - get rid of decl_with_vis. I removed all fields except for symbol table > pointer > (that will stay) and some flags I plan to handle soon - comdat, weak and > visibility. > The last one is harder because C++ FE uses it on type declarations, but > it is almost done). > The rest of flags (few variable/function specific items that has nothing > to do with visibility) can go into decl_common where it is enough of > space. > - get rid of decl_non_common > Here I need to move arguments and results. Have patches for both. > - I plan to do the same on type side - decompose TYPE_NON_COMMON in favour > of explicit type > hiearchy. > - experiment with getting rid of RTL pointer > > I plan to test moving DECL_RTL into on-side tables (one global for > global RTLs and one local for per-functoin RTLs). This should get us > closer moving RTL > into per-function storage again and make RTL easier to reclaim. > - Once done with these I can recast the inheritance to have DATA_TYPE and > DATA_DECL > that is common base of types/decls that do have data associated with > them. Those can > cary mode, sizes, alias info that is not needed for functions, labels, > type declarations > etc. > > I also wonder if we need canonical types for FUNCTION_TYPE/METHOD_TYPE > and other thing > that is not associated with readable data. > > This has bit of multiple inheritance issues (that I do not want to > introduce), > since we have decls with symbol table and decls with data. I think > simple union > for that single symtab pointer will do. In fact I already tested > restricting > DECL_SIZE&friends to decls with data, but there is a lot of frontend > updating to do, > as these fields are overriden for many of the FE declarations. (it is > reason why I > added FE machinery to allow custom memory storage for newly added ecls > in the patch above) > > Naturally this is good from maintenance point of view, it has potential to > reduce memory > footprint, streaming size, improve mergeability of trees (if definition > and external declarations > looks the same in tree decls, we will merge more type variants, because > currently we keep class types > in two copies, one for unit definig them and other for units using them) > and also avoid > stremaing of stale pointers, but it is a slow progress and the direct > benefits are limited. Splitting variant types and main variants up also would be a big saver but interesting from an inheritance perspective. Note that trying to get somewhere with debug info and LTO would also make a lot of things no longer necessary to stream. To summarize a recent IRC discussion a working plan is to - emit debug info for types and decls at the compile-stage, producing one debug object per source TU (or emit the debug info into the (slim) LTO objects and play linker tricks later); add extra hidden symbols to be able to refer to decls and record that symbols on-the-side (and stream them to the LTO data) - at WPA / LTRANS phase materialize the debug symbol references in decl-lang-specific - at LTRANS phase emit debug info for the actual functions refering to the already output declarations via abstract origins. For that to work we need to DW_tag_import their containers. An LTRANS file creates a single DW_tag_compilation_unit (all other units are partially "inlined" into it). - we link the LTRANS objects and the per-TU debug objects in the final link. that should get us to the point where most of the langhook issues in dwarf2out.c for LTO are gone (most of, not all I fear ...). And it should enable us to avoid streaming type details (in theory we only need to preserve the type structure, a bit more in detail than we do for canonical type merging). Note that all this applies to non-LTO as well - if we emit the debug info before gimplification we can finally introduce sth like a "gimple" type. > 2) put BINFOs on diet > > BINFOs are currently added to every class type. We can drop them in case > they do > not hold useful information for devirtualization neither debug info. This > is now > quite well defined. Main offender is ipa-prop that still uses > get_binfo_at_offset > and walks binfos it should not. I am working on it. > > 3) ODR type merging > > I have patches for this, but want to go bit curefuly - I need to discuss > with Jason > the anonymous types and get code for checking ODR violations working well. > > Basically for ODR types I can merge variant lists that results in leaner > debug info > and bit less of streaming WPA->ltrans > It is also important for type propagation and I have prototype to handle > canonical types > of ODR and anonymous types specially. > > This actually increases LTO stream sizes (uncompressed) by about 6% to > stream explicit > mangled names. My 4.10 with the patch is still faster than 4.9 but > definitely would be > happier if there was easier way around I think it would be better to do the debug stuff first to see what we really need. > 4) Reduce size of LTO streams > > This is what I was shooting for with the variant streaming (in addition to > have sanity checker > for 3 as bugs in these may turn types into a crazy soup quite easily). > Types and decls are most common things to stream, 50% of types are > variants, so not streaming > duplicated data in variants has chance to save about 30-40% of type > storage. > Decls inherits some stuff from types (99% of time), like DECL_SIZE and > friends. Yeah, I can see that. Though I'd rather organize the streaming in a way the tree hierarchy ideally would look like. That is, stream a new LTO_type_variant kind, output a reference to the main variant and what is changed, for example as n_changes, { change1, chage2, ... }, for example 1, QUAL_CONST for a const qualified vairant. Rather than adding all those noisy checks all over the place. The above is also easier to compare in the future on-disk compare stage. > In my tests I went from compression ration over 3 to 2.1 keeping about the > same gzipped > data - so this speeds up unpacking & rebuilding trees, since direct copies > are faster than > LTO streamer table lookups. > > 5) Avoid merging of unmergeable things > > This is the patch that drops hashtable to 1 for things where we know we do > not want to merge. > This is needed for correctness of ODR types and it also improves > compression ration of the > streams as SCC hashes are hard to gzip. Yeah, that's indeed a good improvement we should get back to (finally compute mergeability and indexability in a sane and consistent way) > 6) Put variable initializers into named sections (as function bodies) > > This is supposed to help vtables, but I am always too lazy to dive into > details of our > ugly low level section API. That would be indeed nice. > 7) Improve streaming of locations, as discussed several times. Again I am > bit discouraged > but need to make extra section etc. Location lookup still shows high in > the profile. Likewise nice. Any chance somebody would start on that debug info thing? I guess we can sit down at the cauldron for that. Richard. > So some of my immediate plans. > Honza