On Wed, Apr 2, 2014 at 6:11 PM, Martin Liška <mli...@suse.cz> wrote:
On 04/02/2014 04:13 PM, Martin Liška wrote:
On 03/27/2014 10:48 AM, Martin Liška wrote:
The previous patch is wrong, I made a mistake in the name ;)
Martin
On 03/27/2014 09:52 AM, Martin Liška wrote:
On 03/25/2014 09:50 PM, Jan Hubicka wrote:
Hello,
I've been compiling Chromium with LTO and I noticed that WPA
stream_out forks and streams in parallel:
http://gcc.gnu.org/ml/gcc-patches/2013-11/msg02621.html.
I am unable to fit in 16GB of memory: ld uses about 8GB and lto1
about 6GB. When WPA starts to fork, memory consumption increases so
much that lto1 is killed. I would appreciate a --param option to
disable this WPA fork. The number of forks is taken from the build
system (-flto=9), which is fine for the LTRANS phase, because by then
ld has released the aforementioned 8GB.
What do you think about that?
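(For context on the mechanism under discussion: the patch linked above makes
WPA fork children so that LTRANS partitions are streamed out concurrently;
each child inherits a copy-on-write image of the whole WPA process, so peak
resident memory scales with the number of simultaneous writers. Below is a
minimal standalone sketch of that scheme, with a worker cap standing in for
the requested --param; all names here are illustrative, not GCC's.)

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

/* Hypothetical stand-in for streaming one LTRANS partition to disk.  */
static void
stream_out_partition (int i)
{
  std::printf ("pid %d: streaming partition %d\n", (int) getpid (), i);
}

/* Stream NPARTITIONS partitions with at most MAX_WORKERS forked
   children.  MAX_WORKERS == 0 means "never fork", which is what the
   proposed --param would request.  */
static void
stream_out_all (int npartitions, int max_workers)
{
  if (max_workers == 0)
    {
      for (int i = 0; i < npartitions; i++)
        stream_out_partition (i);
      return;
    }

  int live = 0;
  for (int i = 0; i < npartitions; i++)
    {
      if (live == max_workers)
        {
          /* Reap one finished child before forking another, capping
             the number of copy-on-write images alive at once.  */
          waitpid (-1, NULL, 0);
          live--;
        }
      pid_t pid = fork ();
      if (pid < 0)
        {
          stream_out_partition (i);  /* fork failed: fall back to serial  */
          continue;
        }
      if (pid == 0)
        {
          stream_out_partition (i);  /* child: write one partition  */
          _exit (0);
        }
      live++;
    }
  while (live-- > 0)
    waitpid (-1, NULL, 0);
}

int
main ()
{
  stream_out_all (9, 4);  /* like -flto=9 capped to 4 writers  */
  return 0;
}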
I can take a look - our measurements suggested that WPA memory is
later dominated by LTRANS. Perhaps Chromium does something that makes
WPA explode; that would be interesting to analyze. I have not managed
to get through the Chromium LTO build process recently (ninja builds
are not my friends), so can you send me the instructions?
Honza
Thanks,
Martin
Here are instructions for building Chromium with LTO:
1) Install depot_tools and export the PATH variable according to the
guide: http://www.chromium.org/developers/how-tos/install-depot-tools
2) Check out the source code: gclient sync; cd src
3) Apply the patch (it enables the system gold linker and disables LTO
for a sandbox that uses top-level asm)
4) Ensure that `which ld` points to ld.gold
5) Ensure that ld.bfd points to the BFD linker
6) Run: build/gyp_chromium -Dwerror=
7) ninja -C out/Release chrome -jX
If there are any problems, follow:
https://code.google.com/p/chromium/wiki/LinuxBuildInstructions
Martin
Hello,
taking the latest trunk GCC, I built Firefox and Chromium. Both
projects were compiled without debugging symbols, at -O2, on an
8-core machine.
Firefox:
-flto=9, peak memory usage (in LTRANS): 11GB
Chromium:
-flto=6, peak memory usage (in the parallel WPA phase): 16.5GB
For details please see the attached graphs. The attachment also
contains -fmem-report and -fmem-report-wpa output.
I think the claimed reduction of the memory footprint to ~3.5GB is a
bit optimistic:
http://gcc.gnu.org/gcc-4.9/changes.html
Is there any way we can reduce the memory footprint?
Attachment (due to size restriction):
https://drive.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit?usp=sharing
Thank you,
Martin
The previous email presented somewhat misleading graphs (they were
influenced by --enable-gather-detailed-mem-stats).
Firefox:
-flto=9, WPA peak: 8GB, LTRANS peak: 8GB
-flto=4, WPA peak: 5GB, LTRANS peak: 3.5GB
-flto=1, WPA peak: 3.5GB, LTRANS peak: ~1GB
These data show that parallel WPA streaming increases the short-lived
memory footprint by 4.5GB for -flto=9 (and by 1.5GB in the case of
-flto=4).
For more details, please see the attachment.
The main overhead comes from maintaining the state during output of
the global types/decls. We maintain somewhat "duplicate" info here by
having both the tree_ref_encoder and the streamer cache. Eventually
we can free the tree_ref_encoder pointer-map early, like with
Index: lto-streamer-out.c
===================================================================
--- lto-streamer-out.c  (revision 209018)
+++ lto-streamer-out.c  (working copy)
@@ -2423,10 +2455,18 @@ produce_asm_for_decls (void)
 
   gcc_assert (!alias_pairs);
 
-  /* Write the global symbols.  */
+  /* Get rid of the global decl state hash tables to save some memory.  */
   out_state = lto_get_out_decl_state ();
-  num_fns = lto_function_decl_states.length ();
+  for (int i = 0; i < LTO_N_DECL_STREAMS; i++)
+    if (out_state->streams[i].tree_hash_table)
+      {
+        delete out_state->streams[i].tree_hash_table;
+        out_state->streams[i].tree_hash_table = NULL;
+      }
+
+  /* Write the global symbols.  */
   lto_output_decl_state_streams (ob, out_state);
+  num_fns = lto_function_decl_states.length ();
   for (idx = 0; idx < num_fns; idx++)
     {
       fn_out_state =
as we do already for the fn state streams (untested).
We can also avoid re-allocating the output hashtable/vector by
allocating a bigger initial size for the streamer_tree_cache after
(or in) create_output_block. Note that the pointer-set already
expands when its fill level exceeds 25%, and it grows exponentially
(similar to hash_table, btw, which grows only at a 75% fill level).
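(As an aside, the effect is generic: any hash container that starts small
and doubles on overflow performs O(log n) rehashes while being filled with
n elements, and each rehash revisits every live entry; reserving the
expected size up front avoids all of them. A small illustration of the
point, with std::unordered_map standing in for the pointer-map/hash_table;
the growth thresholds of the real GCC structures are as noted above.)

#include <cstdio>
#include <unordered_map>

int
main ()
{
  const size_t n = 1000000;

  /* Default-grown table: starts tiny and rehashes repeatedly as it
     fills; count how often the bucket array is reallocated.  */
  std::unordered_map<size_t, size_t> grown;
  size_t rehashes = 0;
  size_t buckets = grown.bucket_count ();
  for (size_t i = 0; i < n; i++)
    {
      grown[i] = i;
      if (grown.bucket_count () != buckets)
        {
          rehashes++;
          buckets = grown.bucket_count ();
        }
    }
  std::printf ("default growth: %zu rehashes\n", rehashes);

  /* Pre-sized table: one up-front allocation, no rehash while
     inserting the same n elements.  */
  std::unordered_map<size_t, size_t> reserved;
  reserved.reserve (n);
  buckets = reserved.bucket_count ();
  for (size_t i = 0; i < n; i++)
    reserved[i] = i;
  std::printf ("reserved: %s\n",
               reserved.bucket_count () == buckets
               ? "no rehash" : "rehashed");
  return 0;
}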
OTOH simply summing the lengths of all decl streams results in a
lower value than the actual number of output trees in the output
block.
Humm.
But this is clearly a data structure that could be worth optimizing
in some way. For example, during writing we don't need the streamer
cache nodes array (we just need a counter to assign indexes).
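(To make that concrete: on the writer side the cache only has to answer
"has this tree been streamed, and at which index?", so a map plus a
monotonically increasing counter suffices; the index-to-node array is only
needed on the reader side. A minimal sketch of such a write-only cache
follows; the names are illustrative, not GCC's streamer_tree_cache API.)

#include <cstdio>
#include <unordered_map>

/* Stand-in for 'tree'; any pointer-identity node type works.  */
struct node { int payload; };

/* Write-only streamer cache: maps a node to its stream index.  Unlike
   a read-side cache, no index->node array is kept; a counter assigns
   the slots.  */
class write_cache
{
  std::unordered_map<const node *, unsigned> indexes_;
  unsigned next_index_ = 0;

public:
  /* Return true if N was already cached; *IX receives its index
     either way.  */
  bool
  lookup_or_add (const node *n, unsigned *ix)
  {
    auto res = indexes_.emplace (n, next_index_);
    if (res.second)
      next_index_++;  /* Newly inserted: consume one index.  */
    *ix = res.first->second;
    return !res.second;
  }
};

int
main ()
{
  node a{1}, b{2};
  write_cache cache;
  unsigned ix;
  cache.lookup_or_add (&a, &ix);  /* first time: index 0  */
  std::printf ("a -> %u\n", ix);
  cache.lookup_or_add (&b, &ix);  /* first time: index 1  */
  std::printf ("b -> %u\n", ix);
  bool seen = cache.lookup_or_add (&a, &ix);  /* hit: index 0 again  */
  std::printf ("a again -> %u (seen=%d)\n", ix, (int) seen);
  return 0;
}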
Attached is a patch that tries to do that plus the above (it is in
testing right now). Maybe you can check whether it makes a noticeable
difference.
Richard.