On 3/13/20 4:11 PM, Jan Hubicka wrote:
$ time g++ -O2 /tmp/gimple-match.ii -c -flto -fno-checking
real 0m8.709s
user 0m8.543s
WPA+LTRANS:
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=4 -fno-checking
real 0m11.220s
user 0m33.067s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=6 -fno-checking
real 0m9.880s
user 0m35.599s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o --param lto-partitions=8 -fno-checking
real 0m6.681s
user 0m39.746s
default:
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -r -o gimple-match2.o -fno-checking
real 0m6.065s
user 1m22.698s
I did
/aux/hubicka/trunk-git/build2/./prev-gcc/xg++
-B/aux/hubicka/trunk-git/build2/./prev-gcc/
-B/usr/local/x86_64-pc-linux-gnu/bin/ -nostdinc++
-B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs
-B/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs
-I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu
-I/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/include
-I/aux/hubicka/trunk-git/libstdc++-v3/libsupc++
-L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/src/.libs
-L/aux/hubicka/trunk-git/build2/prev-x86_64-pc-linux-gnu/libstdc++-v3/libsupc++/.libs
-fno-PIE -c -g -O2 -fchecking=0 -DIN_GCC -fno-exceptions -fno-rtti
-fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings
-Wcast-qual -Wno-error=format-diag -Wmissing-format-attribute
-Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros
-Wno-overlength-strings -Werror -fno-common -Wno-unused -DHAVE_CONFIG_H -I. -I.
-I../../gcc -I../../gcc/. -I../../gcc/../include -I../../gcc/../libcpp/include
-I/aux/hubicka/trunk-git/build2/./gmp -I/aux/hubicka/trunk-git/gmp
-I/aux/hubicka/trunk-git/build2/./mpfr/src -I/aux/hubicka/trunk-git/mpfr/src
-I/aux/hubicka/trunk-git/mpc/src -I../../gcc/../libdecnumber
-I../../gcc/../libdecnumber/bid -I../libdecnumber -I../../gcc/../libbacktrace
-I/aux/hubicka/trunk-git/build2/./isl/include
-I/aux/hubicka/trunk-git/isl/include -o gimple-match.o -MT gimple-match.o -MMD
-MP -MF ./.deps/gimple-match.TPo gimple-match.c -flto
(copied from the build, with checking disabled and -flto added) and I get:
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=128 -r
real 0m10.394s
user 2m13.809s
sys 0m3.896s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=8 -r
real 0m21.033s
user 2m3.063s
sys 0m2.539s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=6 -r
real 0m23.975s
user 1m56.139s
sys 0m2.595s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=4 -r
real 0m32.383s
user 1m39.411s
sys 0m2.213s
With debug info disabled (as in your runs, though I guess that is a less
realistic setting) I get:
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=128 -r
real 0m10.905s
user 1m55.065s
sys 0m2.956s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=8 -r
real 0m17.297s
user 1m26.513s
sys 0m1.626s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=6 -r
real 0m22.365s
user 1m30.969s
sys 0m1.386s
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build2/gcc$ time
/aux/hubicka/trunk-install/bin/gcc -flto=auto -flinker-output=nolto-rel
gimple-match.o -fno-checking --param lto-partitions=4 -r
real 0m26.534s
user 1m21.593s
sys 0m0.902s
So I do not see such a notable difference in user times (but they are
consistently worse than yours). Could you perhaps try profiling it with
perf, including the system profile? It may give us some idea why things
behave differently.
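To put the two sets of measurements side by side, here is a minimal Python sketch that computes how much the total user time grows when going from 4 partitions to the default/128-partition setting. The timings are copied from the runs quoted in this thread; the helper names (`to_seconds`, `growth`) are made up for illustration and are not part of any tool used here.

```python
import re


def to_seconds(stamp):
    """Convert a `time` stamp such as '1m22.698s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def growth(few, many):
    """Ratio of user time with many partitions to user time with few."""
    return to_seconds(many) / to_seconds(few)


# User times copied from the thread: (4 partitions, default or 128 partitions).
USER_TIMES = {
    "Martin (4 vs. default)": ("0m33.067s", "1m22.698s"),
    "Honza, with debug info (4 vs. 128)": ("1m39.411s", "2m13.809s"),
    "Honza, no debug info (4 vs. 128)": ("1m21.593s", "1m55.065s"),
}

if __name__ == "__main__":
    for label, (few, many) in USER_TIMES.items():
        print(f"{label}: user time grows {growth(few, many):.2f}x")
```

On these numbers the user time grows by roughly 2.5x in Martin's runs but only by roughly 1.3x to 1.4x in Honza's, which is the discrepancy in question.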
That's strange. So let's take my gimple-match.ii:
https://drive.google.com/file/d/1B8d3bIvz1KA_ksIo8h-JgkaJTCRiSPR4/view?usp=sharing
For gcc9 package (LTO+PGO) I get:
$ time g++ -O2 gimple-match.ii -c -flto
real 0m8.180s
user 0m7.992s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=4 -r
real 0m9.041s
user 0m28.157s
sys 0m0.493s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=128 -r
real 0m6.011s
user 1m20.326s
sys 0m2.147s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking -r
real 0m6.303s
user 1m18.789s
sys 0m2.244s
$ time gcc -flto=auto -flinker-output=nolto-rel gimple-match.o -fno-checking --param lto-partitions=8 -r
real 0m5.875s
user 0m38.938s
sys 0m0.784s
For default I get:
perf report --stdio | head -n30
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 351K of event 'cycles:u'
# Event count (approx.): 341558047686
#
# Overhead  Command      Shared Object  Symbol
# ........  ...........  .............  ...................................................
#
     3.61%  lto1-ltrans  lto1           [.] df_worklist_dataflow
     1.93%  lto1-ltrans  lto1           [.] cleanup_cfg
     1.15%  lto1-ltrans  lto1           [.] init_alias_analysis
     1.02%  lto1-ltrans  lto1           [.] pre_and_rev_post_order_compute_fn
     0.93%  lto1-ltrans  lto1           [.] calculate_dominance_info
     0.84%  lto1-ltrans  lto1           [.] inverted_post_order_compute
     0.75%  lto1-ltrans  lto1           [.] post_order_compute
     0.71%  lto1-ltrans  libc-2.31.so   [.] _int_malloc
     0.69%  lto1-ltrans  lto1           [.] constrain_operands
     0.68%  lto1-ltrans  lto1           [.] df_bb_refs_record
     0.59%  lto1-ltrans  lto1           [.] side_effects_p
     0.53%  lto1-ltrans  lto1           [.] delete_unreachable_blocks
     0.53%  lto1-ltrans  lto1           [.] rewrite_update_dom_walker::before_dom_children
     0.49%  lto1-ltrans  lto1           [.] bitmap_set_bit
     0.47%  lto1-ltrans  lto1           [.] record_temporary_equivalences
     0.46%  lto1-ltrans  lto1           [.] single_def_use_dom_walker::before_dom_children
     0.46%  lto1-ltrans  lto1           [.] df_compact_blocks
     0.45%  lto1-ltrans  lto1           [.] substitute_and_fold_engine::substitute_and_fold
     0.45%  lto1-ltrans  libc-2.31.so   [.] _int_free
Martin
The compiler binary I use was built with profiledbootstrap and LTO.
Honza
So I would recommend setting the param value to 75000, which leads to 6
partitions. That would be:
9+10s = 19s vs. 40s (total real time 44s). That seems reasonable to me.
Thoughts?
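As a rough illustration of why raising the minimum partition size lowers the partition count, here is a simplified Python model. It is not GCC's partitioning code: the real balanced partitioner also takes symbol dependencies into account. `TOTAL_SIZE` is a hypothetical figure, picked only so that a minimum of 75000 yields the 6 partitions mentioned above; the actual estimated size of the unit is not stated in this thread.

```python
# Hypothetical unit size in estimated instructions (an assumption, chosen so
# that a minimum partition size of 75000 gives 6 partitions).
TOTAL_SIZE = 450_000


def estimated_partitions(total_size, min_partition_size, max_partitions=128):
    """Simplified model: aim for max_partitions equal pieces, but never let a
    piece be smaller than min_partition_size."""
    partition_size = max(total_size // max_partitions, min_partition_size)
    return max(1, min(max_partitions, total_size // partition_size))


if __name__ == "__main__":
    for min_size in (10_000, 75_000):
        count = estimated_partitions(TOTAL_SIZE, min_size)
        print(f"lto-min-partition={min_size}: about {count} partitions")
```

In this model the partition count is bounded both by lto-partitions and by the total size divided by lto-min-partition, so for a unit of this assumed size the minimum partition size is the binding limit.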
Thanks,
Martin
gcc/ChangeLog:
2020-03-13 Martin Liska <mli...@suse.cz>
* params.opt: Bump lto-min-partition in order to not create
too many LTRANS.
---
gcc/params.opt | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/gcc/params.opt b/gcc/params.opt
index e39216aa7d0..49fafac20af 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -363,7 +363,7 @@ Common Joined UInteger Var(param_max_lto_streaming_parallelism) Init(32) Integer
 maximal number of LTO partitions streamed in parallel.
 
 -param=lto-min-partition=
-Common Joined UInteger Var(param_min_partition_size) Init(10000) Param
+Common Joined UInteger Var(param_min_partition_size) Init(75000) Param
 Minimal size of a partition for LTO (in estimated instructions).
 
 -param=lto-partitions=