On Fri 05 Aug 2011 09:32:05 AM CEST, Richard Guenther
<richard.guent...@gmail.com> wrote:
On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <j...@suse.de> wrote:
Did you try using FDO with -Os? FDO should make hot code parts
optimized similar to -O3 but leave other pieces optimized for size.
Using FDO with -O3 gives you the opposite, cold portions optimized
for size while the rest is optimized for speed.
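(For reference, the FDO cycle meant here is the usual two-pass build; the
file names and the training command below are just placeholders:

  gcc -Os -fprofile-generate -o app app.c
  ./app < training-input        # the run writes the *.gcda profile data
  gcc -Os -fprofile-use -o app app.c
)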
FDO with -Os still optimizes for size, even in hot parts.
I don't think so. Or at least that would be a bug. Shouldn't 'hot'
BBs/functions be optimized for speed even at -Os? Hm, I see predict.c
indeed always returns false for optimize_size :(
It was the outcome of a discussion held some time ago. I think it was Mark
promoting the point that users optimize for size when they use -Os, period.
I thought we only had the parts that are neither cold nor hot optimized
according to optimize_size. I originally wanted the hot attribute to
override -Os, so that well-annotated sources (i.e. the kernel) could
compile with -Os by default, explicitly declare the hot parts hot,
and get them compiled appropriately.
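For illustration, the kind of annotation I mean looks roughly like this
(the functions are made-up examples):

  /* Build the file with -Os by default, but mark the known-hot routine
     so it could be optimized for speed, and a diagnostic routine as
     cold.  */
  __attribute__ ((hot)) int
  checksum (const unsigned char *buf, long len)
  {
    int sum = 0;
    while (len--)
      sum += *buf++;
    return sum;
  }

  __attribute__ ((cold)) void fatal_error (const char *msg);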
With profile feedback, however, the current logic is binary - i.e.
blocks are either hot, because their count is bigger than the threshold,
or cold. We don't really have an "I don't really know" state there. In
some cases it would make sense - i.e. there are optimizations that we
want to do only in the hottest parts of the code, but we don't have any
logic for that.
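Roughly, the classification with feedback is of this shape (a simplified
sketch, not the exact predict.c code; the fixed divisor stands in for the
hot-bb-count-fraction parameter):

  /* Simplified sketch: a block is hot when its count is at least a
     fixed fraction of the maximal count seen in the program; everything
     else is cold.  There is no third, "don't know" state.  */
  static int
  block_is_hot (long long count, long long max_count)
  {
    return count >= max_count / 10000;  /* divisor is a tunable parameter */
  }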
My plan is to extend ipa-profile to do better hot/cold partitioning
first: at the moment we decide based on a fixed fraction of the maximal
count in the program. This is unnecessarily conservative for programs
whose profiles are not terribly flat. At the IPA level we could collect
a histogram of instruction counts (i.e. figure out how much time we spend
on instructions executed N times) and then figure out where the threshold
is so that 99% of executed instructions belong to the hot region. This
should give noticeably smaller binaries.
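A sketch of how that threshold could be computed from such a histogram
(names and data layout are made up for illustration):

  #include <stdlib.h>

  struct entry { long long count; };  /* execution count of one insn/BB */

  static int
  cmp_desc (const void *a, const void *b)
  {
    long long ca = ((const struct entry *) a)->count;
    long long cb = ((const struct entry *) b)->count;
    return (ca < cb) - (ca > cb);     /* sort by count, descending */
  }

  /* Return the smallest count such that instructions executed at least
     that many times cover 99% of all dynamically executed instructions.  */
  long long
  hot_threshold (struct entry *e, size_t n)
  {
    long long total = 0, covered = 0;
    size_t i;

    for (i = 0; i < n; i++)
      total += e[i].count;
    qsort (e, n, sizeof *e, cmp_desc);
    for (i = 0; i < n; i++)
      {
        covered += e[i].count;
        if (covered * 100 >= total * 99)
          return e[i].count;
      }
    return 0;
  }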
So to get reasonable speedups you need -O3+FDO. -O3+FDO effectively
defaults to -Os in the cold portions of the program.
Well, but unless your training coverage is 100% all parts with no coverage
get optimized with -O3 instead of -Os. And I bet coverage for mozilla
isn't even close to 100%. Thus I think recommending -O3 for FDO is
usually a bad idea.
Code with no coverage is cold in our model (as is code executed only once
or so) and thus optimized for size even at -O3+FDO. This is a bit
aggressive on the optimize-for-size side. We might consider changing
this policy, but so far I haven't seen any complaints about it...
Honza
So - did you try FDO with -O2? ;)
Still, -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
a bug. It is not tested very thoroughly, though, since it is not really
used in practice. Also, do you get any warnings about profile mismatches?
Perhaps something is wrong to the degree that the relevant part of the
profile gets misapplied.
I don't get any warnings about profile mismatches. I only get a "few"
missing gcda file warnings, but that's expected.
Perhaps you could compile one of the less trivial files you are sure is
covered by the train run and send me the -fdump-tree-all-blocks
-fdump-ipa-all dumps of the compilation so I can double-check that the
profile seems sane. This would be a good start for ruling out something
stupid.
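Something along these lines (the file name and the options other than the
dump flags are placeholders):

  gcc -O3 -fprofile-use -fdump-tree-all-blocks -fdump-ipa-all -c hot-file.c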
Honza
Cheers,
Mike