On Fri 05 Aug 2011 09:32:05 AM CEST, Richard Guenther
<richard.guent...@gmail.com> wrote:
On Thu, Aug 4, 2011 at 8:42 PM, Jan Hubicka <j...@suse.de> wrote:
Did you try using FDO with -Os? FDO should make hot code parts
optimized similar to -O3 but leave other pieces optimized for size.
Using FDO with -O3 gives you the opposite, cold portions optimized
for size while the rest is optimized for speed.
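(For reference, the FDO cycle meant here is the usual two-pass build; the
file names and the training command below are just placeholders:

  gcc -Os -fprofile-generate -o app app.c
  ./app < training-input        # the run writes the *.gcda profile data
  gcc -Os -fprofile-use -o app app.c
)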
FDO with -Os still optimizes for size, even in hot parts.
I don't think so. Or at least that would be a bug. Shouldn't 'hot'
BBs/functions be optimized for speed even at -Os? Hm, I see predict.c
indeed always returns false for optimize_size :(
It was the outcome of a discussion held some time ago. I think it was Mark
promoting the point that users optimize for size when they use -Os, period.
I thought we only had the parts that are neither cold nor hot optimized
according to optimize_size. I originally wanted the hot attribute to
override -Os, so that well-annotated sources (i.e. the kernel) could
compile with -Os by default, explicitly declare the hot parts hot,
and get them compiled appropriately.
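For illustration, the kind of annotation I mean looks roughly like this
(the functions are made-up examples):

  /* Build the file with -Os by default, but mark the known-hot routine
     so it could be optimized for speed, and a diagnostic routine as
     cold.  */
  __attribute__ ((hot)) int
  checksum (const unsigned char *buf, long len)
  {
    int sum = 0;
    while (len--)
      sum += *buf++;
    return sum;
  }

  __attribute__ ((cold)) void fatal_error (const char *msg);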
With profile feedback, however, the current logic is binary - i.e.
blocks are either hot, because their count is bigger than the threshold,
or cold. We don't really have an "I don't really know" state there. In
some cases it would make sense - i.e. there are optimizations that we
want to do only in the hottest parts of the code, but we don't have any
logic for that.
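Roughly, the classification with feedback is of this shape (a simplified
sketch, not the exact predict.c code; the fixed divisor stands in for the
hot-bb-count-fraction parameter):

  /* Simplified sketch: a block is hot when its count is at least a
     fixed fraction of the maximal count seen in the program; everything
     else is cold.  There is no third, "don't know" state.  */
  static int
  block_is_hot (long long count, long long max_count)
  {
    return count >= max_count / 10000;  /* divisor is a tunable parameter */
  }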
My plan is to extend ipa-profile to do better hot/cold partitioning
first: at the moment we decide based on a fixed fraction of the maximal
count in the program. This is unnecessarily conservative for programs
whose profiles are not terribly flat. At the IPA level we could collect
a histogram of instruction counts (i.e. figure out how much time we spend
on instructions executed N times) and then figure out where the threshold
is so that 99% of executed instructions belong to the hot region. This
should give noticeably smaller binaries.
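A sketch of how that threshold could be computed from such a histogram
(names and data layout are made up for illustration):

  #include <stdlib.h>

  struct entry { long long count; };  /* execution count of one insn/BB */

  static int
  cmp_desc (const void *a, const void *b)
  {
    long long ca = ((const struct entry *) a)->count;
    long long cb = ((const struct entry *) b)->count;
    return (ca < cb) - (ca > cb);     /* sort by count, descending */
  }

  /* Return the smallest count such that instructions executed at least
     that many times cover 99% of all dynamically executed instructions.  */
  long long
  hot_threshold (struct entry *e, size_t n)
  {
    long long total = 0, covered = 0;
    size_t i;

    for (i = 0; i < n; i++)
      total += e[i].count;
    qsort (e, n, sizeof *e, cmp_desc);
    for (i = 0; i < n; i++)
      {
        covered += e[i].count;
        if (covered * 100 >= total * 99)
          return e[i].count;
      }
    return 0;
  }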
So to get reasonable speedups you need -O3+FDO. -O3+FDO effectively
defaults to -Os in the cold portions of the program.
Well, but unless your training coverage is 100% all parts with no coverage
get optimized with -O3 instead of -Os. And I bet coverage for mozilla
isn't even close to 100%. Thus I think recommending -O3 for FDO is
usually a bad idea.
Code with no coverage is cold in our model (as is code executed only once
or so) and thus optimized for size even at -O3+FDO. This is a bit
aggressive on the optimize-for-size side. We might consider changing
this policy, but so far I haven't seen any complaints about it...
Honza
So - did you try FDO with -O2? ;)
Still, -Os+FDO should be somewhat faster than -Os alone, so a slowdown is
a bug. It is not tested very thoroughly, though, since it is not really
used in practice. Also, do you get any warnings about profile mismatches?
Perhaps something is wrong to the degree that the relevant part of the
profile gets misapplied.
I don't get any warnings about profile mismatches. I only get a "few"
missing gcda file warnings, but that's expected.
Perhaps you could compile one of the less trivial files you are sure is
covered by the train run and send me the -fdump-tree-all-blocks
-fdump-ipa-all dumps of the compilation so I can double-check that the
profile seems sane. This would be a good start for ruling out something
stupid.
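Something along these lines (the file name and the options other than the
dump flags are placeholders):

  gcc -O3 -fprofile-use -fdump-tree-all-blocks -fdump-ipa-all -c hot-file.c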
Honza
Cheers,
Mike