Did you try using FDO with -Os?  FDO should make the hot parts of the code
optimized similarly to -O3 but leave the other pieces optimized for size.
Using FDO with -O3 gives you the opposite, cold portions optimized
for size while the rest is optimized for speed.

FDO with -Os still optimizes for size, even in hot parts.  So to get reasonable
speedups you need -O3+FDO. -O3+FDO effectively defaults to -Os in the cold portions of the program.
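
For reference, the two setups being compared could be built roughly as sketched below. This is only an illustration: prog.c and the single-file build are made-up placeholders, not taken from this thread, and the script only prints the commands instead of running them.

```shell
# Print the usual two-step FDO build for a given optimization level.
# prog.c / prog are hypothetical names; a real project would pass the
# same flags through its build system instead.
fdo_build_cmds() {
    opt="$1"
    # Step 1: instrumented build.
    echo "gcc $opt -fprofile-generate prog.c -o prog"
    # Step 2: train run, which writes the .gcda profile data.
    echo "./prog"
    # Step 3: rebuild with the same output name, now reading the profile.
    echo "gcc $opt -fprofile-use prog.c -o prog"
}

OS_CMDS=$(fdo_build_cmds -Os)   # size-optimized build with FDO
O3_CMDS=$(fdo_build_cmds -O3)   # speed-optimized build with FDO
printf '%s\n\n%s\n' "$OS_CMDS" "$O3_CMDS"
```

The only difference between the two variants is the optimization level given alongside -fprofile-generate and -fprofile-use.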

Still, -Os+FDO should be somewhat faster than -Os alone, so a slowdown is a bug. It is not very thoroughly tested since it is not really used in practice.

Also do you get any warnings on profile mismatches? Perhaps something
is wrong to the degree that the relevant part of profile gets
misapplied.

I don't get any warnings about profile mismatches. I only get a "few"
warnings about missing gcda files, but that's expected.

Perhaps you could compile one of the less trivial files that you are sure is covered by the train run and send me the -fdump-tree-all-blocks -fdump-ipa-all dumps of the compilation so I can double-check that the profile seems sane. This could be a good start to rule out something stupid.
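
A command line for producing such dumps might look like the sketch below. The file name foo.c and the -Os level are assumptions for illustration; in practice the two dump flags would be appended to whatever command the build already uses for that file in the -fprofile-use step. The script only assembles and prints the command.

```shell
# Hypothetical source file; substitute one that the train run covers.
SRC=foo.c
# Assumed optimization level; use whatever the real build passes.
OPT=-Os
# The dump flags requested above.
DUMP_FLAGS="-fdump-tree-all-blocks -fdump-ipa-all"
# -fprofile-use makes the compiler read the .gcda data from the train run.
CMD="gcc $OPT -fprofile-use $DUMP_FLAGS -c $SRC"
echo "$CMD"
```

Running the printed command should leave a series of dump files next to the object file, which can then be archived and sent along.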

Honza

Cheers,

Mike


