On Thu, Nov 28, 2013 at 6:06 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> Dear Teresa and Jan,
>>    I tried to test Teresa's patch, but I've encountered two bugs
>> during usage of -fprofile-generate/use (one in SPEC CPU 2006 and
>> Inkscape).
>
> Thanks, this is non-LTO run. Is there a chance to get -flto version, too?
> So we see how things combine with -freorder-function
>>
>> This will be probably for Jan:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59266
>>
>> second one:
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59265
>>
>> There are numbers I recorded for GIMP with and without block reordering.
>>
>> GIMP (-freorder-blocks-and-partition)
>> pages read (no readahead): 597 pages (4K)
>>
>> GIMP (-no-freorder-blocks-and-partition)
>> pages read (no readahead): 596 pages (4K)
>
> The graphs themselves seems bit odd however, why do we have so many accesses
> to cold section with -fno-reorder-blocks-and-partition again?

Comparing the two graphs I don't see additional accesses in the cold
section from -freorder-blocks-and-partition. For the most part the
graphs look identical. In contrast, the graphs Martin had generated
with and without -freorder-blocks-and-partition back in August had a
significant increase in execution out of text.unlikely.

I'm wondering if the -fno-reorder-blocks-and-partition graph really
had that disabled. I am surprised that the size of the .text and
.text.hot did not shrink from splitting. And the accesses in the cold
section in both graphs look suspiciously like the accesses we ended up
with in the cold section when enabling -freorder-blocks-and-partition
back in Aug (although there are certainly a lot fewer than before,
which is good news).

Martin, can you check that the binary used for
-fno-reorder-blocks-and-partition really doesn't have any splitting?

Thanks,
Teresa

>
> Honza
>>
>> Martin
>>
>> On 19 November 2013 23:18, Teresa Johnson <tejohn...@google.com> wrote:
>> > On Tue, Nov 19, 2013 at 9:40 AM, Jeff Law <l...@redhat.com> wrote:
>> >> On 11/19/13 10:24, Teresa Johnson wrote:
>> >>>
>> >>> On Tue, Nov 19, 2013 at 7:44 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> >>>>
>> >>>> Martin,
>> >>>> can you, please, generate the updated systemtap with
>> >>>> -freorder-blocks-and-partition enabled?
>> >>>>
>> >>>> I am in favour of enabling this - it is usefull pass and it is pointless
>> >>>> ot
>> >>>> have passes that are not enabled by default.
>> >>>> Is there reason why this would not work on other ELF target? Is it
>> >>>> working
>> >>>> with Darwin and Windows?
>> >>>
>> >>>
>> >>> I don't know how to test these (I don't see any machines listed in the
>> >>> gcc compile farm of those types). For Windows, I assume you mean
>> >>> MinGW, which should be enabled as it is under i386. Should I disable
>> >>> it there and for Darwin?
>> >>>
>> >>>>
>> >>>>> This patch enables -freorder-blocks-and-partition by default for x86
>> >>>>> at -O2 and up. It is showing some modest gains in cpu2006 performance
>> >>>>> with profile feedback and -O2 on an Intel Westmere system. 
>> >>>>> Specifically,
>> >>>>> I am seeing consistent improvements in 401.bzip2 (1.5-3%), 
>> >>>>> 483.xalancbmk
>> >>>>> (1.5-3%), and 453.povray (2.5-3%), and no apparent regressions.
>> >>>>
>> >>>>
>> >>>> This actually sounds very good ;)
>> >>>>
>> >>>> Lets see how the systemtap graphs goes.  If we will end up with problem
>> >>>> of too many accesses to cold section, I would suggest making cold 
>> >>>> section
>> >>>> subdivided into .unlikely and .unlikely.part (we could have better name)
>> >>>> with the second consisting only of unlikely parts of hot&normal
>> >>>> functions.
>> >>>>
>> >>>> This should reduce the problems we are seeing with mistakely identifying
>> >>>> code to be cold because of roundoff errors (and it probably makes sense
>> >>>> in general, too).
>> >>>> We will however need to update gold and ld for that.
>> >>>
>> >>>
>> >>> Note that I don't think this would help much unless the linker is
>> >>> changed to move the cold split section close to the hot section. There
>> >>> is probably some fine-tuning we could do eventually in the linker
>> >>> under -ffunction-sections without putting the split portions in a
>> >>> separate section. I.e. clump the split parts together within unlikely.
>> >>> But hopefully this can all be done later on as follow-on work to boost
>> >>> the performance further.
>> >>>
>> >>>>>
>> >>>>> Bootstrapped and tested on x86-64-unknown-linux-gnu with a normal
>> >>>>> bootstrap, a profiledbootstrap and an LTO profiledbootstrap. All were
>> >>>>> configured with --enable-languages=all,obj-c++ and tested for both
>> >>>>> 32 and 64-bit with RUNTESTFLAGS="--target_board=unix\{-m32,-m64\}".
>> >>>>>
>> >>>>> It would be good to enable this for additional targets as a follow on,
>> >>>>> but it needs more testing for both correctness and performance on those
>> >>>>> other targets (i.e for correctness because I see a number of places
>> >>>>> in other config/*/*.c files that do some special handling under this
>> >>>>> option for different targets or simply disable it, so I am not sure
>> >>>>> how well-tested it is under different architectural constraints).
>> >>>>>
>> >>>>> Ok for trunk?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Teresa
>> >>>>>
>> >>>>> 2013-11-19  Teresa Johnson  <tejohn...@google.com>
>> >>>>>
>> >>>>>          * common/config/i386/i386-common.c: Enable
>> >>>>>          -freorder-blocks-and-partition at -O2 and up for x86.
>> >>>>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>> >>>>>          partition was set on command line.
>> >>>>
>> >>>>
>> >>>> You probably mis doc/invoke.texi update.
>> >>>> Thank you for working on this!
>> >>>
>> >>>
>> >>> Yes, thanks. Here is the patch with the invoke.texi update.
>> >>>
>> >>> Teresa
>> >>>
>> >>>
>> >>> 2013-11-19  Teresa Johnson  <tejohn...@google.com>
>> >>>
>> >>>          * common/config/i386/i386-common.c: Enable
>> >>>          -freorder-blocks-and-partition at -O2 and up for x86.
>> >>>          * doc/invoke.texi: Update -freorder-blocks-and-partition 
>> >>> default.
>> >>>          * opts.c (finish_options): Only warn if -freorder-blocks-and-
>> >>>          partition was set on command line.
>> >>
>> >> This is fine.  Glad to see we're getting some good benefits from this 
>> >> code.
>> >
>> > Ok, thanks. It is in as r205058.
>> >
>> >>
>> >> FWIW, I would expect the PA to show benefits from this work -- HP showed
>> >> good results for block positioning and procedure splitting eons ago.  A
>> >> secondary effect you would expect to see would be an increase in the 
>> >> number
>> >> of long branch stubs (static count), but a decrease in the number of those
>> >> stubs that actually execute.  Measuring those effects would obviously be
>> >> out-of-scope.
>> >>
>> >> I totally understand that the PA is irrelevant these days, but seeing good
>> >> results on another architecture would go a long way to answering the
>> >> question "should this be enabled by default across the board?"
>> >
>> > Yep, we saw benefits for IPF on hp-ux as well. It would be great to
>> > eventually enable this across the board and only selectively disable.
>> > I know there were people interested in trying this with ARM, that
>> > would be a good relevant arch to try this with now to see if it can be
>> > enabled more widely.
>> >
>> > Teresa
>> >
>> >>
>> >> jeff
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
>
>
>



-- 
Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413

Reply via email to