Re: Generic tuning in x86-tune.def 1/2

H.J. Lu Fri, 27 Sep 2013 08:47:33 -0700

On Fri, Sep 27, 2013 at 8:36 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > Hi,
>> > this is the second part of the generic tuning changes sanitizing the tuning
>> > flags.
>> > This patch again is supposed to deal with the "obvious" part only.
>> > I will send a separate patch for more changes.
>> >
>> > The changed flags agree across all CPUs considered for generic (and their
>> > optimization manuals), plus amdfam10, core2 and Atom SLM.
>> >
>> > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since it
>> > seems like an obvious omission (after double checking in the optimization
>> > manual), and dropped X86_TUNE_FOUR_JUMP_LIMIT for bulldozer cores.  The
>> > implementation of this feature was always a bit weird, and its main purpose
>> > was to avoid terrible branch predictor degeneration on the older AMD branch
>> > predictors.  I benchmarked both spec2k and 2k6 to verify there are no
>> > regressions.
>> >
>> > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice improvements
>> > in specfp benchmarks.
>> >
>> > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it
>> > during the weekend.  I will be happy to revisit any of the generic tuning if
>> > regressions pop up.
>> >
>> > Overall this patch also brings small code size improvements for smaller
>> > loads/stores and less padding at -O2.  The differences are below 0.1%,
>> > however.
>> >
>> > Honza
>> >         * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for
>> >         generic.
>> >         (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise.
>> >         (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and bulldozer.
>> >         (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips.
>>
>> Can we drop generic on X86_TUNE_PAD_RETURNS?
> It is on my list for not-so-obvious changes.  I tested and removed it from
> BDVER with the intention to drop it from generic.  But after further testing
> I lean towards keeping it for some extra time.
>
> I tested it on fam10 machines and it causes over 10% regressions on some
> benchmarks, including bzip and botan (where it is up to a 4-fold regression).
> Mispredicting a return on amdfam10 hardware is bad, because it causes the
> return stack to go out of sync.  At the same time I cannot really measure any
> benefit from disabling it: the code size cost is very small, and the runtime
> cost on non-amdfam10 cores is not important either, since the function call
> overhead hides the extra nop quite easily.


I see.

> So I am inclined to apply extra care to this flag and keep it for an extra
> release or two.  Most of gcc.opensuse.org testing runs on these and adding
> random branch mispredictions will trash them.
>
> On a related note, what would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL?
> I benchmarked it on my i5 notebook and it seems to have no measurable effect
> on spec2k6.
>
> I also did some benchmarking of the patch to disable alignments you proposed.
> Unfortunately I can measure slowdowns on fam10 and bdver, and on botan and
> hand-written loops even for core.

I am not surprised about hand written loops.  Have you
tried SPEC CPU rate?

> I am considering dropping the branch target/function alignment and keeping
> only loop alignment, but I have not tested this yet.
>
> Honza



-- 
H.J.
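
For context: entries in x86-tune.def are written as
DEF_TUNE (flag, name, mask of processors), and each flag is tested against
the processor being tuned for.  The following is a minimal, self-contained C
sketch of that pattern, not GCC's actual source.  The M_* bits, the TUNE_*
names, the helper functions and the masks are all hypothetical, chosen only to
mirror the decisions discussed above: pad_returns kept for amdfam10 and
generic but dropped for bdver, and four_jump_limit dropped for generic and
bdver.

```c
#include <assert.h>

/* Hypothetical processor bits; GCC defines its own m_* masks. */
enum cpu_mask {
  M_AMDFAM10 = 1 << 0,
  M_BDVER    = 1 << 1,
  M_GENERIC  = 1 << 2
};

/* Illustrative tuning table in X-macro style.  The masks are assumptions
   based on the thread, not the contents of the real x86-tune.def. */
#define TUNE_TABLE(DEF_TUNE) \
  DEF_TUNE (TUNE_PAD_RETURNS, "pad_returns", M_AMDFAM10 | M_GENERIC) \
  DEF_TUNE (TUNE_FOUR_JUMP_LIMIT, "four_jump_limit", M_AMDFAM10)

/* First expansion: one enumerator per table entry. */
#define AS_ENUM(id, name, mask) id,
enum tune_index { TUNE_TABLE (AS_ENUM) TUNE_LAST };
#undef AS_ENUM

/* Second expansion: the processor mask of each entry. */
#define AS_MASK(id, name, mask) (mask),
static const unsigned tune_masks[TUNE_LAST] = { TUNE_TABLE (AS_MASK) };
#undef AS_MASK

/* Third expansion: the printable name of each entry. */
#define AS_NAME(id, name, mask) name,
static const char *const tune_names[TUNE_LAST] = { TUNE_TABLE (AS_NAME) };
#undef AS_NAME

/* Return nonzero if tuning flag T is enabled for the processor bit CPU. */
static int
tune_enabled (enum tune_index t, unsigned cpu)
{
  assert ((unsigned) t < TUNE_LAST);
  return (tune_masks[t] & cpu) != 0;
}

/* Return the printable name of tuning flag T. */
static const char *
tune_name (enum tune_index t)
{
  assert ((unsigned) t < TUNE_LAST);
  return tune_names[t];
}
```

In this sketch, dropping a flag for a processor means removing that
processor's bit from the entry's mask; tune_enabled (TUNE_PAD_RETURNS,
M_BDVER) is then false while the same query for M_GENERIC stays true.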
