On Fri, Sep 27, 2013 at 8:36 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> On Fri, Sep 27, 2013 at 1:56 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
>> > Hi,
>> > this is the second part of the generic tuning changes sanitizing the
>> > tuning flags.
>> > This patch again is supposed to deal with the "obvious" part only.
>> > I will send a separate patch for more changes.
>> >
>> > The flags changed agree on all CPUs considered for generic (and their
>> > optimization manuals) + amdfam10, core2 and Atom SLM.
>> >
>> > I also added X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL to bobcat tuning, since
>> > it seems like an obvious omission (after double checking in the
>> > optimization manual) and dropped X86_TUNE_FOUR_JUMP_LIMIT for bulldozer
>> > cores. Implementation of this feature was always a bit weird and its main
>> > purpose was to avoid terrible branch predictor degeneration on the older
>> > AMD branch predictors. I benchmarked both spec2k and 2k6 to verify there
>> > are no regressions.
>> >
>> > Especially X86_TUNE_REASSOC_FP_TO_PARALLEL seems to bring nice
>> > improvements in specfp benchmarks.
>> >
>> > Bootstrapped/regtested x86_64-linux, will wait for comments and commit it
>> > during the weekend. I will be happy to revisit any of the generic tuning
>> > if regressions pop up.
>> >
>> > Overall this patch also brings small code size improvements for smaller
>> > loads/stores and less padding at -O2. Differences are sub 0.1% however.
>> >
>> > Honza
>> >
>> >         * x86-tune.def (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL): Enable for
>> >         generic.
>> >         (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL): Likewise.
>> >         (X86_TUNE_FOUR_JUMP_LIMIT): Drop for generic and bulldozer.
>> >         (X86_TUNE_PAD_RETURNS): Drop for newer AMD chips.
>>
>> Can we drop generic on X86_TUNE_PAD_RETURNS?
> It is on my list for not-so-obvious changes. I tested and removed it from
> BDVER with the intention to drop it from generic. But after further testing
> I lean towards keeping it for some extra time.
>
> I tested it on fam10 machines and it causes over 10% regressions on some
> benchmarks, including bzip and botan (where it is up to a 4-fold regression).
> Missing a return on amdfam10 hardware is bad, because it causes the return
> stack to go out of sync. At the same time I cannot really measure benefits
> from disabling it - the code size cost is very small and the runtime cost on
> non-amdfam10 cores is not important either, since the function call overhead
> hides the extra nop quite easily.

I see.

> So I would be inclined to apply extra care on this flag and keep it for an
> extra release or two. Most of gcc.opensuse.org testing runs on these and
> adding random branch mispredictions will trash them.
>
> On a related note, what would you think of X86_TUNE_PARTIAL_FLAG_REG_STALL?
> I benchmarked it on my I5 notebook and it seems to have no measurable effects
> on spec2k6.
>
> I also did some benchmarking of the patch to disable alignments you proposed.
> Unfortunately I can measure slowdowns on fam10/bdver, and on botan/hand
> written loops even for core.

I am not surprised about hand written loops. Have you
tried SPEC CPU rate?

> I am considering dropping the branch target/function alignment and keeping
> only loop alignment, but I did not test this yet.
>
> Honza

--
H.J.
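
To make concrete what "enable for generic" or "drop for bdver" means in this thread, here is a minimal sketch in C of a per-CPU tuning table in the X-macro style. It is not the contents of x86-tune.def: the names `cpu`, `TUNE_TABLE`, `tune_enabled` and `tune_flag_name` are hypothetical, the CPU list is cut down to three entries, and the masks are illustrative. Only the generic and bdver columns follow what the messages above state; the amdfam10 column is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical processor indices; a real compiler tracks many more. */
enum cpu { CPU_AMDFAM10, CPU_BDVER, CPU_GENERIC, CPU_COUNT };

/* One bit per processor. */
#define M(c) (1u << (c))

/* One row per tuning flag: identifier, short name, mask of CPUs enabling it.
   Masks are illustrative (assumption), arranged to match the thread:
   - pad_returns: kept for generic, removed from bdver;
   - four_jump_limit: dropped for generic and bdver;
   - sse_unaligned_load_optimal: enabled for generic.  */
#define TUNE_TABLE(X)                                                  \
  X(TUNE_PAD_RETURNS, "pad_returns", M(CPU_AMDFAM10) | M(CPU_GENERIC)) \
  X(TUNE_FOUR_JUMP_LIMIT, "four_jump_limit", M(CPU_AMDFAM10))          \
  X(TUNE_SSE_UNALIGNED_LOAD_OPTIMAL, "sse_unaligned_load_optimal",     \
    M(CPU_AMDFAM10) | M(CPU_BDVER) | M(CPU_GENERIC))

/* Expand the table into an enum of flag identifiers. */
enum tune_flag {
#define X(id, name, mask) id,
  TUNE_TABLE(X)
#undef X
  TUNE_COUNT
};

/* Expand the same table into parallel arrays of masks and names. */
static const unsigned tune_mask[TUNE_COUNT] = {
#define X(id, name, mask) (mask),
  TUNE_TABLE(X)
#undef X
};

static const char *const tune_name[TUNE_COUNT] = {
#define X(id, name, mask) name,
  TUNE_TABLE(X)
#undef X
};

/* True if flag F is switched on when tuning for processor C. */
bool tune_enabled(enum tune_flag f, enum cpu c)
{
  assert(f < TUNE_COUNT && c < CPU_COUNT);
  return (tune_mask[f] & M(c)) != 0;
}

/* Short name of flag F, for diagnostics. */
const char *tune_flag_name(enum tune_flag f)
{
  assert(f < TUNE_COUNT);
  return tune_name[f];
}
```

With this layout, dropping a flag for one processor is a one-bit change in a single row, and a query such as `tune_enabled(TUNE_PAD_RETURNS, CPU_BDVER)` answers whether code tuned for that processor gets the padding.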