Hi,
Core2 used to have quite a large penalty for partial flag register stalls
caused by INC/DEC.  This was improved on Sandy Bridge, where an extra merging
uop is produced, and further on Haswell, where no extra uop is needed unless
an instruction reads both parts of the flags register.  For this reason we
can use inc/dec again on modern variants of Core.
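
To illustrate (a sketch, not from the patch; register names and the branch
target are made up): inc/dec write OF/SF/ZF/AF/PF but leave CF untouched, so
the flags can end up with two producers.  Whether that costs anything depends
on what later reads them:

```asm
        add     %ebx, %eax      # writes all flags, including CF
        inc     %ecx            # writes OF/SF/ZF/AF/PF, preserves CF
        jc      .Lcarry         # reads CF only: no merge needed on Haswell,
                                # since CF is renamed separately there
        cmovbe  %edx, %esi      # reads CF and ZF together: flags come from
                                # two producers, so a merge is required --
                                # a stall on Core2, an extra uop on Sandy
                                # Bridge, and only this "both parts" case
                                # still costs a uop on Haswell
.Lcarry:
```

GCC-generated code essentially never reads CF together with the other flags
after an inc/dec, which is why the tuning flag can be enabled for Haswell+.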

Bootstrapped/regtested x86_64-linux and tested on Haswell spec2k/spec2k6
with no measurable performance impact.

Honza

        * x86-tune.def (X86_TUNE_USE_INCDEC): Enable for Haswell+.

Index: config/i386/x86-tune.def
===================================================================
--- config/i386/x86-tune.def    (revision 254199)
+++ config/i386/x86-tune.def    (working copy)
@@ -220,10 +220,15 @@ DEF_TUNE (X86_TUNE_LCP_STALL, "lcp_stall
    as "add mem, reg".  */
 DEF_TUNE (X86_TUNE_READ_MODIFY, "read_modify", ~(m_PENT | m_LAKEMONT | m_PPRO))
 
-/* X86_TUNE_USE_INCDEC: Enable use of inc/dec instructions.   */
+/* X86_TUNE_USE_INCDEC: Enable use of inc/dec instructions.
+
+   Core2 and Nehalem have a 7-cycle stall on partial flag register accesses.
+   Sandy Bridge and Ivy Bridge generate an extra merging uop.  On Haswell
+   this extra uop is output only when the flags actually need to be merged,
+   which does not happen in GCC-generated code.  */
 DEF_TUNE (X86_TUNE_USE_INCDEC, "use_incdec",
-          ~(m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
-          |  m_KNL | m_KNM | m_GENERIC))
+          ~(m_P4_NOCONA | m_CORE2 | m_NEHALEM | m_SANDYBRIDGE
+           | m_BONNELL | m_SILVERMONT | m_INTEL | m_KNL | m_KNM | m_GENERIC))
 
 /* X86_TUNE_INTEGER_DFMODE_MOVES: Enable if integer moves are preferred
    for DFmode copies */
