Compiling the following code with -O2:

typedef unsigned long ulong;
typedef __SIZE_TYPE__ size_t;

long woo_i(long a, long b)
{
  return a/b;
}

GCC generates:

.LFB0:
        .cfi_startproc
        movq    %rdi, %rdx
        movq    %rdi, %rax
        sarq    $63, %rdx
        idivq   %rsi
        ret

but both ICC and LLVM generate a smaller and faster version:

        movq    %rdi, %rax
        cqto
        idivq   %rsi
        ret

For reference, see http://www.agner.org/optimize/instruction_tables.pdf.
On Pentium, the latency of the sign extension instruction is 3 cycles,
while on modern CPUs the instruction has only one uOp with 1 cycle latency.

The following proposed patch fixes the problem. Note that for Atom, only
the CWD instruction is slow, with 5 cycle latency; the rest of the sign
extension instructions are fast. The fix for Atom needs finer-grained
control and can be done separately.

Ok to install after testing?

Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 193861)
+++ config/i386/i386.c	(working copy)
@@ -1822,7 +1822,7 @@ static unsigned int initial_ix86_tune_fe
   m_K6,

   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_CORE2I7 | m_ATOM | m_K6 | m_GENERIC),
+  ~(m_PENT | m_ATOM | m_K6),

   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,

2010-11-30  Xinliang David Li  <davi...@google.com>

	* config/i386/i386.c: Allow sign extend instructions (cltd etc)
	on modern CPUs.

thanks,

David
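
For illustration only, here is a rough standalone sketch of how the
X86_TUNE_USE_CLTD mask change affects each tuning target. It assumes the
usual scheme in which each m_* macro is a single bit for one processor and
a tuning flag is on when that processor's bit is set in the mask. The enum
values, the M() macro and tune_enabled() are hypothetical names made up for
this sketch; they are not the real definitions in i386.c.

```c
/* Hypothetical processor indices; the real enum in GCC has more
   entries and different values.  */
enum cpu { CPU_PENT, CPU_ATOM, CPU_K6, CPU_CORE2I7, CPU_GENERIC };

/* One bit per processor, standing in for the m_* macros.  */
#define M(c) (1u << (c))

/* X86_TUNE_USE_CLTD mask before the patch.  */
static const unsigned int use_cltd_old =
  ~(M (CPU_PENT) | M (CPU_CORE2I7) | M (CPU_ATOM) | M (CPU_K6)
    | M (CPU_GENERIC));

/* X86_TUNE_USE_CLTD mask after the patch.  */
static const unsigned int use_cltd_new =
  ~(M (CPU_PENT) | M (CPU_ATOM) | M (CPU_K6));

/* Return 1 if the tuning flag described by MASK is enabled for C,
   0 otherwise.  */
static int
tune_enabled (unsigned int mask, enum cpu c)
{
  return (mask & M (c)) != 0;
}
```

With these definitions, tune_enabled (use_cltd_old, CPU_CORE2I7) is 0 and
tune_enabled (use_cltd_new, CPU_CORE2I7) is 1, and likewise for
CPU_GENERIC, while CPU_PENT, CPU_ATOM and CPU_K6 stay disabled under both
masks.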