http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
Summary: Missed optimization of pointer arithmetic
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: piotr.wyder...@gmail.com

The following example comes from x64, but I believe the problem shows up on
other architectures too. I have implemented an object recycler based on a
circular buffer of pointers with a single cursor. N is the capacity of the
buffer; powers of two are highly preferred, so let us assume N = 16.

template <unsigned int N>
struct R {

    void*  m_Data[N];
    void** m_Cursor;

    void xxx_release(void* p) __attribute__((__noinline__));
};

template <unsigned int N>
void R<N>::xxx_release(void* p) {

    m_Cursor  = m_Data + ((m_Cursor + 1 - m_Data) % N);
    *m_Cursor = p;
}

int main(int argc, char *argv[]) {

    R<16> rrr;
    rrr.m_Cursor = rrr.m_Data;   // start the cursor at the first slot
    rrr.xxx_release(0);
    return 0;
}

This generates (-O3 -msse -msse2 -msse4.2 -mfpmath=sse -march=core2
-mtune=core2):

000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
  41a910:  48 8b 97 80 00 00 00   mov    0x80(%rdi),%rdx
  41a917:  48 83 c2 08            add    $0x8,%rdx
  41a91b:  48 29 fa               sub    %rdi,%rdx
  41a91e:  48 89 d0               mov    %rdx,%rax
  41a921:  48 c1 fa 3f            sar    $0x3f,%rdx
  41a925:  48 c1 ea 3c            shr    $0x3c,%rdx
  41a929:  48 c1 f8 03            sar    $0x3,%rax
  41a92d:  48 01 d0               add    %rdx,%rax
  41a930:  83 e0 0f               and    $0xf,%eax
  41a933:  48 29 d0               sub    %rdx,%rax
  41a936:  48 8d 14 c7            lea    (%rdi,%rax,8),%rdx
  41a93a:  48 89 34 c7            mov    %rsi,(%rdi,%rax,8)
  41a93e:  48 89 97 80 00 00 00   mov    %rdx,0x80(%rdi)
  41a945:  c3                     retq

This sequence is far from optimal. The compiler should not move the pointer
arithmetic into the type-independent integer domain (i.e. where
(p + 1 - p) == 1). It should instead notice that the actual increment, which
is M = sizeof(void*), is a power of 2 (here 8), so the modulo computed in the
integer domain for (p + 1) % N is the same as the modulo computed directly on
the byte address, (((char*) p) + M) % (N * M). To cut a long story short:
instead of the above, it should just generate:

000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
        mov    0x80(%rdi),%rdx
        add    $0x08,%rdx
        sub    %rdi,%rdx
        and    $0x7f,%rdx
        add    %rdi,%rdx
        mov    %rsi,(%rdx)
        mov    %rdx,0x80(%rdi)
        retq

I'm not sure I understand what GCC is trying to achieve with the shifts.
Secondly, my example above uses only simple addressing modes, but GCC uses
the most complex mode twice: in the lea and in the subsequent mov. A complex
lea has a latency of 3 cycles on Sandy Bridge according to the Intel
Optimization Manual, section 3.5.1.3, so it is best avoided.
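A plausible reading of the shifts (my reconstruction for clarity, not
confirmed against GCC's internals): m_Cursor + 1 - m_Data has the signed
type ptrdiff_t, and C++ requires % to truncate toward zero, so GCC emits its
generic fixup for a possibly negative dividend. In C++ the emitted sequence
corresponds to something like:

#include <stddef.h>

// Reconstruction of the sar/shr/add/and/sub sequence: the C-style truncated
// signed remainder d % 16, whose result must carry the sign of the dividend.
// Assumes a 64-bit ptrdiff_t and GCC's arithmetic right shift of negative
// values; d is the element difference after the sar $0x3 scaling step.
ptrdiff_t mod16(ptrdiff_t d) {
    ptrdiff_t bias = (d >> 63) & 15;  // sar $0x3f; shr $0x3c: 15 if d < 0, else 0
    return ((d + bias) & 15) - bias;  // add %rdx,%rax; and $0xf,%eax; sub %rdx,%rax
}

Since m_Cursor can only validly point into m_Data, the dividend is in fact
never negative, so the whole sign fixup is dead weight; that is the
invariant the optimizer apparently fails to exploit.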
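For completeness, here is a source-level sketch of the byte-domain
computation described above (the name R2, the uintptr_t detour and the
explicit instantiation are mine, for illustration only); it should compile
to essentially the short sequence I suggested:

#include <stdint.h>

template <unsigned int N>
struct R2 {                      // same layout as R<N> above

    void*  m_Data[N];
    void** m_Cursor;

    void xxx_release(void* p) __attribute__((__noinline__));
};

template <unsigned int N>
void R2<N>::xxx_release(void* p) {

    // Byte offset of the next slot from the start of the buffer; the cursor
    // always points into m_Data, so the difference cannot be negative.
    uintptr_t off = (uintptr_t)(m_Cursor + 1) - (uintptr_t)m_Data;

    // N * M = 16 * 8 = 128 is a power of two, so the wrap is a single mask:
    // the and $0x7f from the suggested code, with no sign fixup.
    off &= N * sizeof(void*) - 1;

    m_Cursor  = (void**)((char*)m_Data + off);
    *m_Cursor = p;
}

template struct R2<16>;          // force instantiation for inspection

The detour through uintptr_t merely makes explicit the byte arithmetic the
optimizer would have to introduce itself.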