http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49028
Summary: Missed optimization of pointer arithmetic
Product: gcc
Version: 4.6.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: rtl-optimization
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: piotr.wyder...@gmail.com

The following example comes from x64, but I believe the problem shows up on
other architectures too. I have implemented an object recycler based on a
circular buffer of pointers with a single cursor. N is the capacity of the
buffer; powers of two are highly preferred, so let us assume N = 16.

template <unsigned int N>
struct R {

    void*  m_Data[N];
    void** m_Cursor;

    void xxx_release(void* p) __attribute__((__noinline__));
};

template <unsigned int N>
void R<N>::xxx_release(void* p) {

    m_Cursor  = m_Data + ((m_Cursor + 1 - m_Data) % N);
    *m_Cursor = p;
}

int main(int argc, char *argv[]) {

    R<16> rrr;
    rrr.m_Cursor = rrr.m_Data;   // start the cursor at the first slot
    rrr.xxx_release(0);
    return 0;
}

This generates (-O3 -msse -msse2 -msse4.2 -mfpmath=sse -march=core2
-mtune=core2):

000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
  41a910:  48 8b 97 80 00 00 00   mov    0x80(%rdi),%rdx
  41a917:  48 83 c2 08            add    $0x8,%rdx
  41a91b:  48 29 fa               sub    %rdi,%rdx
  41a91e:  48 89 d0               mov    %rdx,%rax
  41a921:  48 c1 fa 3f            sar    $0x3f,%rdx
  41a925:  48 c1 ea 3c            shr    $0x3c,%rdx
  41a929:  48 c1 f8 03            sar    $0x3,%rax
  41a92d:  48 01 d0               add    %rdx,%rax
  41a930:  83 e0 0f               and    $0xf,%eax
  41a933:  48 29 d0               sub    %rdx,%rax
  41a936:  48 8d 14 c7            lea    (%rdi,%rax,8),%rdx
  41a93a:  48 89 34 c7            mov    %rsi,(%rdi,%rax,8)
  41a93e:  48 89 97 80 00 00 00   mov    %rdx,0x80(%rdi)
  41a945:  c3                     retq

This sequence is far from optimal. The compiler should not move the pointer
arithmetic into the type-independent integer domain (i.e. where
(p + 1 - p) == 1). It should instead notice that the actual increment, which
is M = sizeof(void*), is a power of 2 (here 8), so the modulo computed in the
integer domain for (p + 1) % N is the same as the modulo computed directly on
the byte address, (((char*) p) + M) % (N * M). To cut a long story short:
instead of the above, it should just generate:

000000000041a910 <_ZN1RILj16EE11xxx_releaseEPv>:
        mov    0x80(%rdi),%rdx
        add    $0x08,%rdx
        sub    %rdi,%rdx
        and    $0x7f,%rdx
        add    %rdi,%rdx
        mov    %rsi,(%rdx)
        mov    %rdx,0x80(%rdi)
        retq

I'm not sure I understand what GCC is trying to achieve with the shifts.
Secondly, my example above uses only simple addressing modes, but GCC uses
the most complex mode twice: in the lea and in the subsequent mov. A complex
lea has a latency of 3 cycles on Sandy Bridge according to the Intel
Optimization Manual, section 3.5.1.3, so it is best avoided.
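A plausible reading of the shifts (my reconstruction for clarity, not
confirmed against GCC's internals): m_Cursor + 1 - m_Data has the signed
type ptrdiff_t, and C++ requires % to truncate toward zero, so GCC emits its
generic fixup for a possibly negative dividend. In C++ the emitted sequence
corresponds to something like:

#include <stddef.h>

// Reconstruction of the sar/shr/add/and/sub sequence: the C-style truncated
// signed remainder d % 16, whose result must carry the sign of the dividend.
// Assumes a 64-bit ptrdiff_t and GCC's arithmetic right shift of negative
// values; d is the element difference after the sar $0x3 scaling step.
ptrdiff_t mod16(ptrdiff_t d) {
    ptrdiff_t bias = (d >> 63) & 15;  // sar $0x3f; shr $0x3c: 15 if d < 0, else 0
    return ((d + bias) & 15) - bias;  // add %rdx,%rax; and $0xf,%eax; sub %rdx,%rax
}

Since m_Cursor can only validly point into m_Data, the dividend is in fact
never negative, so the whole sign fixup is dead weight; that is the
invariant the optimizer apparently fails to exploit.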
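For completeness, here is a source-level sketch of the byte-domain
computation described above (the name R2, the uintptr_t detour and the
explicit instantiation are mine, for illustration only); it should compile
to essentially the short sequence I suggested:

#include <stdint.h>

template <unsigned int N>
struct R2 {                      // same layout as R<N> above

    void*  m_Data[N];
    void** m_Cursor;

    void xxx_release(void* p) __attribute__((__noinline__));
};

template <unsigned int N>
void R2<N>::xxx_release(void* p) {

    // Byte offset of the next slot from the start of the buffer; the cursor
    // always points into m_Data, so the difference cannot be negative.
    uintptr_t off = (uintptr_t)(m_Cursor + 1) - (uintptr_t)m_Data;

    // N * M = 16 * 8 = 128 is a power of two, so the wrap is a single mask:
    // the and $0x7f from the suggested code, with no sign fixup.
    off &= N * sizeof(void*) - 1;

    m_Cursor  = (void**)((char*)m_Data + off);
    *m_Cursor = p;
}

template struct R2<16>;          // force instantiation for inspection

The detour through uintptr_t merely makes explicit the byte arithmetic the
optimizer would have to introduce itself.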