Hello,

I've come across an issue while working on a smart pointer implementation. GCC does not seem to propagate constants far enough and misses some optimization opportunities. I don't think the issue is specific to smart pointers, so there may be other cases where GCC generates suboptimal code.

A simple test case is attached. The smart pointer here is a unique pointer: at any time only a single instance holds the raw pointer to the resource. Deletion can be customized through a policy class. In main(), I allocate an int, then pass it through several smart pointers, so that at the end only the last smart pointer holds the raw pointer to the allocated memory.

Compiled as:
g++ -g -O3 -o gccoptbug.o -c gccoptbug.cpp
g++ -o gccoptbug gccoptbug.o

The generated code on AMD64 looks like this:
   0x00000000004004d0 <+0>:    sub    $0x8,%rsp
   0x00000000004004d4 <+4>:    mov    $0x4,%edi
   0x00000000004004d9 <+9>:    callq  0x4004c0 <_Znwm@plt> ; operator new
   0x00000000004004de <+14>:    mov    %rax,%rdi
   0x00000000004004e1 <+17>:    callq  0x4004a0 <_ZdlPv@plt> ; operator delete
   0x00000000004004e6 <+22>:    xor    %edi,%edi
   0x00000000004004e8 <+24>:    callq  0x4004a0 <_ZdlPv@plt>
   0x00000000004004ed <+29>:    xor    %edi,%edi
   0x00000000004004ef <+31>:    callq  0x4004a0 <_ZdlPv@plt>
   0x00000000004004f4 <+36>:    xor    %edi,%edi
   0x00000000004004f6 <+38>:    callq  0x4004a0 <_ZdlPv@plt>
   0x00000000004004fb <+43>:    xor    %edi,%edi
   0x00000000004004fd <+45>:    callq  0x4004a0 <_ZdlPv@plt>
   0x0000000000400502 <+50>:    xor    %eax,%eax
   0x0000000000400504 <+52>:    add    $0x8,%rsp
   0x0000000000400508 <+56>:    retq

The allocated memory is freed, then op delete is called four times with a null pointer. The dtor and the deleter fn it calls were both inlined. So far so good.
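To make the expected call pattern concrete, here is a small instrumented variant of the test case. It is only a sketch, not the attached file: CountingDeleter and run_chain() are names made up for illustration, and the pointers are direct-initialized so that the count does not depend on copy elision.

```cpp
#include <cassert>

// Hypothetical counting policy (not in the attached test case): records how
// often Delete() is called with a null and with a non-null pointer.
template<typename T>
struct CountingDeleter
{
	static int nullCalls;
	static int liveCalls;

	static void Delete(T* p_)
	{
		if (p_)
			++liveCalls;
		else
			++nullCalls;
		delete p_; // deleting a null pointer is a no-op
	}
};

template<typename T> int CountingDeleter<T>::nullCalls = 0;
template<typename T> int CountingDeleter<T>::liveCalls = 0;

// Same ownership-transferring pointer as in the test case.
template<typename T, class D>
class Ptr
{
public:
	Ptr(T* p_) : m_ptr(p_) {}
	Ptr(const Ptr& p_) : m_ptr(p_.Forget()) {}
	~Ptr() { D::Delete(m_ptr); }

	// Gives up ownership: returns the raw pointer and leaves null behind.
	T* Forget() const
	{
		T* s = m_ptr;
		m_ptr = 0;
		return s;
	}

private:
	mutable T* m_ptr;
};

// Hypothetical helper: runs the same five-pointer chain as main() in the
// test case and lets the destructors fire at the end of the scope.
void run_chain()
{
	typedef Ptr<int, CountingDeleter<int> > MyPtr;

	MyPtr p0(new int);
	MyPtr p1(p0);
	MyPtr p2(p1);
	MyPtr p3(p2);
	MyPtr p4(p3);
}
```

After one call to run_chain(), liveCalls should be 1 and nullCalls should be 4, which matches the one real and four null calls to op delete in the listing above.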

If I modify the deleter policy to call op delete only when the pointer is not zero (#if 1 at line 6), the generated code changes to:
   0x00000000004004d0 <+0>:    sub    $0x58,%rsp
   0x00000000004004d4 <+4>:    mov    $0x4,%edi
   0x00000000004004d9 <+9>:    callq  0x4004c0 <_Znwm@plt>
   0x00000000004004de <+14>:    lea    0x40(%rsp),%rdi
   0x00000000004004e3 <+19>:    mov    %rax,0x40(%rsp)
   0x00000000004004e8 <+24>:    movq   $0x0,(%rsp)
   0x00000000004004f0 <+32>:    movq   $0x0,0x10(%rsp)
   0x00000000004004f9 <+41>:    movq   $0x0,0x20(%rsp)
   0x0000000000400502 <+50>:    movq   $0x0,0x30(%rsp)
   0x000000000040050b <+59>:    callq  0x400630 <Ptr<int, Deleter<int> >::~Ptr()>
   0x0000000000400510 <+64>:    lea    0x30(%rsp),%rdi
   0x0000000000400515 <+69>:    callq  0x400630 <Ptr<int, Deleter<int> >::~Ptr()>
   0x000000000040051a <+74>:    lea    0x20(%rsp),%rdi
   0x000000000040051f <+79>:    callq  0x400630 <Ptr<int, Deleter<int> >::~Ptr()>
   0x0000000000400524 <+84>:    lea    0x10(%rsp),%rdi
   0x0000000000400529 <+89>:    callq  0x400630 <Ptr<int, Deleter<int> >::~Ptr()>
   0x000000000040052e <+94>:    mov    %rsp,%rdi
   0x0000000000400531 <+97>:    callq  0x400630 <Ptr<int, Deleter<int> >::~Ptr()>
   0x0000000000400536 <+102>:    xor    %eax,%eax
   0x0000000000400538 <+104>:    add    $0x58,%rsp
   0x000000000040053c <+108>:    retq

Instead of eliminating the calls to op delete, the actual smart ptr objects appear on the stack, and the dtor is not inlined anymore.

gcc 4.4 and 4.5 optimize as expected:
   0x0000000000400640 <+0>:    sub    $0x8,%rsp
   0x0000000000400644 <+4>:    mov    $0x4,%edi
   0x0000000000400649 <+9>:    callq  0x400540 <_Znwm@plt>
   0x000000000040064e <+14>:    test   %rax,%rax
   0x0000000000400651 <+17>:    je     0x40065b <main()+27>
   0x0000000000400653 <+19>:    mov    %rax,%rdi
   0x0000000000400656 <+22>:    callq  0x400510 <_ZdlPv@plt>
   0x000000000040065b <+27>:    xor    %eax,%eax
   0x000000000040065d <+29>:    add    $0x8,%rsp
   0x0000000000400661 <+33>:    retq

4.6 and 4.7 (r182889) generate the suboptimal code shown above.

I've checked Bugzilla, and I guess #46076 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46076) is related. There, Jan Hubicka (2010-10-19 03:20:48 UTC) writes that main() is optimized for size. To check this, I've added foo() to the test case, and it is optimized correctly with 4.6 and 4.7. Moreover, -Os produces the same foo() and main() functions. However, the size-optimized version is more than 3 times as large as the other one. Is this normal?

Regards, Peter

template<typename T>
struct Deleter
{
	static void Delete(T* p_)
	{
#if 0 // if enabled, Delete() is not inlined
		if (p_)
#endif
			delete p_;
	}
};

template<typename T, class D = Deleter<T> >
class Ptr
{
public:
	Ptr() :	m_ptr(0)
	{
	}

	Ptr(T* p_) : m_ptr(p_)
	{
	}

	Ptr(const Ptr& p_) : m_ptr(p_.Forget())
	{
	}

	~Ptr()
	{
		D::Delete(m_ptr);
	}

	T* Forget() const
	{
		T* s = m_ptr;
		m_ptr = 0;
		return s;
	}

private:
	mutable T*	m_ptr;
};

int main()
{
	typedef Ptr<int> MyPtr;

	MyPtr p0 = new int;
	MyPtr p1 = p0;
	MyPtr p2 = p1;
	MyPtr p3 = p2;
	MyPtr p4 = p3;
}
