Hi,

I've been playing around with the gcc -flto flag and the inlining functionality
for a while, in search of both optimized performance and a full understanding
of g++ behaviour.

Right now, I'm puzzled by the assembly output produced for the following code:

#include <iostream>

using namespace std;

class A
{
public:
        inline virtual void blah()
        {
                cout << "A" << endl;
        }
};


class B : public A
{
public:
        inline virtual void blah()
        {
                cout << "B" << endl;
        }
};

class C
{
public:
        void blah()
        {
                cout << "C" << endl;
        }
};

int main(int argc, char** argv)
{
        A* ptr = 0;
        if(argc == 1)
                ptr = new B();
        else
                ptr = new A();

        ptr->blah();

        B().blah();
        C().blah();
}

I would expect the compiler to be able to inline blah() when it is called
statically on objects of class B and C, but to go through the vtable for the
call ptr->blah(). Here's the relevant assembly code produced by g++ with the
flags -O3 and -S:

main:
.LFB976:
        .cfi_startproc
        subq    $24, %rsp
        .cfi_def_cfa_offset 32
        cmpl    $1, %edi
        movl    $8, %edi
        je      .L18
        call    _Znwm
        movq    %rax, %rdi
        movq    $_ZTV1A+16, (%rax)
        movl    $_ZTV1A+16, %eax
.L16:
        call    *(%rax)
        movq    %rsp, %rdi
        movq    $_ZTV1B+16, (%rsp)
        call    _ZN1B4blahEv
        movl    $.LC2, %esi
        movl    $_ZSt4cout, %edi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
        movq    %rax, %rdi
        call    _ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_
        xorl    %eax, %eax
        addq    $24, %rsp
        .cfi_remember_state
        .cfi_def_cfa_offset 8
        ret
.L18:
        .cfi_restore_state
        call    _Znwm
        movq    %rax, %rdi
        movq    $_ZTV1B+16, (%rax)
        movl    $_ZTV1B+16, %eax
        jmp     .L16
        .cfi_endproc

The puzzling part is that the call C().blah() is indeed inlined and
ptr->blah() goes through the vtable, but the call B().blah() does neither: the
static address is resolved, but the code is not inlined! (The same behaviour
occurs with a statically typed pointer to an object of class B.) I understand
that the compiler propagates the types properly, but even after determining
the correct type for the object of type B, it only resolves the vtable
reference (hence no call *(%..x)), and does not perform the inlining.

Question: why? Can someone explain to me the exact order in which g++ performs
its optimizations and how they interact with each other? I know this might be
tricky, but any light shed on it would be helpful. Also, did I miss a flag
that would enable g++ to do the inlining after the resolution?

From a practical point of view, I understand that this example does not by
itself justify an absolute need for inlining. However, I do have a
time-critical application that would get a 25-30% increase in speed if I could
solve this issue. Also, I'm curious to understand why this is the behaviour of
g++ (or whether it's actually a bug), because it runs counter to my most basic
intuition and to the beliefs of many people I know.

Thanks in advance for any answer to come.

Kind Regards

--
Thierry Lavoie, B.Ing., M.scA.
PhD. Student, Polytechnique Montreal
Lecturer INF2010: Data Structures and Algorithm
Lecturer LOG3210: Languages and Compilers
