http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194

           Summary: Trivially stupid inlining decisions
           Product: gcc
           Version: 4.5.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: torva...@linux-foundation.org


So the gcc inlining heuristics seem to include the rule that if a function is
static and called from a single call site, then it should be inlined.

That is actually really horrible for some cases. 

In particular, if you have a small function that has an uncommon case handled
by a much bigger function, you absolutely do *not* want to inline the big
function because it ends up creating a total monster of a stack frame -
something that the small function on its own doesn't need.
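
A minimal sketch of the shape of the problem (hypothetical names, not the
actual kernel code):

    /* "slow_path" is static with a single call site, so gcc inlines it
     * into "fast_path" - and fast_path then inherits the big stack
     * frame even though slow_path almost never runs. */
    static int rare_event;                  /* almost never set */

    static void slow_path(void)
    {
            char scratch[256];              /* big locals only the slow path needs */
            int i;

            for (i = 0; i < 256; i++)
                    scratch[i] = (char)i;
            rare_event = scratch[255];      /* keep the buffer live */
    }

    void fast_path(int *counter)
    {
            if (--*counter == 0 && __builtin_expect(rare_event, 0))
                    slow_path();
    }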

So for example, in the kernel we have several of these kinds of situations,
where we decrement a refcount or a lock, and then in the unlikely case that
some condition is true, we need to do much more processing as a result. An
example of this would be "__rcu_read_unlock()", which basically does

 - decrement the per-thread "rcu_read_lock_nesting" variable
 - if that variable is now zero, *and* we have a pending RCU event, we need to
do some special complicated stuff.

The code is basically written (we have various barriers and helper macros etc.
that are irrelevant to the inlining issue) as

    --t->rcu_read_lock_nesting;
    if (t->rcu_read_lock_nesting == 0 &&
        __builtin_expect(!!t->rcu_read_unlock_special, 0))
            rcu_read_unlock_special(t);

(where "t" is the thread pointer).

And that's all. However, because gcc inlines the special case, the function
prologue ends up looking like this (the "current_task" access is from our
inline asm; it's just us loading the thread pointer):

  __rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        subq    $48, %rsp       #,
        movq    %r13, -24(%rbp) #,
        movq    %rbx, -40(%rbp) #,
        movq    %r12, -32(%rbp) #,
        movq    %r14, -16(%rbp) #,
        movq    %r15, -8(%rbp)  #,
        movq %gs:current_task,%r13      #, t
        decl    256(%r13)       # t_15->rcu_read_lock_nesting
        ...

which is pretty horrid. It uses a lot of stack, and has stupid useless
instructions.

Now, I can just mark that rcu_read_unlock_special() function as "noinline", and
as a result I get this:

__rcu_read_unlock:
        pushq   %rbp    #
        movq    %rsp, %rbp      #,
        movq %gs:current_task,%rdi      #, t
        decl    256(%rdi)       # t_15->rcu_read_lock_nesting
        ...

which is obviously much superior code for the case that actually matters.
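
For reference, the annotation is just gcc's noinline attribute; a sketch,
using the kernel's "noinline" macro spelling:

    struct task_struct;             /* opaque for this sketch */

    #define noinline __attribute__((noinline))  /* as in the kernel headers */

    /* Keeping the cold slow path out of line, even though it is static
     * with a single caller: */
    static noinline void rcu_read_unlock_special(struct task_struct *t);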

So rather than having to find all of these things manually (yes, they really
do show up in profiles - the stack expansion in particular ends up showing up
as extra cost), maybe gcc could just have a simple check (a rough sketch in
code follows the list):

 - if the call is in an unlikely section
 - .. and the callee is much bigger (in stack frame or whatever) than the
caller
 - .. simply don't inline
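
In rough pseudo-C, the check could look like this (the names and the size
threshold are made up for illustration, not actual gcc internals):

    /* Hypothetical sketch of the proposed inlining check: */
    static int want_inline(int call_site_is_unlikely,
                           int caller_frame_size, int callee_frame_size)
    {
            /* cold call site, and a callee much bigger than the caller:
             * keep it out of line so the hot path keeps its small
             * stack frame */
            if (call_site_is_unlikely &&
                callee_frame_size > 4 * caller_frame_size)
                    return 0;
            /* otherwise defer to the existing heuristics */
            return 1;
    }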

Hmm? I realize that this may sound specialized, but I've been looking at kernel
profiles for the last few weeks now, and this particular issue has come up in
two independent hotspots. It turns out that excessive stack use is *expensive*.
It's not just the normal "the kernel stack is limited", it's actually a matter
of "the L1 D$ is pretty small, and a big stack frame actually causes real
costs".

I really didn't expect register saves on the stack to show up in my profiles,
but they actually do. I expect the stack to basically always be in the L1 D$
and dirty (so that an access to it should be almost free), but that is only
true for a _dense_ stack. Not for leaf functions that then have extra
stack-frame-wasting code in them.

Comments?
