Re: [RFC][AArch64] function prologue analyzer in linux kernel

AKASHI Takahiro Wed, 13 Jan 2016 00:14:51 -0800

On 01/13/2016 03:04 AM, Will Deacon wrote:

On Tue, Jan 12, 2016 at 03:11:29PM +0900, AKASHI Takahiro wrote:

Will,


On 01/09/2016 12:53 AM, Will Deacon wrote:

On Fri, Jan 08, 2016 at 02:36:32PM +0900, AKASHI Takahiro wrote:

On 01/07/2016 11:56 PM, Richard Earnshaw (lists) wrote:

On 07/01/16 14:22, Will Deacon wrote:

On Thu, Dec 24, 2015 at 04:57:54PM +0900, AKASHI Takahiro wrote:

So I'd like to introduce a function prologue analyzer to determine
the size allocated by a function's prologue and deduct it from "Depth".
My implementation of this analyzer has been submitted to the
linux-arm-kernel mailing list[1].
I borrowed some ideas from gdb's analyzer[2], especially the instruction
decoding loop and stopping the decoding at the end of a basic block,
but implemented my own simplified one because the gdb version seems to do
a bit more than what we expect here.
Anyhow, since it is somewhat heuristic (and may not be maintainable in
the long term), could you review it from the broader viewpoint of the
toolchain, please?
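
For illustration only, the kind of decoding loop described above could be sketched roughly as follows. This is not the code submitted to linux-arm-kernel; the function names are hypothetical, and it only recognizes a pre-indexed "stp ..., [sp,#-imm]!" and "sub sp, sp, #imm", stopping at the first branch instruction as a stand-in for the end of the basic block.

```c
#include <stddef.h>
#include <stdint.h>

/* Return non-zero for instructions that end a basic block. */
static int is_branch(uint32_t insn)
{
    return (insn & 0x7c000000u) == 0x14000000u ||  /* b, bl */
           (insn & 0xff000010u) == 0x54000000u ||  /* b.cond */
           (insn & 0x7e000000u) == 0x34000000u ||  /* cbz, cbnz */
           (insn & 0x7e000000u) == 0x36000000u ||  /* tbz, tbnz */
           (insn & 0xff9ffc1fu) == 0xd61f0000u;    /* br, blr, ret */
}

/*
 * Hypothetical helper: sum the bytes that the prologue subtracts from sp
 * with immediate operands. Register-sized adjustments (e.g. for b[a]) are
 * deliberately not counted.
 */
static long prologue_frame_size(const uint32_t *insns, size_t count)
{
    long size = 0;

    for (size_t i = 0; i < count; i++) {
        uint32_t insn = insns[i];
        unsigned int rn = (insn >> 5) & 0x1f;
        unsigned int rd = insn & 0x1f;

        if (is_branch(insn))
            break;

        if ((insn & 0xffc00000u) == 0xa9800000u && rn == 31) {
            /* stp Xt, Xt2, [sp, #imm]! : imm7 is signed, scaled by 8 */
            int imm7 = (insn >> 15) & 0x7f;

            if (imm7 & 0x40)
                imm7 -= 0x80;
            size -= (long)imm7 * 8;
        } else if ((insn & 0xff800000u) == 0xd1000000u &&
                   rn == 31 && rd == 31) {
            /* sub sp, sp, #imm12 (optionally shifted left by 12) */
            long imm = (insn >> 10) & 0xfff;

            if (insn & (1u << 22))
                imm <<= 12;
            size += imm;
        }
    }
    return size;
}
```

Fed the first six instructions of a function that opens with "stp x29, x30, [sp,#-32]!" and reaches a "bl" a few instructions later, this sketch stops at the "bl" and reports 32.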

My main issue with this is that we cannot rely on the frame layout
generated by the compiler and there's little point in asking for
commitment here. Therefore, the heuristics will need updating as and
when we identify new frames that we can't handle. That's pretty fragile
and puts us on the back foot when faced with newer compilers. This might
be sustainable if we don't expect to encounter much variation, but even
that would require some sort of "buy-in" from the various toolchain
communities.

GCC already has an option (-fstack-usage) to determine the stack usage
on a per-function basis and produce a report at build time. Why can't
we use that to provide the information we need, rather than attempt to
compute it at runtime based on your analyser?

If -fstack-usage is not sufficient, understanding why might allow us to
propose a better option.


Can you not use the dwarf frame unwind data?  That's always sufficient
to recover the CFA (canonical frame address - the value in SP when
executing the first instruction in a function).  It seems to me it's
unlikely you're going to need something that's an exceedingly high
performance operation.


Thank you for your comment.
Yeah, but we need some utility routines to handle unwind data (.debug_frame).
In fact, someone has already attempted to merge (part of) libunwind into
the kernel[1], but it was rejected by the kernel community (including Linus,
if I remember correctly). It seems that they thought the code was still buggy.


The ARC guys seem to have sneaked something in for their architecture:

   http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arc/kernel/unwind.c

so it might not be impossible if we don't require all the bells and
whistles of libunwind.


Thanks. I didn't notice this code.

That is one of the reasons that I wanted to implement my own analyzer.


I still don't understand why you can't use -fstack-usage. Can you please
tell me why that doesn't work? Am I missing something?


I don't know how gcc calculates the usage here, but I guess it would be more
robust than my analyzer.

The issues that come to mind are:
- -fstack-usage generates a separate output file, *.su, so we have to
   manage these files so that they get incorporated into the kernel binary.


That doesn't sound too bad to me. How much data are we talking about here?

   This implies that the (common) kernel makefiles might have to be changed a bit.
- Even worse, what about kernel modules? We will have no way to let the kernel
   know the stack usage without adding an extra step at load time.


We can easily add a new __init section to modules, which is a table
representing the module functions and their stack sizes (like we do
for other things like alternatives). We'd just then need to slurp this
information at load time and throw it into an rbtree or something.
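
As a rough sketch of that idea, the table entries and a lookup by address might look like the following. The struct layout and names are hypothetical, and a sorted array with binary search is used here instead of an rbtree just to keep the example self-contained.

```c
#include <stddef.h>

/* Hypothetical table entry: one per function, sorted by start address. */
struct stack_usage_entry {
    unsigned long start;  /* address of the function's first instruction */
    unsigned long size;   /* bytes allocated by the function's prologue */
};

/*
 * Find the entry with the greatest start address that is <= pc.
 * Returns NULL if pc is below the first entry. Since no end addresses
 * are stored, any pc above the last entry maps to the last entry.
 */
static const struct stack_usage_entry *
stack_usage_lookup(const struct stack_usage_entry *tbl, size_t n,
                   unsigned long pc)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].start <= pc)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? &tbl[lo - 1] : NULL;
}
```

A module loader would append the module's entries at load time and remove them on unload; the lookup itself stays the same whichever container is chosen.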


I found another issue.
Consider a 'dynamic storage' case like:
$ cat stack.c
extern long fooX(long a);
extern long fooY(long b[]);

long foo1(long a) {

        if (a > 1) {
                long b[a];  <== Here

                return a + fooY(b);
        } else {
                return a + fooX(a);
        }
}

Then, -fstack-usage returns 48 for foo1():
$ aarch64-linux-gnu-gcc -fno-omit-frame-pointer -fstack-usage main.c stack.c \
      -pg -O2 -fasynchronous-unwind-tables
$ cat stack.su
stack.c:4:6:foo1        48      dynamic
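
If such a line had to be turned into a table entry at build time, a parser could be sketched as below. It assumes the file:line:column:function layout shown above followed by the size and a qualifier; the struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one line of a *.su file. */
struct su_record {
    char func[64];        /* function name */
    unsigned long size;   /* reported stack usage in bytes */
    int is_static;        /* 1 only if the qualifier is exactly "static" */
};

/* Parse "file:line:col:func <size> <qualifier>"; return 0 on success. */
static int parse_su_line(const char *line, struct su_record *rec)
{
    char qual[32];

    /* Skip the file name, line and column; keep name, size, qualifier. */
    if (sscanf(line, "%*[^:]:%*d:%*d:%63s %lu %31s",
               rec->func, &rec->size, qual) != 3)
        return -1;

    rec->is_static = (strcmp(qual, "static") == 0);
    return 0;
}
```

For the line above, this yields the name foo1, size 48, and is_static == 0, so a consumer can at least tell that the reported size is not a fixed frame size.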

This indicates that foo1() may use 48 bytes or more depending on a condition.
But in my case (ftrace-based stack tracer), I always expect 32 whether we're
backtracing from fooY() or from fooX() because my stack tracer estimates:
       (stack pointer) = (callee's frame pointer) + (callee's stack usage)
(in my previous e-mail, '-(minus)' was wrong.)

where (callee's stack usage) is, as I described in my previous e-mail, the size of
the memory which is initially allocated on the stack in a function's prologue, and
should not include the size of any dynamically allocated area.
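
To make the arithmetic concrete, here is a minimal sketch of that estimate. The helper name is hypothetical, and it assumes the callee's frame pointer sits at the bottom of the statically allocated frame, as after "stp x29, x30, [sp,#-32]!; mov x29, sp".

```c
#include <stdint.h>

/*
 * Estimate the stack pointer at the time the callee was entered:
 *   (stack pointer) = (callee's frame pointer) + (callee's stack usage)
 * where the stack usage is only what the prologue allocates.
 */
static uint64_t sp_at_call(uint64_t callee_fp, uint64_t prologue_size)
{
    return callee_fp + prologue_size;
}
```

With a frame pointer of, say, 0x7ffff000, a prologue size of 32 gives 0x7ffff020, while plugging in the 48 reported by -fstack-usage would put the estimate 16 bytes too high.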

Unfortunately, there are several places in the kernel where "b[a]"-like variable
definitions are used.

-Takahiro AKASHI

FYI,
(gdb) disas foo1
Dump of assembler code for function foo1:
   0x0000000000400758 <+0>:       stp     x29, x30, [sp,#-32]!
   0x000000000040075c <+4>:       mov     x29, sp
   0x0000000000400760 <+8>:       stp     x19, x20, [sp,#16]
   0x0000000000400764 <+12>:      mov     x19, x0
   0x0000000000400768 <+16>:      mov     x0, x30
   0x000000000040076c <+20>:      bl      0x400540 <_mcount@plt>
   0x0000000000400770 <+24>:      cmp     x19, #0x1
   0x0000000000400774 <+28>:      b.le    0x4007b0 <foo1+88>
   0x0000000000400778 <+32>:      lsl     x0, x19, #3
   0x000000000040077c <+36>:      mov     x20, sp
   0x0000000000400780 <+40>:      add     x0, x0, #0x16
   0x0000000000400784 <+44>:      and     x0, x0, #0xfffffffffffffff0
   0x0000000000400788 <+48>:      sub     sp, sp, x0
   0x000000000040078c <+52>:      mov     x0, sp
   0x0000000000400790 <+56>:      bl      0x400730 <fooY>
   0x0000000000400794 <+60>:      mov     sp, x20
   0x0000000000400798 <+64>:      mov     sp, x29
   0x000000000040079c <+68>:      add     x0, x19, x0
   0x00000000004007a0 <+72>:      ldp     x19, x20, [sp,#16]
   0x00000000004007a4 <+76>:      ldp     x29, x30, [sp],#32
   0x00000000004007a8 <+80>:      ret
   0x00000000004007ac <+84>:      nop
   0x00000000004007b0 <+88>:      mov     x0, x19
   0x00000000004007b4 <+92>:      bl      0x400708 <fooX>
   0x00000000004007b8 <+96>:      mov     sp, x29
   0x00000000004007bc <+100>:     add     x0, x19, x0
   0x00000000004007c0 <+104>:     ldp     x19, x20, [sp,#16]
   0x00000000004007c4 <+108>:     ldp     x29, x30, [sp],#32
   0x00000000004007c8 <+112>:     ret
End of assembler dump.

Will
