On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
> On May 23, 2016 7:28 PM, "Josh Poimboeuf" <jpoim...@redhat.com> wrote:
> > > Maybe I'm coming around to liking this idea.
> >
> > Ok, good :-)
> >
> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > > printer, etc), unwinding across a kernel exception would look like:
> > >
> > >  - some_func
> > >  - some_other_func
> > >  - do_page_fault
> > >  - page_fault
> > >
> > > After page_fault, the next unwind step takes us to the faulting RIP
> > > (faulting_func) and reports that all GPRs are known.  It should
> > > probably learn this fact from DWARF if DWARF is available, instead of
> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > > may be incomplete.  A nice pretty printer could now print all the
> > > regs.
> > >
> > >  - faulting_func
> > >  - etc.
> > >
> > > For this to work, we don't actually need the unwinder to explicitly
> > > know where pt_regs is.
> >
> > That's true (but only for DWARF).
> >
> > > Food for thought, though: if user code does SYSENTER with TF set,
> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > > about it.  DWARF will handle it without any fanfare, but I don't know
> > > if it's going to cause trouble for you pre-DWARF.
> >
> > In this case it should see the stack pointer is past the pt_regs offset,
> > so it would just report it as an empty stack.
> 
> OK
> 
> >
> > > I'm also not sure it makes sense to apply this before the unwinder
> > > that can consume it is ready.  Maybe, if it would be consistent with
> > > your plans, it would make sense to rewrite the unwinder first, then
> > > apply this and teach live patching to use the new unwinder, and *then*
> > > add DWARF support?
> >
> > For the purposes of livepatch, the reliable unwinder needs to detect
> > whether an interrupt/exception pt_regs frame exists on a sleeping task
> > (or current).  This patch would allow us to do that.
> >
> > So my preferred order of doing things would be:
> >
> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> > 2) this patch for annotating pt_regs stack frames
> > 3) reliable unwinder, similar to what I already posted, except it relies
> >    on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
> >    the new inactive task frame format of #1
> > 4) livepatch consistency model which uses the reliable unwinder
> > 5) rewrite unwinder, and port all users to the new interface
> > 6) add DWARF unwinder
> >
> > 1-4 are pretty much already written, whereas 5 and 6 will take
> > considerably more work.
> 
> Fair enough.
> 
> >
> > > > +       /*
> > > > +        * Create a stack frame for the saved pt_regs.  This allows 
> > > > frame
> > > > +        * pointer based unwinders to find pt_regs on the stack.
> > > > +        */
> > > > +       .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +       pushq   \regs
> > > > +       pushq   $pt_regs+1
> > >
> > > Why the +1?
> >
> > Some unwinders like gdb are smart enough to report the function which
> > contains the instruction *before* the return address.  Without the +1,
> > they would show the wrong function.
> 
> Lovely.  Want to add a comment?
> 
> >
> > > > +       pushq   %rbp
> > > > +       movq    %rsp, %rbp
> > > > +#endif
> > > > +       .endm
> > >
> > > I keep wanting this to be only two pushes and to fudge rbp to make it
> > > work, but I don't see a good way.  But let's call it
> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > > nested_frame or similar.
> >
> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
> 
> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
> too.  CREATE_PT_REGS_FRAME is probably fine.
> 
> > > > +
> > > > +/* fake function which allows stack unwinders to detect pt_regs frames 
> > > > */
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +ENTRY(pt_regs)
> > > > +       nop
> > > > +       nop
> > > > +ENDPROC(pt_regs)
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > >
> > > Why is this two bytes long?  Is there some reason it has to be more
> > > than one byte?
> >
> > Similar to above, this is related to the need to support various
> > unwinders.  Whether the unwinder displays the ret_addr or the
> > instruction preceding it, either way the instruction needs to be inside
> > the function for the function to be reported.
> 
> OK.

Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even if starting from one of these other calls.

But then, things get ugly.  You have to either setup and tear down the
frame for every possible call, or do a higher-level setup/teardown
across multiple calls, which invalidates several assumptions in the
entry code about the location of pt_regs on the stack.

Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
make assumptions about the location of pt_regs.  And they're used by
both syscall and interrupt code.  So if we didn't create a frame pointer
header for syscalls, we'd basically need two versions of the macros: one
for irqs/exceptions and one for syscalls.

So I think the cleanest way to handle this is to always allocate two
extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
entry code can assume that pt_regs is at a constant location, and all
the above problems go away.  Another benefit is that we'd only need two
saves instead of three -- the pointer to pt_regs is no longer needed
since pt_regs is always immediately after the frame header.

I worked up a patch to implement this -- see below.  It writes the frame
pointer in all entry paths, including syscalls.  This helps keep the
code simple.

The downside is a small performance penalty: with getppid()-in-a-loop on
my laptop, the average syscall went from 52ns to 53ns, which is about a
2% slowdown.  But I doubt it would be measurable in a real-world
workload.

It looks like about half the slowdown is due to the extra stack
allocation (which presumably adds a little d-cache pressure on the stack
memory) and the other half is due to the stack writes.

I could remove the writes from the syscall path but it would only save
about half a ns, and it would make the code less robust.  Plus it's nice
to have the consistency of having *all* pt_regs frames annotated.

Thoughts?

----

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 9a9e588..0f6ccfc 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -88,30 +88,69 @@ For 32-bit we have the following conventions - kernel is 
built with
 #define RSP            19*8
 #define SS             20*8
 
-#define SIZEOF_PTREGS  21*8
 
-       .macro ALLOC_PT_GPREGS_ON_STACK addskip=0
-       addq    $-(15*8+\addskip), %rsp
+#ifdef CONFIG_FRAME_POINTER
+
+#define PT_REGS_OFFSET 2*8
+
+       /*
+        * Create an entry stack frame pointer header which corresponds to the
+        * saved pt_regs.  This allows frame pointer based unwinders to find
+        * pt_regs on the stack.  The frame pointer and the return address of a
+        * fake function are stored immediately before the pt_regs.
+        */
+       .macro SAVE_ENTRY_FRAME_POINTER
+       movq    %rbp,                   0*8(%rsp)
+       movq    $entry_frame_ret,       1*8(%rsp)
+       movq    %rsp, %rbp
+       .endm
+
+       .macro RESTORE_ENTRY_FRAME_POINTER
+       movq    (%rsp), %rbp
+       .endm
+
+       .macro ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+       subq    $PT_REGS_OFFSET, %rsp
+       SAVE_ENTRY_FRAME_POINTER
+       .endm
+
+#else /* !CONFIG_FRAME_POINTER */
+
+#define PT_REGS_OFFSET 0
+#define SAVE_ENTRY_FRAME_POINTER
+#define RESTORE_ENTRY_FRAME_POINTER
+#define ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
+
+#endif /* CONFIG_FRAME_POINTER */
+
+#define PT_REGS_SIZE           21*8
+#define ENTRY_FRAME_SIZE       (PT_REGS_OFFSET+PT_REGS_SIZE)
+
+#define TI_FLAGS(rsp) ASM_THREAD_INFO(TI_flags, rsp, ENTRY_FRAME_SIZE)
+#define PT_REGS(reg) reg+PT_REGS_OFFSET(%rsp)
+
+       .macro ALLOC_ENTRY_FRAME addskip=0
+       addq    $-(15*8+PT_REGS_OFFSET+\addskip), %rsp
        .endm
 
        .macro SAVE_C_REGS_HELPER offset=0 rax=1 rcx=1 r8910=1 r11=1
        .if \r11
-       movq %r11, 6*8+\offset(%rsp)
+       movq %r11, 6*8+PT_REGS_OFFSET+\offset(%rsp)
        .endif
        .if \r8910
-       movq %r10, 7*8+\offset(%rsp)
-       movq %r9,  8*8+\offset(%rsp)
-       movq %r8,  9*8+\offset(%rsp)
+       movq %r10, 7*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %r9,  8*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %r8,  9*8+PT_REGS_OFFSET+\offset(%rsp)
        .endif
        .if \rax
-       movq %rax, 10*8+\offset(%rsp)
+       movq %rax, 10*8+PT_REGS_OFFSET+\offset(%rsp)
        .endif
        .if \rcx
-       movq %rcx, 11*8+\offset(%rsp)
+       movq %rcx, 11*8+PT_REGS_OFFSET+\offset(%rsp)
        .endif
-       movq %rdx, 12*8+\offset(%rsp)
-       movq %rsi, 13*8+\offset(%rsp)
-       movq %rdi, 14*8+\offset(%rsp)
+       movq %rdx, 12*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %rsi, 13*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %rdi, 14*8+PT_REGS_OFFSET+\offset(%rsp)
        .endm
        .macro SAVE_C_REGS offset=0
        SAVE_C_REGS_HELPER \offset, 1, 1, 1, 1
@@ -128,23 +167,39 @@ For 32-bit we have the following conventions - kernel is 
built with
        .macro SAVE_C_REGS_EXCEPT_RAX_RCX_R11
        SAVE_C_REGS_HELPER 0, 0, 0, 1, 0
        .endm
+       .macro SAVE_C_REGS_EXCEPT_RAX
+       SAVE_C_REGS_HELPER rax=0
+       .endm
+
+       .macro SAVE_EXTRA_REGS offset=0 rbx=1
+       movq %r15, 0*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %r14, 1*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %r13, 2*8+PT_REGS_OFFSET+\offset(%rsp)
+       movq %r12, 3*8+PT_REGS_OFFSET+\offset(%rsp)
+#ifdef CONFIG_FRAME_POINTER
+       /* copy rbp value from entry frame header */
+       movq \offset(%rsp), %rbp
+       movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+       leaq \offset(%rsp), %rbp
+#else
+       movq %rbp, 4*8+PT_REGS_OFFSET+\offset(%rsp)
+#endif
+       .if \rbx
+       movq %rbx, 5*8+PT_REGS_OFFSET+\offset(%rsp)
+       .endif
+       .endm
 
-       .macro SAVE_EXTRA_REGS offset=0
-       movq %r15, 0*8+\offset(%rsp)
-       movq %r14, 1*8+\offset(%rsp)
-       movq %r13, 2*8+\offset(%rsp)
-       movq %r12, 3*8+\offset(%rsp)
-       movq %rbp, 4*8+\offset(%rsp)
-       movq %rbx, 5*8+\offset(%rsp)
+       .macro SAVE_EXTRA_REGS_EXCEPT_RBX
+       SAVE_EXTRA_REGS rbx=0
        .endm
 
        .macro RESTORE_EXTRA_REGS offset=0
-       movq 0*8+\offset(%rsp), %r15
-       movq 1*8+\offset(%rsp), %r14
-       movq 2*8+\offset(%rsp), %r13
-       movq 3*8+\offset(%rsp), %r12
-       movq 4*8+\offset(%rsp), %rbp
-       movq 5*8+\offset(%rsp), %rbx
+       movq 0*8+PT_REGS_OFFSET+\offset(%rsp), %r15
+       movq 1*8+PT_REGS_OFFSET+\offset(%rsp), %r14
+       movq 2*8+PT_REGS_OFFSET+\offset(%rsp), %r13
+       movq 3*8+PT_REGS_OFFSET+\offset(%rsp), %r12
+       movq 4*8+PT_REGS_OFFSET+\offset(%rsp), %rbp
+       movq 5*8+PT_REGS_OFFSET+\offset(%rsp), %rbx
        .endm
 
        .macro ZERO_EXTRA_REGS
@@ -158,24 +213,24 @@ For 32-bit we have the following conventions - kernel is 
built with
 
        .macro RESTORE_C_REGS_HELPER rstor_rax=1, rstor_rcx=1, rstor_r11=1, 
rstor_r8910=1, rstor_rdx=1
        .if \rstor_r11
-       movq 6*8(%rsp), %r11
+       movq 6*8+PT_REGS_OFFSET(%rsp), %r11
        .endif
        .if \rstor_r8910
-       movq 7*8(%rsp), %r10
-       movq 8*8(%rsp), %r9
-       movq 9*8(%rsp), %r8
+       movq 7*8+PT_REGS_OFFSET(%rsp), %r10
+       movq 8*8+PT_REGS_OFFSET(%rsp), %r9
+       movq 9*8+PT_REGS_OFFSET(%rsp), %r8
        .endif
        .if \rstor_rax
-       movq 10*8(%rsp), %rax
+       movq 10*8+PT_REGS_OFFSET(%rsp), %rax
        .endif
        .if \rstor_rcx
-       movq 11*8(%rsp), %rcx
+       movq 11*8+PT_REGS_OFFSET(%rsp), %rcx
        .endif
        .if \rstor_rdx
-       movq 12*8(%rsp), %rdx
+       movq 12*8+PT_REGS_OFFSET(%rsp), %rdx
        .endif
-       movq 13*8(%rsp), %rsi
-       movq 14*8(%rsp), %rdi
+       movq 13*8+PT_REGS_OFFSET(%rsp), %rsi
+       movq 14*8+PT_REGS_OFFSET(%rsp), %rdi
        .endm
        .macro RESTORE_C_REGS
        RESTORE_C_REGS_HELPER 1,1,1,1,1
@@ -193,8 +248,8 @@ For 32-bit we have the following conventions - kernel is 
built with
        RESTORE_C_REGS_HELPER 1,0,0,1,1
        .endm
 
-       .macro REMOVE_PT_GPREGS_FROM_STACK addskip=0
-       subq $-(15*8+\addskip), %rsp
+       .macro FREE_ENTRY_FRAME addskip=0
+       subq $-(15*8+PT_REGS_OFFSET+\addskip), %rsp
        .endm
 
        .macro icebp
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 23e764c..ff92759 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -55,7 +55,7 @@ ENDPROC(native_usergs_sysret64)
 
 .macro TRACE_IRQS_IRETQ
 #ifdef CONFIG_TRACE_IRQFLAGS
-       bt      $9, EFLAGS(%rsp)                /* interrupts off? */
+       bt      $9, PT_REGS(EFLAGS)             /* interrupts off? */
        jnc     1f
        TRACE_IRQS_ON
 1:
@@ -88,7 +88,7 @@ ENDPROC(native_usergs_sysret64)
 .endm
 
 .macro TRACE_IRQS_IRETQ_DEBUG
-       bt      $9, EFLAGS(%rsp)                /* interrupts off? */
+       bt      $9, PT_REGS(EFLAGS)             /* interrupts off? */
        jnc     1f
        TRACE_IRQS_ON_DEBUG
 1:
@@ -164,22 +164,17 @@ GLOBAL(entry_SYSCALL_64_after_swapgs)
        pushq   $__USER_CS                      /* pt_regs->cs */
        pushq   %rcx                            /* pt_regs->ip */
        pushq   %rax                            /* pt_regs->orig_ax */
-       pushq   %rdi                            /* pt_regs->di */
-       pushq   %rsi                            /* pt_regs->si */
-       pushq   %rdx                            /* pt_regs->dx */
-       pushq   %rcx                            /* pt_regs->cx */
-       pushq   $-ENOSYS                        /* pt_regs->ax */
-       pushq   %r8                             /* pt_regs->r8 */
-       pushq   %r9                             /* pt_regs->r9 */
-       pushq   %r10                            /* pt_regs->r10 */
-       pushq   %r11                            /* pt_regs->r11 */
-       sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not 
saved */
+
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
+       SAVE_C_REGS_EXCEPT_RAX
+       movq    $-ENOSYS, PT_REGS(RAX)
 
        /*
         * If we need to do entry work or if we guess we'll need to do
         * exit work, go straight to the slow path.
         */
-       testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, 
ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
+       testl   $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
        jnz     entry_SYSCALL64_slow_path
 
 entry_SYSCALL_64_fastpath:
@@ -207,7 +202,7 @@ entry_SYSCALL_64_fastpath:
        call    *sys_call_table(, %rax, 8)
 .Lentry_SYSCALL_64_after_fastpath_call:
 
-       movq    %rax, RAX(%rsp)
+       movq    %rax, PT_REGS(RAX)
 1:
 
        /*
@@ -217,15 +212,16 @@ entry_SYSCALL_64_fastpath:
         */
        DISABLE_INTERRUPTS(CLBR_NONE)
        TRACE_IRQS_OFF
-       testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, 
SIZEOF_PTREGS)
+       testl   $_TIF_ALLWORK_MASK, TI_FLAGS(%rsp)
        jnz     1f
 
        LOCKDEP_SYS_EXIT
        TRACE_IRQS_ON           /* user mode is traced as IRQs on */
-       movq    RIP(%rsp), %rcx
-       movq    EFLAGS(%rsp), %r11
+       movq    PT_REGS(RIP), %rcx
+       movq    PT_REGS(EFLAGS), %r11
        RESTORE_C_REGS_EXCEPT_RCX_R11
-       movq    RSP(%rsp), %rsp
+       RESTORE_ENTRY_FRAME_POINTER
+       movq    PT_REGS(RSP), %rsp
        USERGS_SYSRET64
 
 1:
@@ -237,14 +233,14 @@ entry_SYSCALL_64_fastpath:
        TRACE_IRQS_ON
        ENABLE_INTERRUPTS(CLBR_NONE)
        SAVE_EXTRA_REGS
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        call    syscall_return_slowpath /* returns with IRQs disabled */
        jmp     return_from_SYSCALL_64
 
 entry_SYSCALL64_slow_path:
        /* IRQs are off. */
        SAVE_EXTRA_REGS
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        call    do_syscall_64           /* returns with IRQs disabled */
 
 return_from_SYSCALL_64:
@@ -255,8 +251,8 @@ return_from_SYSCALL_64:
         * Try to use SYSRET instead of IRET if we're returning to
         * a completely clean 64-bit userspace context.
         */
-       movq    RCX(%rsp), %rcx
-       movq    RIP(%rsp), %r11
+       movq    PT_REGS(RCX), %rcx
+       movq    PT_REGS(RIP), %r11
        cmpq    %rcx, %r11                      /* RCX == RIP */
        jne     opportunistic_sysret_failed
 
@@ -283,8 +279,8 @@ return_from_SYSCALL_64:
        cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
        jne     opportunistic_sysret_failed
 
-       movq    R11(%rsp), %r11
-       cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
+       movq    PT_REGS(R11), %r11
+       cmpq    %r11, PT_REGS(EFLAGS)           /* R11 == RFLAGS */
        jne     opportunistic_sysret_failed
 
        /*
@@ -306,7 +302,7 @@ return_from_SYSCALL_64:
 
        /* nothing to check for RSP */
 
-       cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
+       cmpq    $__USER_DS, PT_REGS(SS)         /* SS must match SYSRET */
        jne     opportunistic_sysret_failed
 
        /*
@@ -316,7 +312,7 @@ return_from_SYSCALL_64:
 syscall_return_via_sysret:
        /* rcx and r11 are already restored (see code above) */
        RESTORE_C_REGS_EXCEPT_RCX_R11
-       movq    RSP(%rsp), %rsp
+       movq    PT_REGS(RSP), %rsp
        USERGS_SYSRET64
 
 opportunistic_sysret_failed:
@@ -408,6 +404,7 @@ END(__switch_to_asm)
  * r12: kernel thread arg
  */
 ENTRY(ret_from_fork)
+       ALLOC_AND_SAVE_ENTRY_FRAME_POINTER
        movq    %rax, %rdi
        call    schedule_tail                   /* rdi: 'prev' task parameter */
 
@@ -415,7 +412,7 @@ ENTRY(ret_from_fork)
        jnz     1f
 
 2:
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        call    syscall_return_slowpath /* returns with IRQs disabled */
        TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
        SWAPGS
@@ -430,7 +427,7 @@ ENTRY(ret_from_fork)
         * calling do_execve().  Exit to userspace to complete the execve()
         * syscall.
         */
-       movq    $0, RAX(%rsp)
+       movq    $0, PT_REGS(RAX)
        jmp     2b
 END(ret_from_fork)
 
@@ -460,11 +457,12 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
        .macro interrupt func
        cld
-       ALLOC_PT_GPREGS_ON_STACK
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
        SAVE_C_REGS
        SAVE_EXTRA_REGS
 
-       testb   $3, CS(%rsp)
+       testb   $3, PT_REGS(CS)
        jz      1f
 
        /*
@@ -500,6 +498,7 @@ END(irq_entries_start)
        /* We entered an interrupt context - irqs are off: */
        TRACE_IRQS_OFF
 
+       addq    $PT_REGS_OFFSET, %rdi
        call    \func   /* rdi points to pt_regs */
        .endm
 
@@ -521,12 +520,12 @@ ret_from_intr:
        /* Restore saved previous stack */
        popq    %rsp
 
-       testb   $3, CS(%rsp)
+       testb   $3, PT_REGS(CS)
        jz      retint_kernel
 
        /* Interrupt came from user space */
 GLOBAL(retint_user)
-       mov     %rsp,%rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        call    prepare_exit_to_usermode
        TRACE_IRQS_IRETQ
        SWAPGS
@@ -537,7 +536,7 @@ retint_kernel:
 #ifdef CONFIG_PREEMPT
        /* Interrupts are off */
        /* Check if we need preemption */
-       bt      $9, EFLAGS(%rsp)                /* were interrupts off? */
+       bt      $9, PT_REGS(EFLAGS)             /* were interrupts off? */
        jnc     1f
 0:     cmpl    $0, PER_CPU_VAR(__preempt_count)
        jnz     1f
@@ -558,7 +557,7 @@ GLOBAL(restore_regs_and_iret)
        RESTORE_EXTRA_REGS
 restore_c_regs_and_iret:
        RESTORE_C_REGS
-       REMOVE_PT_GPREGS_FROM_STACK 8
+       FREE_ENTRY_FRAME 8
        INTERRUPT_RETURN
 
 ENTRY(native_iret)
@@ -699,11 +698,12 @@ ENTRY(\sym)
        pushq   $-1                             /* ORIG_RAX: no syscall to 
restart */
        .endif
 
-       ALLOC_PT_GPREGS_ON_STACK
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
 
        .if \paranoid
        .if \paranoid == 1
-       testb   $3, CS(%rsp)                    /* If coming from userspace, 
switch stacks */
+       testb   $3, PT_REGS(CS)                 /* If coming from userspace, 
switch stacks */
        jnz     1f
        .endif
        call    paranoid_entry
@@ -720,11 +720,11 @@ ENTRY(\sym)
        .endif
        .endif
 
-       movq    %rsp, %rdi                      /* pt_regs pointer */
+       leaq    PT_REGS_OFFSET(%rsp), %rdi      /* pt_regs pointer */
 
        .if \has_error_code
-       movq    ORIG_RAX(%rsp), %rsi            /* get error code */
-       movq    $-1, ORIG_RAX(%rsp)             /* no syscall to restart */
+       movq    PT_REGS(ORIG_RAX), %rsi         /* get error code */
+       movq    $-1, PT_REGS(ORIG_RAX)          /* no syscall to restart */
        .else
        xorl    %esi, %esi                      /* no error code */
        .endif
@@ -755,16 +755,15 @@ ENTRY(\sym)
 1:
        call    error_entry
 
-
-       movq    %rsp, %rdi                      /* pt_regs pointer */
-       call    sync_regs
+       movq    %rsp, %rdi                      /* stack frame + pt_regs */
+       call    sync_entry_frame
        movq    %rax, %rsp                      /* switch stack */
 
-       movq    %rsp, %rdi                      /* pt_regs pointer */
+       leaq    PT_REGS_OFFSET(%rsp), %rdi      /* pt_regs pointer */
 
        .if \has_error_code
-       movq    ORIG_RAX(%rsp), %rsi            /* get error code */
-       movq    $-1, ORIG_RAX(%rsp)             /* no syscall to restart */
+       movq    PT_REGS(ORIG_RAX), %rsi         /* get error code */
+       movq    $-1, PT_REGS(ORIG_RAX)          /* no syscall to restart */
        .else
        xorl    %esi, %esi                      /* no error code */
        .endif
@@ -922,7 +921,8 @@ ENTRY(xen_failsafe_callback)
        movq    8(%rsp), %r11
        addq    $0x30, %rsp
        pushq   $-1 /* orig_ax = -1 => not a system call */
-       ALLOC_PT_GPREGS_ON_STACK
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
        SAVE_C_REGS
        SAVE_EXTRA_REGS
        jmp     error_exit
@@ -1003,7 +1003,7 @@ paranoid_exit_no_swapgs:
 paranoid_exit_restore:
        RESTORE_EXTRA_REGS
        RESTORE_C_REGS
-       REMOVE_PT_GPREGS_FROM_STACK 8
+       FREE_ENTRY_FRAME 8
        INTERRUPT_RETURN
 END(paranoid_exit)
 
@@ -1016,7 +1016,7 @@ ENTRY(error_entry)
        SAVE_C_REGS 8
        SAVE_EXTRA_REGS 8
        xorl    %ebx, %ebx
-       testb   $3, CS+8(%rsp)
+       testb   $3, CS+8+PT_REGS_OFFSET(%rsp)
        jz      .Lerror_kernelspace
 
 .Lerror_entry_from_usermode_swapgs:
@@ -1049,12 +1049,12 @@ ENTRY(error_entry)
 .Lerror_kernelspace:
        incl    %ebx
        leaq    native_irq_return_iret(%rip), %rcx
-       cmpq    %rcx, RIP+8(%rsp)
+       cmpq    %rcx, RIP+8+PT_REGS_OFFSET(%rsp)
        je      .Lerror_bad_iret
        movl    %ecx, %eax                      /* zero extend */
-       cmpq    %rax, RIP+8(%rsp)
+       cmpq    %rax, RIP+8+PT_REGS_OFFSET(%rsp)
        je      .Lbstep_iret
-       cmpq    $.Lgs_change, RIP+8(%rsp)
+       cmpq    $.Lgs_change, RIP+8+PT_REGS_OFFSET(%rsp)
        jne     .Lerror_entry_done
 
        /*
@@ -1066,7 +1066,7 @@ ENTRY(error_entry)
 
 .Lbstep_iret:
        /* Fix truncated RIP */
-       movq    %rcx, RIP+8(%rsp)
+       movq    %rcx, RIP+8+PT_REGS_OFFSET(%rsp)
        /* fall through */
 
 .Lerror_bad_iret:
@@ -1182,29 +1182,20 @@ ENTRY(nmi)
        pushq   2*8(%rdx)       /* pt_regs->cs */
        pushq   1*8(%rdx)       /* pt_regs->rip */
        pushq   $-1             /* pt_regs->orig_ax */
-       pushq   %rdi            /* pt_regs->di */
-       pushq   %rsi            /* pt_regs->si */
-       pushq   (%rdx)          /* pt_regs->dx */
-       pushq   %rcx            /* pt_regs->cx */
-       pushq   %rax            /* pt_regs->ax */
-       pushq   %r8             /* pt_regs->r8 */
-       pushq   %r9             /* pt_regs->r9 */
-       pushq   %r10            /* pt_regs->r10 */
-       pushq   %r11            /* pt_regs->r11 */
-       pushq   %rbx            /* pt_regs->rbx */
-       pushq   %rbp            /* pt_regs->rbp */
-       pushq   %r12            /* pt_regs->r12 */
-       pushq   %r13            /* pt_regs->r13 */
-       pushq   %r14            /* pt_regs->r14 */
-       pushq   %r15            /* pt_regs->r15 */
 
-       /*
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
+       movq    (%rdx), %rdx
+       SAVE_C_REGS
+       SAVE_EXTRA_REGS
+
+/*
         * At this point we no longer need to worry about stack damage
         * due to nesting -- we're on the normal thread stack and we're
         * done with the NMI stack.
         */
 
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        movq    $-1, %rsi
        call    do_nmi
 
@@ -1214,6 +1205,7 @@ ENTRY(nmi)
         * do_nmi doesn't modify pt_regs.
         */
        SWAPGS
+       RESTORE_ENTRY_FRAME_POINTER
        jmp     restore_c_regs_and_iret
 
 .Lnmi_from_kernel:
@@ -1405,7 +1397,8 @@ end_repeat_nmi:
         * frame to point back to repeat_nmi.
         */
        pushq   $-1                             /* ORIG_RAX: no syscall to 
restart */
-       ALLOC_PT_GPREGS_ON_STACK
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
 
        /*
         * Use paranoid_entry to handle SWAPGS, but no need to use paranoid_exit
@@ -1417,7 +1410,7 @@ end_repeat_nmi:
        call    paranoid_entry
 
        /* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        movq    $-1, %rsi
        call    do_nmi
 
@@ -1430,7 +1423,7 @@ nmi_restore:
        RESTORE_C_REGS
 
        /* Point RSP at the "iret" frame. */
-       REMOVE_PT_GPREGS_FROM_STACK 6*8
+       FREE_ENTRY_FRAME 6*8
 
        /*
         * Clear "NMI executing".  Set DF first so that we can easily
@@ -1455,3 +1448,22 @@ ENTRY(ignore_sysret)
        mov     $-ENOSYS, %eax
        sysret
 END(ignore_sysret)
+
+#ifdef CONFIG_FRAME_POINTER
+/*
+ * This is a fake function which allows stack unwinders to detect entry stack
+ * frames.  The entry_frame_ret return address is stored on the stack after the
+ * frame pointer, immediately before pt_regs.
+ *
+ * Some unwinders like gdb are smart enough to report the function which
+ * contains the instruction *before* the return address on the stack.  More
+ * primitive unwinders like the kernel's will report the function containing
+ * the return address itself.  So the address needs to be in the middle of the
+ * function in order to satisfy them both.
+ */
+ENTRY(entry_frame)
+       nop
+GLOBAL(entry_frame_ret)
+       nop
+ENDPROC(entry_frame)
+#endif /* CONFIG_FRAME_POINTER */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index e1721da..31b4f63c 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -309,21 +309,16 @@ ENTRY(entry_INT80_compat)
 
        /* Construct struct pt_regs on stack (iret frame is already on stack) */
        pushq   %rax                    /* pt_regs->orig_ax */
-       pushq   %rdi                    /* pt_regs->di */
-       pushq   %rsi                    /* pt_regs->si */
-       pushq   %rdx                    /* pt_regs->dx */
-       pushq   %rcx                    /* pt_regs->cx */
-       pushq   $-ENOSYS                /* pt_regs->ax */
-       pushq   $0                      /* pt_regs->r8  = 0 */
-       pushq   $0                      /* pt_regs->r9  = 0 */
-       pushq   $0                      /* pt_regs->r10 = 0 */
-       pushq   $0                      /* pt_regs->r11 = 0 */
-       pushq   %rbx                    /* pt_regs->rbx */
-       pushq   %rbp                    /* pt_regs->rbp */
-       pushq   %r12                    /* pt_regs->r12 */
-       pushq   %r13                    /* pt_regs->r13 */
-       pushq   %r14                    /* pt_regs->r14 */
-       pushq   %r15                    /* pt_regs->r15 */
+       ALLOC_ENTRY_FRAME
+       SAVE_ENTRY_FRAME_POINTER
+       movq    $-ENOSYS, %rax
+       xorq    %r8, %r8
+       xorq    %r9, %r9
+       xorq    %r10, %r10
+       xorq    %r11, %r11
+       SAVE_C_REGS
+       SAVE_EXTRA_REGS
+
        cld
 
        /*
@@ -332,7 +327,7 @@ ENTRY(entry_INT80_compat)
         */
        TRACE_IRQS_OFF
 
-       movq    %rsp, %rdi
+       leaq    PT_REGS_OFFSET(%rsp), %rdi
        call    do_int80_syscall_32
 .Lsyscall_32_done:
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index c3496619..bba7ece 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -70,7 +70,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, 
long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5df831e..f3c7922 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -362,24 +362,15 @@ early_idt_handler_common:
        incl early_recursion_flag(%rip)
 
        /* The vector number is currently in the pt_regs->di slot. */
-       pushq %rsi                              /* pt_regs->si */
-       movq 8(%rsp), %rsi                      /* RSI = vector number */
-       movq %rdi, 8(%rsp)                      /* pt_regs->di = RDI */
-       pushq %rdx                              /* pt_regs->dx */
-       pushq %rcx                              /* pt_regs->cx */
-       pushq %rax                              /* pt_regs->ax */
-       pushq %r8                               /* pt_regs->r8 */
-       pushq %r9                               /* pt_regs->r9 */
-       pushq %r10                              /* pt_regs->r10 */
-       pushq %r11                              /* pt_regs->r11 */
-       pushq %rbx                              /* pt_regs->bx */
-       pushq %rbp                              /* pt_regs->bp */
-       pushq %r12                              /* pt_regs->r12 */
-       pushq %r13                              /* pt_regs->r13 */
-       pushq %r14                              /* pt_regs->r14 */
-       pushq %r15                              /* pt_regs->r15 */
-
-       cmpq $14,%rsi           /* Page fault? */
+       ALLOC_ENTRY_FRAME addskip=-8
+       SAVE_ENTRY_FRAME_POINTER
+       movq %rbx, PT_REGS(RBX) /* save rbx */
+       movq PT_REGS(RDI), %rbx /* rbx = vector number */
+
+       SAVE_C_REGS
+       SAVE_EXTRA_REGS_EXCEPT_RBX
+
+       cmpq $14, %rbx          /* page fault? */
        jnz 10f
        GET_CR2_INTO(%rdi)      /* Can clobber any volatile register if pv */
        call early_make_pgtable
@@ -387,7 +378,9 @@ early_idt_handler_common:
        jz 20f                  /* All good */
 
 10:
-       movq %rsp,%rdi          /* RDI = pt_regs; RSI is already trapnr */
+       movq %rsp, %rdi
+       addq $PT_REGS_OFFSET, %rdi /* rdi = pt_regs */
+       movq %rbx, %rsi            /* rsi = vector number */
        call early_fixup_exception
 
 20:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 00f03d8..6954e74 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -514,22 +514,34 @@ exit:
 NOKPROBE_SYMBOL(do_int3);
 
 #ifdef CONFIG_X86_64
+
+struct entry_frame {
+#ifdef CONFIG_FRAME_POINTER
+       void *fp;
+       void *ret_addr;
+#endif
+       struct pt_regs regs;
+};
+
 /*
  * Help handler running on IST stack to switch off the IST stack if the
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct entry_frame *sync_entry_frame(struct entry_frame *old)
 {
-       struct pt_regs *regs = task_pt_regs(current);
-       *regs = *eregs;
-       return regs;
+       struct entry_frame *new = container_of(task_pt_regs(current),
+                                              struct entry_frame, regs);
+
+       *new = *old;
+       return new;
 }
-NOKPROBE_SYMBOL(sync_regs);
+NOKPROBE_SYMBOL(sync_entry_frame);
 
 struct bad_iret_stack {
        void *error_entry_ret;
-       struct pt_regs regs;
+       struct entry_frame frame;
 };
 
 asmlinkage __visible notrace
@@ -544,15 +556,15 @@ struct bad_iret_stack *fixup_bad_iret(struct 
bad_iret_stack *s)
         */
        struct bad_iret_stack *new_stack =
                container_of(task_pt_regs(current),
-                            struct bad_iret_stack, regs);
+                            struct bad_iret_stack, frame.regs);
 
        /* Copy the IRET target to the new stack. */
-       memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
+       memmove(&new_stack->frame.regs.ip, (void *)s->frame.regs.sp, 5*8);
 
        /* Copy the remainder of the stack from the current stack. */
-       memmove(new_stack, s, offsetof(struct bad_iret_stack, regs.ip));
+       memmove(new_stack, s, offsetof(struct bad_iret_stack, frame.regs.ip));
 
-       BUG_ON(!user_mode(&new_stack->regs));
+       BUG_ON(!user_mode(&new_stack->frame.regs));
        return new_stack;
 }
 NOKPROBE_SYMBOL(fixup_bad_iret);
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Reply via email to