On Thu, Apr 2, 2015 at 6:37 AM, Ingo Molnar <mi...@kernel.org> wrote: > > * Denys Vlasenko <dvlas...@redhat.com> wrote: > >> On 04/02/2015 11:07 AM, Ingo Molnar wrote: >> > >> > * Andy Lutomirski <l...@kernel.org> wrote: >> > >> >> When I wrote the opportunistic SYSRET code, I missed an important >> >> difference between SYSRET and IRET. Both instructions are capable >> >> of setting EFLAGS.TF, but they behave differently when doing so. >> >> IRET will not issue a #DB trap after execution when it sets TF This >> >> is critical -- otherwise you'd never be able to make forward >> >> progress when returning to userspace. SYSRET, on the other hand, >> >> will trap with #DB immediately after returning to CPL3, and the next >> >> instruction will never execute. >> >> >> >> This breaks anything that opportunistically SYSRETs to a user >> >> context with TF set. For example, running this code with TF set and >> >> a SIGTRAP handler loaded never gets past post_nop. >> >> >> >> extern unsigned char post_nop[]; >> >> asm volatile ("pushfq\n\t" >> >> "popq %%r11\n\t" >> >> "nop\n\t" >> >> "post_nop:" >> >> : : "c" (post_nop) : "r11"); >> >> >> >> In my defense, I can't find this documented in the AMD or Intel >> >> manual. >> >> >> >> Fix it by using IRET to restore TF. >> >> >> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when >> >> possible >> >> Signed-off-by: Andy Lutomirski <l...@kernel.org> >> >> --- >> >> >> >> This affects 4.0-rc as well as -tip. A full test case lives here: >> >> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/ >> >> >> >> It's called single_step_syscall_64. >> >> >> >> On Intel systems, the 32-bit version of that test fails for unrelated >> >> reasons, but that's not a regression, and fixing it will be much more >> >> intrusive. >> >> >> >> Changes from v1: >> >> - Remove mention of testl from changelog. >> >> - Improve comment per Denys' suggestion. >> >> >> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++- >> >> 1 file changed, 15 insertions(+), 1 deletion(-) >> >> >> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S >> >> index 750c6efcb718..537716380959 100644 >> >> --- a/arch/x86/kernel/entry_64.S >> >> +++ b/arch/x86/kernel/entry_64.S >> >> @@ -715,7 +715,21 @@ retint_swapgs: /* return to >> >> user-space */ >> >> cmpq %r11,EFLAGS(%rsp) /* R11 == RFLAGS */ >> >> jne opportunistic_sysret_failed >> >> >> >> - testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */ >> >> + /* >> >> + * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET, >> >> + * restoring TF results in a trap from userspace immediately after >> >> + * SYSRET. This would cause an infinite loop whenever #DB happens >> >> + * with register state that satisfies the opportunistic SYSRET >> >> + * conditions. For example, single-stepping this user code: >> >> + * >> >> + * movq $stuck_here,%rcx >> >> + * pushfq >> >> + * popq %r11 >> >> + * stuck_here: >> >> + * >> >> + * would never get past stuck_here. >> >> + */ >> >> + testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11 >> >> jnz opportunistic_sysret_failed >> > >> > So I merged this as it's an obvious bugfix, but in hindsight I'm >> > really uneasy about the whole opportunistic SYSRET concept: it appears >> > that the chance that %rcx matches return-%rip is astronomical - this >> > is why this bug wasn't noticed live so far. >> > >> > So should we really be doing this? >> >> Andy does this not for the off-chance that userspace's RCX is equal >> to return address and R11 == RFLAGS. The chances of that are >> astronomically small. >> >> This code path triggers when ptrace/audit/seccomp is active. Instead >> of torturing ourselves trying to not divert into IRET return, now >> code is steered that way. But then immediately before actual IRET, >> we check again: "do we really need IRET?" IOW "did ptrace really >> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of >> cases will not be true. > > I keep forgetting about that, my test systems have the audit muck > turned off ;-) > > Fair enough - and it's sensible to share the IRET path between > interrupts and complex-return system calls, even though the check > is unnecessary overhead for the pure interrupt return path...
Maybe we could reintroduce TIF_IRET for this purpose instead of (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip it for interrupts. -- Brian Gerst -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/