On Fri, 24 Nov 2017, Ingo Molnar wrote:
> From: Andy Lutomirski <l...@kernel.org>
>
> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> single register (except FLAGS), including RSP, live.  It somehow needs
> to set RSP to point to a valid stack, which means it needs to save the
> user RSP somewhere and find its own stack pointer.  The canonical way
> to do this is with SWAPGS, which lets us access percpu data using the
> %gs prefix.
>
> With KAISER-like pagetable switching, this is problematic.  Without a
> scratch register, switching CR3 is impossible, so %gs-based percpu
> memory would need to be mapped in the user pagetables.  Doing that
> without information leaks is difficult or impossible.
>
> Instead, use a different sneaky trick.  Map a copy of the first part
> of the SYSCALL asm at a different address for each CPU.  Now RIP
> varies depending on the CPU, so we can use RIP-relative memory access
> to access percpu memory.  By putting the relevant information (one
> scratch slot and the stack address) at a constant offset relative to
> RIP, we can make SYSCALL work without relying on %gs.

Smart!

> A nice thing about this approach is that we can easily switch it on
> and off if we want pagetable switching to be configurable.
>
> The compat variant of SYSCALL doesn't have this problem in the first
> place -- there are plenty of scratch registers, since we don't care
> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> at all.
>
> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> the same to this.
>
> Signed-off-by: Andy Lutomirski <l...@kernel.org>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>
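
For anyone following the thread, the RIP-relative idea quoted above can be
modelled in plain user-space C. This is only a sketch under stated
assumptions, not the code from the patch: struct trampoline, NR_CPUS,
TEXT_SIZE, entry_rip() and syscall_entry() are hypothetical names, the
stack addresses are made-up values, and the real entry path is assembly
that uses RIP-relative addressing rather than pointer arithmetic. The
model only shows why a per-CPU copy of the entry text, with the scratch
slot and stack address at a constant offset from it, lets the handler
find its percpu data from the entry address alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_CPUS   4   /* hypothetical CPU count for this model */
#define TEXT_SIZE 64  /* hypothetical size of the copied entry stub */

/* One per-CPU trampoline: a copy of the entry text followed by its data. */
struct trampoline {
	unsigned char text[TEXT_SIZE]; /* stands in for the copied entry asm */
	uint64_t rsp_scratch;          /* the one scratch slot */
	uint64_t stack_top;            /* this CPU's kernel stack address */
};

static struct trampoline trampolines[NR_CPUS];

/* Fill in every CPU's copy. The stack addresses are invented values. */
static void setup_trampolines(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		memset(trampolines[cpu].text, 0x90, TEXT_SIZE);
		trampolines[cpu].rsp_scratch = 0;
		trampolines[cpu].stack_top = 0x1000u * (uint64_t)(cpu + 1);
	}
}

/* Address at which the given CPU would enter its own copy of the stub. */
static unsigned char *entry_rip(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return trampolines[cpu].text;
}

/*
 * Model of the entry path: no CPU number and no %gs are consulted.
 * Everything is located at a constant offset from "rip".
 * Saves the user RSP in the scratch slot and returns the kernel stack.
 */
static uint64_t syscall_entry(unsigned char *rip, uint64_t user_rsp)
{
	struct trampoline *t =
		(struct trampoline *)(rip - offsetof(struct trampoline, text));

	t->rsp_scratch = user_rsp;
	return t->stack_top;
}
```

After calling setup_trampolines(), passing entry_rip(0) and entry_rip(2)
to syscall_entry() returns different stack addresses and writes to
different scratch slots, although the function body is identical for
every CPU.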