On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <pet...@infradead.org> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>>  - The usual sequence of SWAPGS followed by two GS-relative accesses to
>>    set up RSP is somewhat slow because the GS-relative accesses need
>>    to wait for SWAPGS to finish. The trampoline approach allows
>>    RIP-relative accesses to set up RSP, which avoids the stall.
>>
>>  - The trampoline avoids any percpu access before CR3 is set up,
>>    which means that no percpu memory needs to be mapped in the user
>>    page tables. This prevents using Meltdown to read any percpu memory
>>    outside the cpu_entry_area and prevents using timing leaks
>>    to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up. The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text. With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless. It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless. As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU. Then we could use a direct jump to rejoin the normal
>> entry path.
>
> Can we have a few words on why this solution and not this alternative? I
> mean, you raise the possibility, but then surely you chose not to
> implement that. Might as well share that with us.
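(Background for anyone following along: the code in question is just the
handful of instructions that stash the user RSP, switch CR3, and load the
kernel RSP. Very roughly -- this is a simplified sketch, not the literal
entry_64.S code, and the symbol names are made up:

	/* Non-trampoline path: SWAPGS, then GS-relative (percpu) accesses.
	 * Those accesses have to wait for SWAPGS to finish, and the stash
	 * happens before the CR3 switch, which is why the percpu TSS has
	 * to be mapped in the user page tables.
	 */
	swapgs
	movq	%rsp, %gs:tss_sp2_scratch	/* stash user RSP */
	/* ... switch CR3 to the kernel page tables ... */
	movq	%gs:tss_sp1, %rsp		/* load kernel RSP */

	/* Trampoline path: the trampoline text is aliased per CPU in the
	 * cpu_entry_area, so RIP-relative addressing finds this CPU's
	 * scratch slot and TSS alias without touching percpu memory.
	 */
	swapgs
	movq	%rsp, rsp_scratch(%rip)		/* stash user RSP */
	/* ... switch CR3 to the kernel page tables ... */
	movq	tss_sp1_alias(%rip), %rsp	/* load kernel RSP */
	/* ... then build the frame and take a (retpolined) indirect jump
	 * back to the normal entry text, since x86 has no 64-bit direct
	 * jump ... */
)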
I can give some pros and cons. With the other approach:

 - We avoid a pipeline stall.

 - We execute from an extra page and read from another extra page during
   the syscall. (The latter is because we need to use a relative
   addressing mode to find sp1 -- it's the same *cacheline* we'd use
   anyway, but we're accessing it using an alias, so it's an extra TLB
   entry.)

 - We use more memory. This would be one page per CPU for a simple
   implementation and 64-ish bytes per CPU or one page per node for a
   more complex implementation.

 - More code complexity.

I'm not convinced this is a good tradeoff.
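For completeness, to make the "alias" and "extra page" points concrete:
each per-CPU copy in the alternative would look something like this --
again just a sketch with made-up labels, with the final rel32 jump patched
per copy since every copy lives at a different address within 2G of the
entry text:

	this_cpu_trampoline:				/* one copy per CPU */
		swapgs
		movq	%rsp, this_cpu_scratch(%rip)	/* alias of this CPU's scratch slot */
		/* ... switch CR3 to the kernel page tables ... */
		movq	this_cpu_sp1(%rip), %rsp	/* alias of this CPU's tss.sp1: same
							 * cacheline as the percpu mapping, but
							 * a different virtual address, hence
							 * the extra TLB entry */
		jmp	entry_SYSCALL_64_continue	/* rel32 direct jump, no retpoline;
							 * target label is illustrative */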