* Ingo Molnar <mi...@kernel.org> wrote:

> Doing that would give us four (theoretical) performance advantages:
>
>   - No implicit irq disabling overhead when the syscall instruction is
>     executed: we could change MSR_SYSCALL_MASK from 0xc0000084 to
>     0xc0000284, which removes the implicit CLI on syscall entry.
>
>   - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
>     system_call.
>
>   - No explicit irq disabling overhead in the ret_from_sys_call fast
>     path, i.e. no DISABLE_INTERRUPTS() [CLI].
>
>   - No implicit irq enabling overhead in ret_from_sys_call's
>     USERGS_SYSRET64: the SYSRETQ instruction would not have to
>     re-enable irqs as the user-space IF in R11 would match that of the
>     current IF.
>
> whether that's an actual performance win in practice as well needs
> to be measured, but I'd be (very!) shocked if it wasn't in the 20+
> cycles range: which is absolutely huge in terms of system_call
> optimizations.

So just to quantify the potential 64-bit system call entry fast path
performance savings a bit, I tried to simulate the effects in user-space
via a 'best case' simulation, where we do a PUSHFQ+CLI+STI ... CLI+POPFQ
simulated syscall sequence (beginning and end sufficiently far from each
other to not be interacting), on Intel family 6 model 62 CPUs (slightly
dated but still relevant):

  with irq disabling/enabling:

    new best speed: 2710739 loops (158 cycles per iteration).

  fully preemptible:

    new best speed: 3389503 loops (113 cycles per iteration).

Now that's a difference of about 45 cycles, but admittedly the cost very
much depends on how we save and restore flags, and on how intelligently
the CPU can hide the irq disabling and the restoration amongst the other
processing it has to do on entry/exit, which it can do pretty well in a
number of important cases.

I don't think I can simulate the real thing in user-space:

  - The hardest bit to simulate is SYSRET: POPFQ is expensive, but
    SYSRET might be able to 'cheat' on the enabling side - I _think_ it
    cannot cheat because user-space might have come in with irqs
    disabled itself (we still have iopl(3)), so it's a POPFQ equivalent
    instruction.

  - OTOH the CPU might be able to hide the latency of the POPFQ amongst
    other SYSRET return work (which is significant) - so this is really
    hard to estimate.

So "we'll have to try it to see it" :-/ [and maybe Intel knows.]

But even if just half of the suspected savings can be realized: a 20
cycle speedup is very tempting IMHO, given that our 64-bit system calls
cost around 110 cycles these days.

Yes, it's scary, crazy, potentially fragile, might not even work, etc. -
but it's also very tempting nevertheless ...

So I'll try to write a prototype of this, just to be able to get some
numbers - but shoot me down if you think I'm being stupid and if the
concept is an absolute non-starter to begin with!

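For anyone who wants to redo the arithmetic on the figures above, here is
a minimal back-of-the-envelope sketch. It only restates the numbers from
this mail (158 and 113 cycles per iteration, a roughly 110 cycle system
call, a hoped-for 20 cycle saving); the helper names are hypothetical and
it does not run the PUSHFQ/CLI/STI/POPFQ loop itself, which would need
iopl(3) privileges.

```python
# Back-of-the-envelope arithmetic for the figures quoted in this mail.
# All names are hypothetical helpers, not part of any kernel tool.

CYCLES_WITH_IRQ_TOGGLING = 158   # PUSHFQ+CLI+STI ... CLI+POPFQ loop
CYCLES_FULLY_PREEMPTIBLE = 113   # same loop without the irq flag work
SYSCALL_COST_CYCLES = 110        # rough 64-bit syscall cost cited above
HOPED_FOR_SAVING = 20            # "half of the suspected savings"


def simulated_saving(with_irq, without_irq):
    """Cycles saved per iteration in the user-space simulation."""
    return with_irq - without_irq


def relative_speedup(saving, baseline):
    """Saving expressed as a fraction of the baseline cost."""
    return saving / baseline


best_case = simulated_saving(CYCLES_WITH_IRQ_TOGGLING,
                             CYCLES_FULLY_PREEMPTIBLE)
print(f"best-case simulated saving: {best_case} cycles")
print(f"{HOPED_FOR_SAVING} cycles off a {SYSCALL_COST_CYCLES}-cycle syscall: "
      f"{relative_speedup(HOPED_FOR_SAVING, SYSCALL_COST_CYCLES):.0%}")
```

The second figure is only as good as the 110 cycle estimate, and the real
saving depends on how well the CPU hides the flag handling, as noted
above.
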
Thanks,

	Ingo