* Ingo Molnar <mi...@kernel.org> wrote:

> Doing that would give us four (theoretical) performance advantages:
> 
>   - No implicit irq disabling overhead when the syscall instruction is
>     executed: we could change MSR_SYSCALL_MASK from 0x47700 to
>     0x47500 (i.e. drop X86_EFLAGS_IF from it), which removes the
>     implicit CLI on syscall entry.
> 
>   - No explicit irq enabling overhead via ENABLE_INTERRUPTS() [STI] in
>     system_call.
> 
>   - No explicit irq disabling overhead in the ret_from_sys_call fast 
>     path, i.e. no DISABLE_INTERRUPTS() [CLI].
> 
>   - No implicit irq enabling overhead in ret_from_sys_call's 
>     USERGS_SYSRET64: the SYSRETQ instruction would not have to 
>     re-enable irqs as the user-space IF in R11 would match that of the 
>     current IF.
> 
> Whether that's an actual performance win in practice as well needs
> to be measured, but I'd be (very!) shocked if it wasn't in the 20+
> cycles range, which is absolutely huge in terms of system_call
> optimizations.

So, to roughly quantify the potential savings on the 64-bit system call
entry fast path, I tried to simulate the effects in user-space with a
'best case' simulation: a PUSHFQ+CLI+STI ... CLI+POPFQ simulated
syscall sequence (with the beginning and end far enough apart not to
interact), on Intel family 6 model 62 CPUs (slightly dated but still
relevant):

with irq disabling/enabling:

  new best speed: 2710739 loops (158 cycles per iteration).

fully preemptible:

  new best speed: 3389503 loops (113 cycles per iteration).

Now that's a difference of about 45 cycles, but admittedly the cost
depends very much on how we save and restore flags, and on how
intelligently the CPU can hide the irq disabling and restoration
amongst the other processing it has to do on entry/exit, which it can
do pretty well in a number of important cases.

I don't think I can simulate the real thing in user-space:

  - The hardest bit to simulate is SYSRET: POPFQ is expensive, but
    SYSRET might be able to 'cheat' on the enabling side.

  - I _think_ it cannot cheat, because user-space might have come in
    with irqs disabled itself (we still have iopl(3)), so it's
    effectively a POPFQ-equivalent instruction.

  - OTOH the CPU might be able to hide the latency of the POPFQ 
    amongst other SYSRET return work (which is significant) - so this 
    is really hard to estimate.

So "we'll have to try it to see it" :-/ [and maybe Intel knows.]

But even if just half of the suspected savings can be realized, a
20-cycle speedup is very tempting IMHO, given that our 64-bit system
calls cost around 110 cycles these days.
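
A back-of-the-envelope check, using only the best-case user-space figures from the simulation above:

```latex
\Delta = 158 - 113 = 45~\text{cycles}, \qquad
\tfrac{1}{2}\Delta \approx 22~\text{cycles}, \qquad
\frac{22}{110} \approx 0.20
```

That is roughly a fifth of a ~110-cycle system call, assuming (unverified) that the user-space simulation carries over to the real entry/exit path.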

Yes, it's scary, crazy, potentially fragile, might not even work,
etc., but it's also very tempting nevertheless ...

So I'll try to write a prototype of this, just to be able to get some 
numbers - but shoot me down if you think I'm being stupid and if the 
concept is an absolute non-starter to begin with!

Thanks,

        Ingo