Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2 == netdev)

Dominik Brodowski Fri, 16 Mar 2018 13:17:19 -0700

On Fri, Mar 16, 2018 at 02:30:21PM -0400, David Miller wrote:
> From: Dominik Brodowski <li...@dominikbrodowski.net>
> Date: Fri, 16 Mar 2018 18:05:52 +0100
> 
> > The rationale of this change is described in patch 1 of part 1[*] as 
> > follows:
> > 
> >     The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
> >     and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
> >     through kernel entry points, but not from the kernel itself. This
> >     will allow cleanups and optimizations to the entry paths *and* to
> >     the parts of the kernel code which currently need to pretend to be
> >     userspace in order to make use of syscalls.
> > 
> > At present, these patches are based on v4.16-rc5; there is one trivial
> > conflict against net-next. Dave, I presume that you prefer to take them
> > through net-next? If you want to, I can re-base them against net-next.
> > If you prefer otherwise, though, I can route them as part of my whole
> > syscall series.
> 
> So the transformations themselves are relatively trivial, so on that
> aspect I don't have any problems with these changes.


Thank you for your fast feedback.

> But overall I have to wonder.
> 
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Well, sort of. Currently, x86-64 decodes all six registers unconditionally:

                regs->ax = sys_call_table[nr](
                        regs->di, regs->si, regs->dx,
                        regs->r10, regs->r8, regs->r9);

so that in do_syscall_64(), we have to get six parameters from the
stack:

        mov    0x38(%rbx),%rcx
        mov    0x60(%rbx),%rdx
        mov    0x68(%rbx),%rsi
        mov    0x70(%rbx),%rdi
        mov    0x40(%rbx),%r9
        mov    0x48(%rbx),%r8

Instead, the aim is to do

        regs->ax = sys_call_table[nr](regs)

... which results in just a register rename operation:

        mov    %rbp,%rdi
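
As a rough userspace sketch of the two dispatch styles (this is not the
kernel code; struct regs_sketch, the add2_* handlers and both tables are
made-up names for illustration only), the difference looks roughly like
this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct pt_regs. */
struct regs_sketch {
	unsigned long di, si, dx, r10, r8, r9, ax;
};

/* Current style: every handler takes six already-decoded arguments. */
typedef long (*sys_fn6_t)(unsigned long, unsigned long, unsigned long,
			  unsigned long, unsigned long, unsigned long);

/* Intended style: every handler takes only the register block. */
typedef long (*sys_fn_regs_t)(const struct regs_sketch *);

/* Illustrative handler that only cares about its first two arguments. */
static long add2_six(unsigned long a, unsigned long b, unsigned long c,
		     unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

/* Same handler, decoding exactly the two registers it needs. */
static long add2_regs(const struct regs_sketch *regs)
{
	return (long)(regs->di + regs->si);
}

static const sys_fn6_t table6[] = { add2_six };
static const sys_fn_regs_t table_regs[] = { add2_regs };

/* The dispatcher loads all six values, whatever the handler uses. */
static void dispatch_six(struct regs_sketch *regs, size_t nr)
{
	assert(nr < sizeof(table6) / sizeof(table6[0]));
	regs->ax = (unsigned long)table6[nr](regs->di, regs->si, regs->dx,
					     regs->r10, regs->r8, regs->r9);
}

/* The dispatcher forwards one pointer; decoding moves into the handler. */
static void dispatch_regs(struct regs_sketch *regs, size_t nr)
{
	assert(nr < sizeof(table_regs) / sizeof(table_regs[0]));
	regs->ax = (unsigned long)table_regs[nr](regs);
}
```

Both dispatchers produce the same result; the second one just leaves it
to each handler to load only the registers it actually uses.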

> But in situations where you split out the system call function
> completely into one of these "helpers", the compiler is going
> to have two choices:
> 
> 1) Expand the helper into the syscall function inline, thus we end up
>    with two copies of the function.

That's only sensible for very short stubs, which just call another function
(e.g. __compat_sys_sendmsg()).

> 2) Call the helper from the syscall function.  Well, then the compiler
>    will need to pop the syscall-obtained arguments from the registers
>    onto the stack.
> 
> So this doesn't seem like such a total win to me.
> 
> Maybe you can explain things better to ease my concerns.

For example, for sys_recv() and sys_recvfrom(), once everything is
complete, this results in:

sys_x86_64_recv:
        callq <__fentry__>
        /* decode struct pt_regs for exactly those parameters
         * we care about
         */
        mov    0x38(%rdi),%rcx
        xor    %r9d,%r9d
        xor    %r8d,%r8d
        mov    0x60(%rdi),%rdx
        mov    0x68(%rdi),%rsi
        mov    0x70(%rdi),%rdi

        /* call __sys_recvfrom */
        callq  <__sys_recvfrom>

        /* cleanup and return */
        cltq
        retq

That fetches only four entries from the stack and clears two registers;
sys_x86_64_recvfrom is similar (six movs from the stack, one register
rename mov, no xor).
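
In C terms, the stubs amount to something like the following userspace
sketch. It is only an illustration under assumptions: struct regs_sketch,
stub_recv(), stub_recvfrom() and helper_recvfrom() are hypothetical
stand-ins (the last one for __sys_recvfrom()), and the helper merely
records its arguments instead of receiving anything.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the argument registers in struct pt_regs. */
struct regs_sketch {
	unsigned long di, si, dx, r10, r8, r9;
};

/* Records what the helper was called with, for illustration only. */
struct recvfrom_args {
	int fd;
	void *buf;
	size_t len;
	unsigned int flags;
	void *addr;
	int *addr_len;
};

static struct recvfrom_args last_call;

/* Stand-in for __sys_recvfrom(): stores its arguments, returns len. */
static long helper_recvfrom(int fd, void *buf, size_t len,
			    unsigned int flags, void *addr, int *addr_len)
{
	last_call = (struct recvfrom_args){ fd, buf, len, flags,
					    addr, addr_len };
	return (long)len;
}

/* recv(): decode four registers, pass NULL for the last two arguments. */
static long stub_recv(const struct regs_sketch *regs)
{
	assert(regs != NULL);	/* sketch-only sanity check */
	return helper_recvfrom((int)regs->di, (void *)regs->si,
			       (size_t)regs->dx, (unsigned int)regs->r10,
			       NULL, NULL);
}

/* recvfrom(): decode all six registers. */
static long stub_recvfrom(const struct regs_sketch *regs)
{
	assert(regs != NULL);	/* sketch-only sanity check */
	return helper_recvfrom((int)regs->di, (void *)regs->si,
			       (size_t)regs->dx, (unsigned int)regs->r10,
			       (void *)regs->r8, (int *)regs->r9);
}
```

The two NULL arguments in stub_recv() correspond to the two xor
instructions in the listing above; stub_recvfrom() reads all six slots
instead.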

__sys_recvfrom() then does the actual work, starting with pushing some
register content out of the way and moving registers around, more or less
what SyS_recvfrom() does today.

So the result is nothing spectacular or unusual, but roughly equivalent
to, and possibly even shorter than, the current code path.

Thanks,
        Dominik
