On Fri, Mar 16, 2018 at 11:30 AM, David Miller <da...@davemloft.net> wrote:
>
> I imagine one of the things you'd like to do is declare that syscall
> entries use a different (better) argument passing scheme.  For
> example, passing values in registers instead of on the stack.

Actually, it's almost exactly the reverse.

On x86-64, we'd like to just pass the 'struct pt_regs *' pointer, and
have the sys_xyz() function itself just pick out the arguments it
needs from there.

That has a few reasons for it:

 - we can clear all registers at system call entry, which helps defeat
   some of the "pass seldom used register with user-controlled value
   that survives deep into the callchain" things that people used to
   leak information

 - we can streamline the low-level system call code, which needs to
   pass around 'struct pt_regs *' anyway, and the system call only
   picks up the values it actually needs

 - it's really quite easy(*) to just make the SYSCALL_DEFINEx() macros
   just do it all with a wrapper inline function

but it fundamentally means that you *cannot* call 'sys_xyz()' from
within the kernel, unless you then do it with something crazy like

    struct pt_regs myregs;
    ... fill in the right registers for this architecture _if_ this
        architecture uses ptregs ..
    sys_xyz(&myregs);

which I somehow really doubt you want to do in the networking code.

Now, I did do one version that just created two entrypoints for every
single system call - the "kernel version" and the "real" system call
version. That sucks, because you have two choices:

 - either pointlessly generate extra code for the 200+ system calls
   that are *not* used by the kernel

 - or let gcc just merge the two, and make code generation suck where
   the real system call just loads the registers and jumps to the
   common code.

That second option really does suck, because if you let the compiler
just generate the _single_ system call, it will do the "load actual
value from ptregs" much more nicely, and only when it needs it, and
schedules it all into the system call code.

So just making the rule be: "you mustn't call the SYSCALL_DEFINEx()
functions from anything but the system call code" really makes
everything better.

Then you only need to fix up the *handful* or so of system calls that
actually have in-kernel callers.

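To make the 'struct pt_regs *' scheme described above a bit more
concrete, here is a minimal userspace sketch. Everything in it is an
assumption made for illustration: the cut-down 'struct pt_regs', the
MY_SYSCALL_DEFINE2() macro, and the toy sys_add() call are hypothetical
stand-ins, not the kernel's actual definitions. It shows a wrapper that
picks its arguments out of the register struct, and the awkward
register-filling an in-kernel caller would be left with.

```c
/* Hypothetical stand-in for the x86-64 'struct pt_regs'; only the
 * argument registers are modelled here. */
struct pt_regs {
	unsigned long di, si, dx, r10, r8, r9;
};

/* Hypothetical two-argument variant of a SYSCALL_DEFINEx()-style macro.
 * It emits a typed inline body plus the only public entry point, which
 * takes 'struct pt_regs *' and picks out just the registers it needs. */
#define MY_SYSCALL_DEFINE2(name, t1, a1, t2, a2)			\
	static inline long __do_sys_##name(t1 a1, t2 a2);		\
	long sys_##name(const struct pt_regs *regs)			\
	{								\
		return __do_sys_##name((t1)regs->di, (t2)regs->si);	\
	}								\
	static inline long __do_sys_##name(t1 a1, t2 a2)

/* Toy "system call" (illustration only): adds its two arguments. */
MY_SYSCALL_DEFINE2(add, long, a, long, b)
{
	return a + b;
}

/* What an in-kernel caller would have to do to reach sys_add():
 * build a fake register set and fill in the right slots by hand. */
long call_sys_add_from_kernel(long a, long b)
{
	struct pt_regs myregs = { 0 };

	myregs.di = (unsigned long)a;	/* first argument register  */
	myregs.si = (unsigned long)b;	/* second argument register */
	return sys_add(&myregs);
}
```

The only externally visible symbol for the call is sys_add(), and its
signature no longer says anything about the real arguments, which is
why calling it from other kernel code stops being reasonable.
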
Many of them end up being things that could be improved on further
anyway (ie there's discussion about further cleanup and trying to
avoid using "set_fs()" for arguments etc, because there already exist
helper functions that take the kernel-space versions, and the
sys_xyz() version is actually just going through stupid extra work for
a kernel user).

                 Linus

(*) The "really quite easy" is only true on 64-bit architectures.
32-bit architectures have issues with packing 64-bit values into two
registers, so using macro expansion with just the number of arguments
doesn't work.
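
As a rough illustration of the footnote, the sketch below models a
32-bit register file in plain C. The 'struct regs32' layout, the
arg64() helper, and the toy call are all hypothetical; real 32-bit ABIs
differ in word order, and some also require a 64-bit value to start on
an even register. The point is only that the Nth C argument is no
longer in the Nth register once a 64-bit value is involved.

```c
#include <stdint.h>

/* Hypothetical 32-bit register file: six 32-bit argument slots. */
struct regs32 {
	uint32_t arg[6];
};

/* Reassemble a 64-bit value from two consecutive 32-bit slots.
 * Assumes low word first; this is an illustrative choice only. */
static inline uint64_t arg64(const struct regs32 *r, int lo_slot)
{
	return ((uint64_t)r->arg[lo_slot + 1] << 32) | r->arg[lo_slot];
}

/* Toy call with the C signature (int fd, uint64_t offset, uint32_t len):
 * three arguments, but four register slots are consumed. */
uint64_t toy_end_offset(const struct regs32 *r)
{
	uint64_t offset = arg64(r, 1);	/* second argument: slots 1 and 2 */
	uint32_t len = r->arg[3];	/* third argument: slot 3, not 2  */

	return offset + len;
}
```

A macro that only knows "this call has three arguments" would read the
third one from slot 2 here, which is the high half of the offset.
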