Thomas Gleixner <[email protected]> writes:

> If seccomp overwrites regs->eax and aborts any syscall (including -1) by
> returning -1, then the value seccomp wrote into regs->eax is preserved
> and returned to user space.
>
> The same applies for syscall_user_dispatch() and ptrace...() if they
> decide to overwrite regs->eax _and_ abort the syscall by letting
> syscall_enter_from_user_mode() return -1.
>
> trace_syscall_enter() is not any different. If the magic BPF in there
> rewrites the syscall number to -1 then either the original -ENOSYS or
> the BPF induced overwrite is returned to user space.
>
> It's less than obvious and I have no objections to clean that up and
> make it more intuitive, but I still fail to see what Michal is actually
> trying to solve and what the magic flag is for. If s390 requires it,
> then that's an s390 problem, but definitely x86 does not.

The difference between x86 and s390 is that on s390, regs->gprs[2] is
used for both the syscall number and the syscall return value. That was
a design mistake made early on, about 25 years ago, but it's ABI now, so
it cannot be changed.

When seccomp decides to skip a syscall, it writes a return value into
regs->gprs[2]. When syscall_enter_from_user_mode_work() returns, it
returns this value. If it's negative, all is good: the
'if (likely(nr < NR_syscalls))' condition would just catch it and skip
the syscall. But if it's a positive number, the code cannot distinguish
whether that's a return value or a syscall number.

So I introduced PIF_SYSCALL_RET_SET when converting s390 to generic
entry. This flag tells the syscall code that a return value was set in
pt_regs and the syscall should be skipped.

I'd like to see something like the change from Michal going in, cleaned
up of course. It would allow us to get rid of PIF_SYSCALL_RET_SET.
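
To make the ambiguity concrete, here is a small userspace model of the
problem. It is only a sketch and not the actual s390 entry code: the
struct, the model_* helpers, MODEL_NR_SYSCALLS and MODEL_RET_SET are all
made up for illustration, with MODEL_RET_SET standing in for
PIF_SYSCALL_RET_SET. The range check is written with a signed value and
an explicit lower bound, which is an assumption of the model.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code: all names and values are illustrative. */
#define MODEL_NR_SYSCALLS 512L
#define MODEL_RET_SET     0x1UL   /* stands in for PIF_SYSCALL_RET_SET */

struct model_regs {
	long gprs[16];          /* gprs[2]: syscall number on entry, return value on exit */
	unsigned long flags;
};

/* Models seccomp (or ptrace) skipping the syscall and setting a return value. */
static void model_skip_syscall(struct model_regs *regs, long retval)
{
	regs->gprs[2] = retval;         /* overwrites the syscall number */
	regs->flags |= MODEL_RET_SET;   /* records that gprs[2] is now a return value */
}

/*
 * Decision based on the register alone: a negative value falls out of
 * range and is skipped, but a small positive return value looks exactly
 * like a valid syscall number.
 */
static bool model_dispatch_by_nr(const struct model_regs *regs)
{
	long nr = regs->gprs[2];

	return nr >= 0 && nr < MODEL_NR_SYSCALLS;
}

/* Decision that consults the flag first, so a set return value is never dispatched. */
static bool model_dispatch_with_flag(const struct model_regs *regs)
{
	if (regs->flags & MODEL_RET_SET)
		return false;
	return model_dispatch_by_nr(regs);
}
```

With a skipped syscall whose return value is 5, model_dispatch_by_nr()
reports that a syscall should be dispatched, while
model_dispatch_with_flag() correctly reports that it should not. With a
return value of -1, both agree, which is why the negative case was never
a problem.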
