Thomas Gleixner <[email protected]> writes:

> If seccomp overwrites regs->eax and aborts any syscall (including -1) by
> returning -1, then the value seccomp wrote into regs->eax is preserved
> and returned to user space.
>
> The same applies for syscall_user_dispatch() and ptrace...() if they
> decide to overwrite regs->eax _and_ abort the syscall by letting
> syscall_enter_from_user_mode() return -1.
>
> trace_syscall_enter() is not any different. If the magic BPF in there
> rewrites the syscall number to -1 then either the original -ENOSYS or
> the BPF induced overwrite is returned to user space.
>
> It's less than obvious and I have no objections to clean that up and
> make it more intuitive, but I still fail to see what Michal is actually
> trying to solve and what the magic flag is for. If s390 requires it,
> then that's an s390 problem, but definitely x86 does not.

The difference between x86 and s390 is that on s390, regs->gprs[2] is
used for both the syscall number and the syscall return value.
That was a design mistake made early on, about 25 years ago, but it's
ABI now, so it cannot be changed.

When seccomp decides to skip a syscall, it writes a return value into
regs->gprs[2]. When syscall_enter_from_user_mode_work() returns, it
returns this number. If it's negative all is good - the 'if (likely(nr <
NR_syscalls))' condition would just catch it and skip the syscall.

But if it's a positive number, the code cannot distinguish whether
that's a return value or a syscall number.

So I introduced PIF_SYSCALL_RET_SET when converting s390 to generic
entry. This flag tells the syscall code that a return value was set in
ptregs and the syscall should be skipped.
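
To make the ambiguity concrete, here is a minimal userspace sketch. It is
not the actual s390 entry code: struct toy_regs, TOY_NR_SYSCALLS and all
toy_* helpers are made up for illustration, with gpr2 standing in for
regs->gprs[2] and ret_set standing in for PIF_SYSCALL_RET_SET.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical syscall table size, for illustration only. */
#define TOY_NR_SYSCALLS 500UL

/* Toy stand-in for pt_regs: one register carries both the syscall
 * number on entry and the return value on exit. */
struct toy_regs {
	unsigned long gpr2;	/* stands in for regs->gprs[2] */
	bool ret_set;		/* stands in for PIF_SYSCALL_RET_SET */
};

/* Models seccomp (or ptrace) skipping the syscall: the return value
 * overwrites the syscall number, and the flag records that fact. */
static void toy_skip_syscall(struct toy_regs *regs, long retval)
{
	regs->gpr2 = (unsigned long)retval;
	regs->ret_set = true;
}

/* Decision based on the register value alone, like a bare
 * 'nr < NR_syscalls' check. */
static bool toy_dispatch_by_value(const struct toy_regs *regs)
{
	return regs->gpr2 < TOY_NR_SYSCALLS;
}

/* Decision that also honours the "return value was set" flag. */
static bool toy_dispatch_with_flag(const struct toy_regs *regs)
{
	return !regs->ret_set && regs->gpr2 < TOY_NR_SYSCALLS;
}

static void toy_demo(void)
{
	struct toy_regs regs = { .gpr2 = 0, .ret_set = false };

	/* Negative return value: huge as unsigned, so the range check
	 * alone already skips the syscall. */
	toy_skip_syscall(&regs, -1);
	assert(!toy_dispatch_by_value(&regs));

	/* Positive return value: looks like a valid syscall number, so
	 * the range check alone would wrongly dispatch it. */
	toy_skip_syscall(&regs, 42);
	assert(toy_dispatch_by_value(&regs));
	assert(!toy_dispatch_with_flag(&regs));
}
```

With a positive return value only the flag-aware check skips the
syscall, which is the case the flag exists for.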

I'd like to see something like the change from Michal going in - cleaned
up of course. It would allow us to get rid of PIF_SYSCALL_RET_SET.
