On Fri, Jul 13, 2012 at 7:48 PM, Will Drewry <w...@chromium.org> wrote:
> On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski <l...@mit.edu> wrote:
>> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry <w...@chromium.org> wrote:
>>> If a seccomp filter program is installed, older static binaries and
>>> distributions with older libc implementations (glibc 2.13 and earlier)
>>> that rely on vsyscall use will be terminated regardless of the filter
>>> program policy when executing time, gettimeofday, or getcpu.  This is
>>> only the case when vsyscall emulation is in use (vsyscall=emulate is the
>>> default).
>>>
>>> This patch emulates system call entry inside a vsyscall=emulate by
>>> populating regs->ax and regs->orig_ax with the system call number prior
>>> to calling into seccomp such that all seccomp-dependencies function
>>> normally.  Additionally, system call return behavior is emulated in line
>>> with other vsyscall entrypoints for the trace/trap cases.
>>>
>>> Reported-by: Owen Kibel <qme...@gmail.com>
>>> Signed-off-by: Will Drewry <w...@chromium.org>
>>>
>>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)
>>
>>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>>> long address)
>>>
>>>         current_thread_info()->sig_on_uaccess_error = 
>>> prev_sig_on_uaccess_error;
>>>
>>> +       if (skip) {
>>> +               if ((long)regs->ax <= 0L) /* seccomp errno emulation */
>>> +                       goto do_ret;
>>> +               goto done; /* seccomp trace/trap */
>>> +       }
>>> +
>>>         if (ret == -EFAULT) {
>>>                 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>>>                 warn_bad_vsyscall(KERN_INFO, regs,
>>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>>> long address)
>>>
>>>         regs->ax = ret;
>>>
>>> +do_ret:
>>>         /* Emulate a ret instruction. */
>>>         regs->ip = caller;
>>>         regs->sp += 8;
>>> -
>>> +done:
>>>         return true;
>>>
>>>  sigsegv:
>>> --
>>> 1.7.9.5
>>>
>>
>> This has the same odd property as the sigsegv path that the faulting
>> instruction will appear to be the mov, not the syscall.  That seems to
>> be okay, though -- various pieces of code that try to restart the segv
>> are okay with that.
>
> Yeah - I would otherwise do
>   regs->ip += 9;
> but I wanted to match the code that was therefor SIGSEGV.  If regs->ip
> += 9 _just_ for the SIGSYS case is fine, then I'll make that change
> shortly.  Since any code that sees the vsyscall address should be wise
> enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
> problem so far.

I dashed this off without more thought. It's best to leave it as is
because any return to the emulated page will cause a vsyscall fault
event.

>> Is there any code that assumes that changing rax (i.e. the syscall
>> number) and restarting a syscall after SIGSYS will invoke the new
>> syscall?  (The RET_TRACE path might be similar -- does the
>> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
>> a chance to synchronously cancel or change the syscall?
>
> Unfortunately, it does in normal interception. I don't see any way out
> of that quirk with vsyscall=emulate.  As is without seccomp,
> vsyscall=emulate doesn't allow ptrace interception (or syscall
> auditing for that matter) while vsyscall=native does.   So the option
> here is to document the quirky interaction in
> Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
> sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
> it will know it can't rewrite or bypass the call.    Is there a better
> option?
>
> Given that, I will include a tweak to the documentation to indicate
> that behavior so that userspace authors of BPF programs that use
> SECCOMP_RET_TRACE will be aware of the behavior.
>
>> If those issues aren't problems, then:
>>
>> Reviewed-by: Andy Lutomirski <l...@amacapital.net>
>>
>> (If the syscall number needs to change after the fact in the
>> SECCOMP_RET_TRAP case, it'll be a mess.)
>
> Nah - traps are delivered like the forced sigsegv path.
>
> I'll spin a v3 soon including the documentation tweak and the ip
> offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the
> skip case).  Of course, any better ideas for the trace-case will be
> more than welcome, but it seems to me to be an acceptable tradeoff - I
> hope others agree.
>
> I'll make the changes and then put it through its paces to see if any
> other little idiosyncrasies emerge.

I've written up a documentation patch to accompany this one. It
reflects one more change I've made in a v3 of the patch, but it is
optional.  I've added support for SECCOMP_RET_TRACE to still
skip/emulate the system call if it desires.  In v2 it can't.  Either
way is fine in practice, but I'd need to change the accompanying
documentation.

thanks again!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to