On Fri, Feb 7, 2025 at 5:20 PM Eyal Birger <eyal.bir...@gmail.com> wrote:
> On Fri, Feb 7, 2025 at 7:27 AM Jann Horn <ja...@google.com> wrote:
> >
> > On Sun, Feb 2, 2025 at 5:29 PM Eyal Birger <eyal.bir...@gmail.com> wrote:
> > > uretprobe(2) is an performance enhancement system call added to improve
> > > uretprobes on x86_64.
> > >
> > > Confinement environments such as Docker are not aware of this new system
> > > call and kill confined processes when uretprobes are attached to them.
> >
> > FYI, you might have similar issues with Syscall User Dispatch
> > (https://docs.kernel.org/admin-guide/syscall-user-dispatch.html) and
> > potentially also with ptrace-based sandboxes, depending on what kinda
> > processes you inject uprobes into. For Syscall User Dispatch, there is
> > already precedent for a bypass based on instruction pointer (see
> > syscall_user_dispatch()).
>
> Thanks. This is interesting.
>
> Do you know of confinement environments using this?

Not for Syscall User Dispatch; I think that was mostly intended for
stuff like emulating Windows syscalls in WINE. I'm not sure who
actually uses it, I just know a bit about the kernel side of it.

From what I know, ptrace sandboxing is a technique used by some
configurations of gVisor
(https://gvisor.dev/docs/architecture_guide/platforms/#ptrace), though
now I see that that page says that this configuration is no longer
supported. I am also not sure whether you'd ever have uprobes
installed in files from which instructions are executed in this
environment.

> > > Since uretprobe is a "kernel implementation detail" system call which is
> > > not used by userspace application code directly, pass this system call
> > > through seccomp without forcing existing userspace confinement environments
> > > to be changed.
> >
> > This makes me feel kinda uncomfortable. The purpose of seccomp() is
> > that you can create a process that is as locked down as you want; you
> > can use it for some light limits on what a process can do (like in
> > Docker), or you can use it to make a process that has access to
> > essentially nothing except read(), write() and exit_group(). Even
> > stuff like restart_syscall() and rt_sigreturn() is not currently
> > excepted from that.
>
> Yes, this has been discussed at length in the threads mentioned
> in the "Link" tags.
>
> >
> > I guess your usecase is a little special in that you were already
> > calling from userspace into the kernel with SWBP before, which is also
> > not subject to seccomp; and the syscall is essentially an
> > arch-specific hack to make the SWBP a little faster.
>
> Indeed. The uretprobe mechanism wasn't enforced by seccomp before
> this syscall. This change preserves this.
>
> >
> > If we do this, we should at least ensure that there is absolutely no
> > way for anything to happen in sys_uretprobe when no uretprobes are
> > configured for the process - the first check in the syscall
> > implementation almost does that, but the implementation could be a bit
> > stricter. It checks for "regs->ip != trampoline_check_ip()", but if no
> > uprobe region exists for the process, trampoline_check_ip() returns
> > `-1 + (uretprobe_syscall_check - uretprobe_trampoline_entry)`. So
> > there is a userspace instruction pointer near the bottom of the
> > address space that is allowed to call into the syscall if uretprobes
> > are not set up. Though the mmap minimum address restrictions will
> > typically prevent creating mappings there, and
> > uprobe_handle_trampoline() will SIGILL us if we get that far without a
> > valid uretprobe.
>
> I'm not sure I understand your point. If creating mappings in that
> area is prevented, what is the issue?

It is usually prevented, not always - root can do it depending on
system configuration. Also, in a syscall like this that will be
reachable in every sandbox, I think we should try to be more careful
about edge cases and avoid things like this offset calculation on
address -1.

> also, this would be related to the
> uretprobe syscall implementation in general, no?

Yes. I just think it is relevant to the seccomp change because
excepting a syscall from seccomp makes it more important that that
syscall is robust and correct.

> To me this seems unrelated to the seccomp change.
> Jiri, do you have any input on this?
>
> Thanks!
> Eyal.
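
The offset calculation on address -1 discussed above can be sketched in
plain userspace C. This is only an illustration under stated
assumptions, not the kernel's code: NO_TRAMPOLINE_VADDR, CHECK_OFFSET
(and its value of 12) and the three helper functions are hypothetical
names standing in for the "no uprobe region" sentinel and for
(uretprobe_syscall_check - uretprobe_trampoline_entry).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for "no uprobe region exists" (address -1). */
#define NO_TRAMPOLINE_VADDR UINT64_MAX

/* Hypothetical byte offset of the syscall check point in the trampoline. */
#define CHECK_OFFSET 12u

/* Models the check as described in the thread: base + offset, with no
 * test for the sentinel. Unsigned addition wraps modulo 2^64. */
static uint64_t check_ip_loose(uint64_t tramp_vaddr)
{
	return tramp_vaddr + CHECK_OFFSET;
}

/* Loose comparison: accepts whatever address the sum happens to give. */
static bool ip_allowed_loose(uint64_t ip, uint64_t tramp_vaddr)
{
	return ip == check_ip_loose(tramp_vaddr);
}

/* One possible stricter variant (an assumption, not a proposed patch):
 * refuse outright when no trampoline exists for the process. */
static bool ip_allowed_strict(uint64_t ip, uint64_t tramp_vaddr)
{
	if (tramp_vaddr == NO_TRAMPOLINE_VADDR)
		return false;
	return ip == tramp_vaddr + CHECK_OFFSET;
}
```

With these assumed values, check_ip_loose(NO_TRAMPOLINE_VADDR) wraps
around to 11, an address near the bottom of the address space. The
loose comparison accepts that instruction pointer even though no
trampoline exists, while the strict variant rejects it and still
accepts a caller at a real trampoline base plus the offset.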