Jürg Billeter <j...@bitron.ch> writes:

> On Tue, 2017-10-03 at 09:46 -0500, Eric W. Biederman wrote:
>> There is a general need to find out about the death of other processes,
>> if you are not the parent of the process. I would be inclined to call
>> it waitfd. Something that you give a pid. It performs a permission
>> check and the pid becomes readable when the process dies. With poll
>> working on the fd, and the fd returning wstatus of the dead child.
>>
>> Support SIGIO on the fd and you have a signal delivery mechanism,
>> if you want it.
>
> File descriptors for processes (waitfd/clonefd) are definitely
> interesting. Especially if reaping the process (and reparenting its
> children) is delayed until the last process file descriptor is closed.
> However, this would be a much larger addition and also less intuitive
> to use if all you want is killing the process tree.
>
>> For the kill all children when the parent dies the mechanism you are
>> proposing is escapable. We already have an inescapable version of it
>> with init in a pid namespace. We already have an escapable version of
>> it with orphaned process groups and SIGHUP.
>>
>> So I would really appreciate a very clear use case for what we are
>> building here. As it appears the killing of children can already be
>> done another way, and that the waiting for the parent can be done better
>> another way.
>
> My use case is to provide a way for a process to spawn a child and
> ensure that no descendants survive when that child dies. Avoiding
> runaway processes is desirable in many situations. My motivation is
> very lightweight (nested) sandboxing (every process is potentially
> sandboxed).
>
> I.e., pid namespaces would be a pretty good fit (assuming they are
> sufficiently lightweight) but CLONE_NEWPID requires CAP_SYS_ADMIN.
> User namespaces can help here, but creating tons of user namespaces
> just for this doesn't sound sensible. MAX_PID_NS_LEVEL could be an
> issue as well at some point but 32 levels are likely fine in practice.
>
> For my particular scenario I may actually be able to create a single
> user namespace, run all processes with (namespaced) CAP_SYS_ADMIN and
> use CLONE_NEWPID for every process. However, I would prefer not
> requiring CAP_SYS_ADMIN and a regular application that wants to avoid
> runaway processes for a spawned helper process cannot rely on
> CAP_SYS_ADMIN.
>
> My plan was to use PR_SET_PDEATHSIG_PROC with PR_NO_NEW_PRIVS and a
> suitable seccomp filter to prevent changes to pdeath_signal_proc. For
> my SIGKILL use case it would be even better to simply require
> PR_NO_NEW_PRIVS and make pdeath_signal_proc sticky, avoiding the need
> for seccomp. I wanted to keep the differences to the existing
> PR_SET_PDEATHSIG minimal but if we argue that the non-SIGKILL use case
> is better solved with waitfd (or maybe the process events connector),
> we could tailor the prctl for the SIGKILL use case (or support both via
> prctl arg3).
>
> I have another small patch locally that adds a prctl that restricts
> kill(2) to direct children of the current thread group for lightweight
> sandboxing. That would also be redundant if it was possible to use
> CLONE_NEWPID for every process.

I believe the current default limits allow using CLONE_NEWPID for every
process. The data structures seem light enough as well.

> What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN?
> Does CLONE_NEWPID pose any risks that don't exist for
> CLONE_NEWUSER|CLONE_NEWPID? Assuming we can't simply drop the
> CAP_SYS_ADMIN requirement, do you see a better solution for this use
> case?

CLONE_NEWPID without a permission check would allow running a setuid
root application in a pid namespace. Off the top of my head I can't
think of a really good exploit. But when you mess up pid files and
hide information from a privileged application, I can completely
imagine forcing that application to misbehave in ways the attacker can
control, leading to bad things.

Eric
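
To make the "escapable" point concrete, here is a minimal sketch that is
not taken from the patch under discussion. It uses only the existing
PR_SET_PDEATHSIG and PR_GET_PDEATHSIG prctls, called through Python's
ctypes. The helper names (set_pdeathsig, get_pdeathsig, demo_escape) are
made up for illustration, and it assumes Linux with no seccomp filter
blocking prctl. It arms SIGKILL on parent death and then clears the
setting again, which any process is free to do for itself today.

```python
import ctypes
import os
import signal

# Option numbers from <linux/prctl.h>.
PR_SET_PDEATHSIG = 1
PR_GET_PDEATHSIG = 2

# dlopen(NULL): resolves prctl from the C library already loaded in-process.
libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2):
    """Call prctl(2) and raise OSError on failure (illustrative helper)."""
    zero = ctypes.c_ulong(0)
    if libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def set_pdeathsig(sig):
    """Request signal `sig` when the parent dies; 0 clears the request."""
    _prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)))


def get_pdeathsig():
    """Return the current parent-death signal number (0 if none is set)."""
    value = ctypes.c_int(0)
    _prctl(PR_GET_PDEATHSIG, ctypes.byref(value))
    return value.value


def demo_escape():
    """Arm SIGKILL-on-parent-death, then undo it as an uncooperative process could."""
    original = get_pdeathsig()
    try:
        set_pdeathsig(signal.SIGKILL)
        armed = get_pdeathsig()
        set_pdeathsig(0)  # the "escape": nothing prevents this second call
        cleared = get_pdeathsig()
    finally:
        set_pdeathsig(original)  # leave the calling process as we found it
    return armed, cleared


if __name__ == "__main__":
    armed, cleared = demo_escape()
    print(f"armed={armed} cleared={cleared}")
```

Preventing that second set_pdeathsig call is the job the seccomp filter,
or the sticky variant, would have in the plan quoted above.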