On Wed, Oct 31, 2018 at 2:25 PM, David Laight <david.lai...@aculab.com> wrote: > From: Daniel Colascione >> Sent: 31 October 2018 12:56 >> On Wed, Oct 31, 2018 at 12:27 PM, David Laight <david.lai...@aculab.com> >> wrote: >> > From: Daniel Colascione >> >> Sent: 29 October 2018 17:53 >> >> >> >> This patch adds a new file under /proc/pid, /proc/pid/exithand. >> >> Attempting to read from an exithand file will block until the >> >> corresponding process exits, at which point the read will successfully >> >> complete with EOF. The file descriptor supports both blocking >> >> operations and poll(2). It's intended to be a minimal interface for >> >> allowing a program to wait for the exit of a process that is not one >> >> of its children. >> > >> > Why do you need an extra file? >> >> Because no current file suffices. > > That doesn't stop you making something work on any/all of the existing files.
Why overload? >> > It ought to be possible to use poll() to wait for POLLERR having set >> > 'events' to zero on any of the nodes in /proc/pid - or even on >> > the directory itself. >> >> That doesn't actually work today. And waiting on a directory with >> POLLERR would be very weird, since directories in general don't do >> things like blocking reads or poll support. A separate file with >> self-contained, well-defined semantics is cleaner. > > Device drivers will (well ought to) return POLLERR when a device > is removed. > Making procfs behave the same way wouldn't be too stupid. That's overkill. You'd need to add poll support to files throughout /proc/pid, and I don't think doing so would add any new capabilities over keeping the changes localized to one new place. >> > Indeed, to avoid killing the wrong process you need to have opened >> > some node of /proc/pid/* (maybe cmdline) before sending the kill >> > signal. >> >> The kernel really needs better documentation of the semantics of >> procfs file descriptors. You're not the only person to think, >> mistakenly, that keeping a reference to a /proc/$PID/something FD >> reserves $PID and prevents it being used for another process. Procfs >> FDs do no such thing. kill(2) is unsafe whether or not >> /proc/pid/cmdline or any other /proc file is open. > > Interesting. > Linux 'fixed' the problem of pid reuse in the kernel by adding (IIRC) > 'struct pid' that reference counts the pid stopping reuse. Struct pid doesn't stop PID reuse. It just allows the kernel to distinguish between a stale and current reference to a given PID. In a sense, the "struct pid*" is the "real" name of a process and the numeric PID is just a convenient way to find a struct pid. > But since the pids are still allocated sequentially userspace can > still reference a pid that is freed and immediately reused. > I'd have thought that procfs nodes held a reference count on the 'struct pid'. > There's probably no reason why it shouldn't. > > Annoyingly non-GPL drivers can't release references to 'struct pid' so > are very constrained about which processes they can signal. If I had my way, I'd s/EXPORT_SYMBOL_GPL/EXPORT_SYMBOL/ in pid.c. These routines seem generally useful. As an alternative, I suppose drivers could just use the new /proc/pid/kill interface, with sufficient contortions, just as userspace can. > I also managed to use a stale 'struct pid' and kill the wrong process > - much more likely that the pid number being reused. That shouldn't be possible, unless by "stale 'struct pid'" you mean a struct pid* referring to a struct pid that's been released and reused for some other object (possibly a different struct pid instances), and that's UB. > If you look at the NetBSD pid allocator you'll see that it uses the > low pid bits to index an array and the high bits as a sequence number. > The array slots are also reused LIFO, so you always need a significant > number of pid allocate/free before a number is reused. > The non-sequential allocation also makes it significantly more difficult > to predict when a pid will be reused. > The table size is doubled when it gets nearly full. NetBSD is still just papering over the problem. The real issue is that the whole PID-based process API model is unsafe, and a clever PID allocator doesn't address the fundamental race condition. As long as PID reuse is possible at all, there's a potential race condition, and correctness depends on hope. The only way you could address the PID race problem while not changing the Unix process API is by making pid_t ridiculously wide so that it never wraps around.