On Tue, Mar 19, 2019 at 3:14 PM Christian Brauner <christ...@brauner.io> wrote:
> So I dislike the idea of allocating new inodes from the procfs super
> block. I would like to avoid pinning the whole pidfd concept exclusively
> to proc. The idea is that the pidfd API will be useable through procfs
> via open("/proc/<pid>") because that is what users expect and really
> wanted to have for a long time. So it makes sense to have this working.
> But it should really be useable without it. That's why translate_pid()
> and pidfd_clone() are on the table. What I'm saying is, once the pidfd
> api is "complete" you should be able to set CONFIG_PROCFS=N - even
> though that's crazy - and still be able to use pidfds. This is also a
> point akpm asked about when I did the pidfd_send_signal work.

I agree that you shouldn't need CONFIG_PROCFS=Y to use pidfds. One
crazy idea that I was discussing with Joel the other day is to just
make CONFIG_PROCFS=Y mandatory and provide a new get_procfs_root()
system call that returned, out of thin air and independent of the
mount table, a procfs root directory file descriptor for the caller's
PID namespace and suitable for use with openat(2).

C'mon: /proc is used by everyone today and almost every program breaks
if it's not around. The string "/proc" is already de facto kernel ABI.
Let's just drop the pretense of /proc being optional and bake it into
the kernel proper, then give programs a way to get to /proc that isn't
tied to any particular mount configuration. This way, we don't need a
translate_pid(), since callers can just use procfs to do the same
thing. (That is, if I understand correctly what translate_pid does.)

We still need a pidfd_clone() for atomicity reasons, but that's a
separate story. My goal is to be able to write a library that
transparently creates and manages a helper child process even in a
"hostile" process environment in which some other uncoordinated thread
is constantly doing a waitpid(-1) (e.g., the JVM).

> So instead of going through proc we should probably do what David has
> been doing in the mount API and come to rely on anon_inode. So
> something like:
>
> fd = anon_inode_getfd("pidfd", &pidfd_fops, file_priv_data, flags);
>
> and stash information such as pid namespace etc. in a pidfd struct or
> something that we then can stash in file->private_data of the new file.
> This also lets us avoid all this open coding done here.
> Another advantage is that anon_inodes is its own kernel-internal
> filesystem.

Sure. That works too.
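
To make the get_procfs_root() idea a bit more concrete, here is a rough
userspace sketch of how a caller might use such a root fd with openat(2).
get_procfs_root() does not exist, so get_procfs_root_emulated() below is
a hypothetical stand-in that simply opens the mounted /proc, which is
exactly the mount-table dependency a real system call would avoid. The
helper names open_pid_dir() and read_pid_name() are also made up for
illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed get_procfs_root() system call.
 * ASSUMPTION: the real call would hand back a procfs root fd for the
 * caller's PID namespace without consulting the mount table. Here we
 * can only emulate it by opening the mounted /proc.
 */
static int get_procfs_root_emulated(void)
{
	return open("/proc", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Open the per-process directory <pid> relative to the procfs root fd. */
static int open_pid_dir(int procfs_root, pid_t pid)
{
	char name[32];

	snprintf(name, sizeof(name), "%ld", (long)pid);
	return openat(procfs_root, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/*
 * Read the first line of "status" relative to the per-process directory
 * fd into buf. Returns 0 on success, -1 on error.
 */
static int read_pid_name(int pid_dir, char *buf, size_t len)
{
	ssize_t n;
	char *nl;
	int fd;

	if (len == 0)
		return -1;

	fd = openat(pid_dir, "status", O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;

	n = read(fd, buf, len - 1);
	close(fd);
	if (n <= 0)
		return -1;

	buf[n] = '\0';
	nl = strchr(buf, '\n');
	if (nl)
		*nl = '\0';
	return 0;
}
```

A caller would get the root fd once, call open_pid_dir(root, pid) to get
a directory fd for the process, and then do every further lookup with
openat(2) relative to that fd, so no path string containing "/proc" is
needed after the first step.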