Hi all,

I am happy to see this thread appear. I emailed Christian and others ~5 years
ago about this in this thread[1]; it would be great to see it finally happen!

I very much agree that the new process spawning should be pidfd based. I also
want to emphasize that the crux of the matter is that code needed to set up the
initial unscheduled process --- which I do think should be "real state" and
more than a mere template --- is currently chopped up between clone and exec.
So the real meat of the implementation would be factoring out a bunch of stuff
so it can be reused in both the legacy clone+exec and modern code paths.

I'll say a bit more about this "real state" vs "mere template" distinction,
which is that the latter is effectively some sort of ad-hoc operation batching
language, and always runs the risk of falling behind what the kernel actually
supports. The "real state" approach, where we have honest-to-goodness process
state, just in some partially initialized fashion and thus it's not yet
scheduled, always supports everything the kernel supports in principle.

Yes, alternative syscalls that specify which "embryonic" process (as opposed to
always the current active process) need to be created, but that is less bad
than trying to stuff things into flags etc. for a single existing system call,
and also one can imagine a world (as described in
https://catern.com/rsys21.pdf) where the exact "which process?" parameter
starts getting added to new process modifying machinery by *default*, with a
sentinel value analogous to `AT_FDCWD` used to mean "the current process" for
the legacy used-between-fork-and-exec usecase.

---

Anyways, years ago, after taking a glance at the relevant code in Linux and
FreeBSD, I figured that it would be easier for me personally to first implement
this functionality in FreeBSD, and then, once I had a feel for some of the
refactoring, take a stab at it in Linux. This is because Linux's feature set,
especially things like `binfmt_misc`, makes its clone and exec quite a bit more
complex, and thus the (IMO) necessary heavy refactoring quite a bit more
extensive too.

I never got around to it in the 5 years, but these days, with LLMs, doing an
"exploratory refactor" (to get a sketch of a patch that is fodder for discussion
not yet fit for actual submission) is much easier. So inspired by this thread, I
took a few hours to do the exploratory FreeBSD refactor in [2]. The man page for
the new syscalls, [3], might be a good place to start reading. (This, being from
a FreeBSD patch, describes the change in terms of "proc fds", but the switch to
Linux's "pidfds" should be self-explanatory. The former after all inspired the
latter.)

Hope discussion of such a patch isn't too off topic here, but there is an
interesting thing to note that would also apply to a Linux implementation. It
took *more* factored out helper functions than I thought. The current count is
over 15(!) --- there didn't seem to be a way to build both the old and new way
of doing things with fewer, coarser building blocks. Now, granted, maybe
someone more familiar with either kernel than me could do a better job, but I
think it will still be a number of functions. This indicates just how much
untangling there is to do. And the number will surely be much higher for Linux.

[1]: 
https://lore.kernel.org/all/[email protected]/

[2]: https://github.com/obsidiansystems/freebsd-src/commit/better-proc-spawn
     239dcdefe6ad244e58d998155b527375e5293ff7 for posterity

[3]: 
https://raw.githubusercontent.com/obsidiansystems/freebsd-src/refs/heads/better-proc-spawn/lib/libsys/proc_new.2


On Sun, May 31, 2026, at 10:47 PM, Li Chen wrote:
> Hi Christian,
>
> Thanks a lot for your great review!
>
> ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner 
> <[email protected]> wrote ---
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already.
>
> Yes, that makes a lot more sense. I was staring too hard at the "hot
> executable" part and made the cache/template the API, which was probably
> the wrong thing to expose. Sorry about that.
>
> > Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
>
> That's so cool, and this is a really useful point. I had not thought about 
> this as
> something that could sit under posix_spawn(), but that makes the target
> much clearer. It should be a generic exec/spawn builder first, and the
> agent use case should just be one user of it.
>
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
>
> Reusing pidfd_open() with an empty target is nice because it keeps the API 
> close
> to pidfds, but I wonder if a separate entry point such as
> pidfd_spawn_open() or pidfd_create() would make the "new process
> builder" case a bit more explicit? Either way, the configuration side
> being fsconfig-like makes sense to me.

Yeah check out my syscalls [3] on that front. It's important to design the
workflow / state machine in a good way. Performance/efficiency, security (share
less state/privileges by default!), and extensibility (where will newer
concepts, like a new type of namespace, fit in?) are all competing concerns,
but I think they mostly pull in the same direction. (Only no ambient authority,
back compat, and extensibility exist in some tension.)

> Thanks again for pointing me in this direction. It helps a lot.
>
> Regards,
> Li

Glad you are sold on pidfds, and more broadly, best of luck! You'll be a hero
to everyone else that has wanted this over the years :)

John

Reply via email to