On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <[email protected]> wrote:
> This problem is dear to my heart and I have been pondering it on and off
> for some time now. The entire fork + exec idiom is terrible and needs to
> be retired.

It seems to me like vfork+exec is a decent UAPI building block, on which
you can build nice-looking userspace APIs, though I agree that it is not
an ideal direct interface for application code.

> Additionally there is a known problem where transiently copied file
> descriptors on fork + exec cause a headache in multithreaded programs
> doing something like this in parallel. I only did cursory reading, it
> seems your patchset keeps the same problem in place.

I think we almost have UAPI that would let you avoid this issue? You can
use clone() with CLONE_FILES, then unshare the FD table with
close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
implemented to be atomic with respect to what happens on other threads,
but we could change that. It also doesn't provide a good way to carry
some FDs across, but it feels to me like that could be fixed with a
variant of close_range() that removes O_CLOEXEC FDs except ones listed in
an array.

> There are numerous impactful ways to speed up execs both in terms of
> single-threaded cost and their multicore scalability, most of which
> would be immediately usable by all programs without an opt-in. imo these
> needs to be exhausted before something like a "template" can be
> considered.

(I think probably a large part of this would be stuff that happens in
userspace, like dynamic linking.)

> Per the above, the primary win would stem from *NOT* messing with mm.

As you write below, I think we have that with CLONE_VM? The C function
vfork() is kind of a terrible API because of its returns-twice behavior,
but I think if process cloning with CLONE_VM|CLONE_VFORK was wrapped by
libc in a way similar to clone() (with the child executing a separate
handler function), or if it was used in the implementation of some
higher-level process-spawning API, it would be a perfectly fine API? Or
am I misunderstanding what you mean by "messing with mm"?

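To make the clone()-with-a-handler idea above concrete, here is a minimal
sketch in C, not a proposal for an actual libc interface. The names
spawn_vfork(), child_fn(), wait_exit_code() and struct spawn_args are
hypothetical, the 64 KiB child stack is an arbitrary choice, and the
fallback syscall number 436 is an assumption that holds for the common
architectures. It assumes Linux 5.9 or later for close_range(), and it
leaves out the signal masking a real spawn implementation would want.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers (assumption: generic Linux UAPI values). */
#ifndef SYS_close_range
#define SYS_close_range 436
#endif
#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)
#endif

/* Hypothetical argument bundle handed to the child handler. */
struct spawn_args {
    const char *path;
    char *const *argv;
    char *const *envp;
};

/* Runs in the child, on its own stack, while still sharing the parent's mm. */
static int child_fn(void *p)
{
    const struct spawn_args *a = p;

    /* Unshare the FD table, then close every FD from 3 upwards in the copy. */
    if (syscall(SYS_close_range, 3L, (long)UINT_MAX,
                (long)CLOSE_RANGE_UNSHARE) != 0)
        _exit(126);

    execve(a->path, a->argv, a->envp);
    _exit(127); /* only reached if execve() failed */
}

/* Returns the child's PID, or -1 with errno set. */
static pid_t spawn_vfork(const char *path, char *const argv[],
                         char *const envp[])
{
    const size_t stack_size = 64 * 1024; /* arbitrary size for this sketch */
    struct spawn_args args = { path, argv, envp };
    char *stack;
    pid_t pid;
    int saved_errno;

    assert(path != NULL && argv != NULL);
    stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* The stack grows down on the common architectures, so pass its top. */
    pid = clone(child_fn, stack + stack_size,
                CLONE_VM | CLONE_VFORK | CLONE_FILES | SIGCHLD, &args);

    /* CLONE_VFORK: the child has already exec'd or exited when we get here. */
    saved_errno = errno;
    free(stack);
    errno = saved_errno;
    return pid;
}

/* Reap the child; returns its exit code, or -1 if it did not exit normally. */
static int wait_exit_code(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would do something like spawn_vfork("/bin/sh", argv, envp) and
then wait_exit_code(pid). Because the child unshares the FD table before
closing anything, the parent's descriptors stay open. The race with FDs
opened concurrently by other threads is still there in this sketch.
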
> As in, whatever the interface, it needs to create an "empty" target
> process (for lack of a better term).
>
> In terms of userspace-visible APIs, a clean solution escapes me.

I think we already have a relatively good API for this - you can use
clone() to create something that initially shares almost all the state
that a thread would, and then incrementally unshare resources and go
through execve().
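
The "share first, unshare incrementally" pattern can be sketched roughly
as follows. This is only an illustration under assumptions:
spawn_in_dir(), dir_child_fn() and struct dir_spawn_args are hypothetical
names, the stack size is arbitrary, and only the filesystem context and
the FD table are shared and unshared here (not signal handlers or other
per-thread state).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h> /* waitpid(), used by callers to reap the child */
#include <unistd.h>

/* Hypothetical argument bundle for the child handler. */
struct dir_spawn_args {
    const char *dir;  /* working directory for the new program */
    int stdout_fd;    /* FD to install as stdout, or -1 to leave it alone */
    const char *path;
    char *const *argv;
    char *const *envp;
};

/* Starts out sharing mm, fs context and FD table with the parent. */
static int dir_child_fn(void *p)
{
    const struct dir_spawn_args *a = p;

    /* Step 1: private cwd/root/umask, so chdir() does not hit the parent. */
    if (unshare(CLONE_FS) != 0 || chdir(a->dir) != 0)
        _exit(126);

    /* Step 2: private FD table, so dup2() does not hit the parent. */
    if (unshare(CLONE_FILES) != 0)
        _exit(126);
    if (a->stdout_fd >= 0 && dup2(a->stdout_fd, STDOUT_FILENO) < 0)
        _exit(126);

    /* Step 3: execve() gives the child its own mm and wakes the parent. */
    execve(a->path, a->argv, a->envp);
    _exit(127); /* only reached if execve() failed */
}

/* Returns the child's PID, or -1 with errno set. */
static pid_t spawn_in_dir(const char *dir, int stdout_fd, const char *path,
                          char *const argv[], char *const envp[])
{
    const size_t stack_size = 64 * 1024; /* arbitrary size for this sketch */
    struct dir_spawn_args args = { dir, stdout_fd, path, argv, envp };
    char *stack;
    pid_t pid;
    int saved_errno;

    assert(dir != NULL && path != NULL && argv != NULL);
    stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* The stack grows down on the common architectures, so pass its top. */
    pid = clone(dir_child_fn, stack + stack_size,
                CLONE_VM | CLONE_VFORK | CLONE_FS | CLONE_FILES | SIGCHLD,
                &args);

    /* CLONE_VFORK: the child has already exec'd or exited when we get here. */
    saved_errno = errno;
    free(stack);
    errno = saved_errno;
    return pid;
}
```

Each unshare() call comes right before the first modification of the
resource it protects, so nothing the child does becomes visible in the
parent, and no address space is copied at any point.
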

