Hi Mateusz,

 ---- On Thu, 28 May 2026 20:55:32 +0800 Mateusz Guzik <[email protected]> wrote ---
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > This RFC adds spawn_template, a userspace-controlled exec acceleration
> > mechanism for runtimes that repeatedly start the same executable with
> > different argv, envp, and per-spawn file descriptor setup.
> >
> > The main target is agent runtimes. Modern coding agents repeatedly start
> > short-lived helper tools such as rg, git, sed, awk, python, node, and
> > shell wrappers while they inspect and edit a workspace. Those runtimes
> > already know which tools are hot, and they are also the right place to
> > decide policy. The kernel does not choose names such as rg, git, or sed.
> > Userspace opts in by creating a template fd for one executable, then uses
> > that fd for later spawns. Launchers, shells, and build systems have a
> > similar repeated-startup shape and could use the same primitive, but the
> > agent runtime case is the main motivation for this RFC.
> >
> [..]
> > A typical agent runtime would keep one template per hot executable and
> > still build argv, envp, cwd, and pipe wiring for each tool call:
> >
> >   rg_tmpl = spawn_template_create("/usr/bin/rg");
> >
> >   for each search request:
> >       out_r, out_w = pipe_cloexec();
> >       err_r, err_w = pipe_cloexec();
> >       actions = [
> >           FCHDIR(worktree_fd),
> >           DUP2(out_w, STDOUT_FILENO),
> >           DUP2(err_w, STDERR_FILENO),
> >       ];
> >       child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
> >       close(out_w);
> >       close(err_w);
> >       read out_r and err_r;
> >       waitid(P_PIDFD, child.pidfd, ...);
> >
> >
> [..]
> > The cached state is intentionally small. The template fd keeps the opened
> > main executable file, an optional absolute path string, the creator
> > credential pointer, and the deny-write state. The executable identity key
> > records device, inode, size, mode, owner, ctime, and mtime, and is
> > rechecked before cached metadata is used.
> > The ELF cache keeps only the
> > main executable's ELF header, program header table, and program header
> > count.
> >
> >   cached in this RFC          not cached in this RFC
> >   ------------------          ----------------------
> >   opened main executable      PT_INTERP metadata
> >   executable identity key     shared-library graph
> >   main ELF header             VMA layout metadata
> >   main ELF program headers    cross-process metadata sharing
> >   creator cred pointer
> >   deny-write state
> >
> > This RFC does not cache ELF interpreter metadata, shared-library
> > dependency state, or derived mapping-layout state. Shared-library
> > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> > state. It also does not share cached executable metadata between template
> > fds created by different processes. Each template owns its small cached
> > metadata object in this RFC.
> >
> > Performance
> > ===========
> >
> [..]
> >   Workload    Calls   subprocess  spawn_template  time_s       Delta
> >   (workers)   calls   calls/s     calls/s         seconds
> >   1x16        6144    411.04      420.32          14.95/14.62  +2.26%
> >   2x8         6144    666.78      690.08          9.21/8.90    +3.49%
> >   4x4         6144    955.61      1003.25         6.43/6.12    +4.99%
> >   8x2         6144    1048.25     1069.18         5.86/5.75    +2.00%
> >
>
> This problem is dear to my heart and I have been pondering it on and off
> for some time now. The entire fork + exec idiom is terrible and needs to
> be retired.
>
> Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
> some time ago and it generated remarkably similar code. But that's a
> tangent.

Partly, yes. The original idea came from using agents myself and
noticing that they spend a lot of time starting short-lived tools such
as rg, sed, git, bash, and python. I wondered whether repeated tool
calls could be made cheaper. After that I used an LLM to work out the
smallest kernel prototype for the idea. I did some review, patch
splitting, testing, benchmarking, and leak-check work, and threw away
some cache code that was not actually useful.

> I'm rather confused by the angle in the patchset. Most of this shaves
> off a tiny amount of work, while retaining the primary avoidable reason
> for bad performance: the very fact that fork is part of the picture,
> especially the part mucking with mm. Creating a pristine process is the
> way to go.
>
> Additionally there is a known problem where transiently copied file
> descriptors on fork + exec cause a headache in multithreaded programs
> doing something like this in parallel. I only did cursory reading, it
> seems your patchset keeps the same problem in place.
>
> There are numerous impactful ways to speed up execs both in terms of
> single-threaded cost and their multicore scalability, most of which
> would be immediately usable by all programs without an opt-in. imo these
> need to be exhausted before something like a "template" can be
> considered.
>
> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As in, whatever the interface, it needs to create an "empty" target
> process (for lack of a better term).
>
> In terms of userspace-visible APIs, a clean solution escapes me.
>
> Some time ago I proposed returning a handle which is populated over time
> by the parent-to-be. One of the problems with it I failed to consider at
> the time is NUMA locality -- what if the process to be created is going
> to run on another domain? For example, opening and installing a file for
> its later use will result in avoidable loss of locality for some of the
> in-kernel data.
> That's on top of the fd vs fork problem.
>
> From perf standpoint, the final goal of whatever mechanism should be a
> state where the target process avoided copying any state it did not need
> to and which allocated any memory it needed from local NUMA node
> (whatever it may happen to be). Of course if no affinity is assigned it
> may happen to move again and lose such locality, nothing can be done
> about that. But pretend the process is to run in a specific node the
> parent is NOT running in.
>
> So I think the pragmatic way forward is to implement something close to
> posix_spawn in the kernel. It may make sense for the thing to take the
> PATH argument for repeated exec attempts. I understand this is of no use
> in your particular case, but it very much IS of use for most of the
> real-world. The initial implementation might even start with doing vfork
> just to get it off the ground.
>
> The next step would be to extend the interface with means to AVOID
> copying any file descriptors. There could be a dedicated file action
> which tells the kernel to avoid such copies or something like a
> close_range file action (or close_from) -- with a range like <0, INT_MAX>
> you know no fds are copied.
>
> For the NUMA angle to be sorted out, any file action which opens a file
> or dups from the parent needs to execute in the child. And frankly
> something would be needed to ask the scheduler where does it think the
> child is going to run, so that the task_struct itself can also be
> allocated with the right backing.
>
> I have not looked into what's needed to create a new process and NOT
> mess with mm, but I don't think there are unsolvable problems there, at
> worst some churn.
>
> There are of course other parameters which need to be sorted out, that's
> covered by the posix_spawn thing.
>
> This e-mail is long enough, so I'm not going to go into issues
> concerning exec itself right now.
>
> tl;dr I would suggest redoing the patchset as posix_spawn and then doing
> the actual optimization of not cloning mm itself.
>

Thanks a lot for writing this up. I clearly had too narrow a view of
the problem. I was mostly thinking about repeated executable startup,
but your reply, along with Christian's and Andy's, made me see that the
more useful target is probably a pidfd/pidfs-backed process builder
which can sit under posix_spawn, and then grow into something that
avoids the fork-shaped mm and fd costs. I learned a lot from this
thread.

At a high level, Windows CreateProcess/NtCreateUserProcess also looks
closer to this direction than fork+exec: create the target process
directly, pass explicit startup attributes and handle inheritance
state, and avoid starting from a copy of the parent address space. That
seems to be the same basic advantage here: build the child closer to
its final shape instead of copying parent state and then throwing much
of it away.

I will study the process creation, exec, pidfd/pidfs, and posix_spawn
code more carefully, then try the direction you suggested and benchmark
the mm/fd costs.

Regards,
Li
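
P.S. To make the posix_spawn-shaped interface discussed above concrete
for myself, here is a minimal userspace sketch of the per-call pipe
wiring expressed as file actions. It is only an illustration under
stated assumptions: spawn_capture is a hypothetical helper name of
mine, not part of the RFC or of any proposed kernel interface, and it
uses today's libc posix_spawnp, which as far as I know is still built
on a clone/vfork-style path in glibc. So it shows the interface shape
only, not the mm/fd savings being discussed.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <spawn.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper (not from the RFC): run argv[0] via PATH lookup and
 * capture up to cap - 1 bytes of its stdout into buf (NUL-terminated).
 * Returns 0 on success or an errno-style error number on failure.
 */
int spawn_capture(char *const argv[], char *buf, size_t cap, int *status)
{
    posix_spawn_file_actions_t fa;
    int out[2];
    pid_t pid;
    size_t len = 0;
    int err;

    if (cap == 0)
        return EINVAL;

    /* O_CLOEXEC keeps both pipe ends out of the child unless dup2'ed. */
    if (pipe2(out, O_CLOEXEC) < 0)
        return errno;

    err = posix_spawn_file_actions_init(&fa);
    if (err) {
        close(out[0]);
        close(out[1]);
        return err;
    }

    /* File action: the child's stdout becomes the pipe's write end. */
    err = posix_spawn_file_actions_adddup2(&fa, out[1], STDOUT_FILENO);
    if (!err)
        err = posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    close(out[1]); /* parent must drop the write end to ever see EOF */
    if (err) {
        close(out[0]);
        return err;
    }

    /* Drain stdout until EOF; keep what fits and discard the rest. */
    for (;;) {
        char tmp[4096];
        ssize_t n = read(out[0], tmp, sizeof(tmp));

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            break;

        size_t room = cap - 1 - len;
        size_t take = (size_t)n < room ? (size_t)n : room;
        memcpy(buf + len, tmp, take);
        len += take;
    }
    buf[len] = '\0';
    close(out[0]);

    /* Reap the child so it does not linger as a zombie. */
    while (waitpid(pid, status, 0) < 0) {
        if (errno != EINTR)
            return errno;
    }
    return 0;
}
```

A caller would pass something like { "rg", "pattern", NULL } as argv
and read the result from buf. Note that every parent fd without
O_CLOEXEC is still inherited by the child here, which is exactly the
fd-copy problem a close_range-style file action would address.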

