On Tue, 18 Mar 2025 at 13:55, Daniel P. Berrangé <berra...@redhat.com> wrote:
>
> On Tue, Mar 18, 2025 at 01:06:17PM +0000, Peter Maydell wrote:
> > The difficulty with vfork() (and, more generally, with various of
> > the clone() syscall flag combinations) is that because we use the
> > host libc we are restricted to the thread/process creation options
> > that that libc permits: which is only fork() and pthread_create().
> > vfork() wants "create a new process like fork with its own file
> > descriptors, signal handlers, etc, but share all the memory space with
> > the parent", and the host libc just doesn't provide us with the tools
> > to do that. (We can't call the host vfork() because we wouldn't be
> > abiding by the rules it imposes, like "don't return from the function
> > that called vfork".)
> >
> > If we were implemented as a usermode emulator that sat on the raw
> > kernel syscalls, we could directly call the clone syscall and
> > use that to provide at least a wider range of the possible clone
> > flag options; but our dependency on libc means we have to avoid
> > doing things that would confuse it.
>
> I guess I'm not seeing how libc is blocking us in this respect ?
> The clone() syscall wrapper is exposed by glibc at least, and it
> is possible to call it, albeit with some caveats that we might
> miss any logic glibc has around its fork() wrapper. The spec
> requires that any child must immediately call execve after vfork
> so I'm wondering just what risk of confusion we would have in
> practice ?

I think my notes about clone are a red herring for vfork
specifically. The vfork spec permits only a very minimal amount
of work in the child, but QEMU's own TCG data structures and
internal processing mean that we will be doing a lot more there
than the guest does. For instance,
we need to return from the function that called vfork, so we
can continue to execute the guest code. And the guest code will
likely call into the translator to generate more code, which will
(a) mess up the TCG data structures for the parent and (b)
probably result in our calling into libc functions that aren't
OK to call.
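
For illustration, the guest-side pattern that the vfork rules are
designed around looks roughly like the following. This is a minimal
sketch, not code taken from glibc or QEMU; spawn_with_vfork is a
hypothetical name. The child does nothing except exec or _exit, which
is exactly the guarantee an emulator cannot keep when it has to run
its own translator in between.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a program the way a vfork()-based spawn would.
 * Returns the wait status of the child, or -1 on failure. */
static int spawn_with_vfork(const char *path, char *const argv[],
                            char *const envp[])
{
    int status;
    pid_t pid = vfork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: it is borrowing the parent's memory and stack, so it
         * may only exec or _exit. It must not return from this
         * function, call exit(), or use stdio or malloc. */
        execve(path, argv, envp);
        _exit(127); /* only reached if execve failed */
    }

    /* Parent: resumes only once the child has exec'd or exited. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return status;
}
```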

More generally, AIUI glibc expects that it has control over what's
happening with threads, so it can set up its own data structures
for the new thread (e.g. for TLS variables). This email from the
glibc mailing list is admittedly now two decades old
https://public-inbox.org/libc-alpha/200408042007.i74k7zor025...@magilla.sf.frob.com/
but it says:

# Basically, if you want to call libc functions you should do it from a
# thread that was set up by libc or libpthread.  i.e., if you make your own
# threads with clone, only call libc functions from the initial thread.

> > For vfork in particular, we could I guess do something like:
> >  * use real fork() to create child process
> >  * parent process arranges to wait until child process exits
> >    (via waitpid or equivalent) or it tells us it's about to exec
> >  * we make all the guest memory be mapped read-only in the child
> >    process, so we can trap writes and tell the parent about them
> >    so it can update its copy of the memory.
> >    (Sadly, since we can't be guaranteed to get control on termination
> >    events for the child before it really terminates, we can't
> >    do this memory-transfer in bulk at the end; otherwise we'd
> >    behave wrongly for the "child process gets SIGKILLed" case.)
>
> That would get the synchronization behaviour of Linux vfork,
> but I'm not sure it'd get the performance benefits (of avoiding
> page table copying) which is what Andreas mentioned as the
> desired thing ?

The problem is that the guest glibc is using CLONE_VFORK in
a particular way for performance reasons on real hardware,
which is valid for real kernel CLONE_VFORK but which our
lack of accuracy in emulation means we mishandle, causing the
guest to fall over. The actual performance under QEMU isn't
important.
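
The first two steps of the fork-based scheme quoted above (real fork,
parent blocked until the child exits or execs) can be sketched with a
close-on-exec pipe. This is only an illustration under stated
assumptions, not QEMU code: fork_and_wait_for_exec_or_exit is a
hypothetical name, and the read-only mapping and write-trapping step
is not covered at all.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h> /* for the caller's later waitpid() */
#include <unistd.h>

/* Hypothetical helper: fork, then block the parent until the child has
 * either exec'd or terminated (including by SIGKILL).
 * Returns the child's pid in the parent, 0 in the child, -1 on error. */
static pid_t fork_and_wait_for_exec_or_exit(void)
{
    int fds[2];
    char c;
    ssize_t n;
    pid_t pid;

    if (pipe(fds) < 0) {
        return -1;
    }
    /* Close-on-exec makes the kernel drop the child's write end on a
     * successful exec. In a multithreaded program an atomic variant
     * such as pipe2() would be preferable. */
    if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) < 0 ||
        fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: keep the write end open on purpose. It goes away on
         * exec or on any kind of process termination. */
        close(fds[0]);
        return 0;
    }

    /* Parent: read() returns 0 (EOF) once no write end remains open. */
    close(fds[1]);
    do {
        n = read(fds[0], &c, 1);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);
    return pid;
}
```

The parent still has to reap the child with waitpid() afterwards. Note
that this gives only the synchronization; the child works on a
copy-on-write copy of memory, so its writes are invisible to the
parent unless they are forwarded separately.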

thanks
-- PMM
