On 27/11/2017 13:24, Peter Maydell wrote:
> On 27 November 2017 at 12:57, Adhemerval Zanella
> <adhemerval.zane...@linaro.org> wrote:
>> We found out this potential bogus assert on 2.27 development [1] which
>> resulted in two fixes [2][3].
>>
>> It should not be an issue for generic posix_spawn usage where there is
>> no expectation system/user/program kills random pids (since posix_spawn
>> auxiliary process has not yet returned). Some say the possible kind of
>> behaviour is rather undefined, but it shouldn't also trigger an assert.
>>
>> I am not really sure what is happening in qemu usermode because comment
>> #4 in the bug reports states clone is returning an error and it should
>> not trigger the assert in first place. What seems to be happening in
>> this scenario is clone is actually returning a success, but the auxiliary
>> process is being killed before actually calling execve.
>
> The bug report is a bit confused, but I think what is happening
> in the QEMU case is that QEMU implements clone(CLONE_VFORK) as having
> the same semantics as fork() (ie the parent will not autowait for
> the child, and the child does not share a memory map with the parent).
> (ie QEMU treats it as having the semantics of a vfork() call, which
> is allowed to be implemented as fork()).

Right, that explains what is happening.

> Previous versions of glibc's posix_spawn() could cope with this
> divergence from the kernel's native clone() behaviour, but the
> rewrite can't. It's not unreasonable for glibc() to rely on the
> kernel behaviour, but on the other hand it's not too surprising
> if this breaks non-kernel implementations of the syscall ABI
> like QEMU and the MS Linux subsystem, because it's a tricky
> corner case that previously nobody was trying to use.

The problem is that vfork is such a broken API [1] that even POSIX
marked it obsolescent and then removed it in the latest (2008) standard.
GLIBC's posix_spawn used it only in some specific cases (the old
POSIX_SPAWN_USEVFORK flag) because it was 'faster' than fork; however,
it also created its own set of bugs [2][3][4][5].

The current implementation is as fast as vfork on Linux, using what
should be platform-neutral clone flags and assumptions (in fact, we
found out that Linux does not work as expected with
clone (CLONE_VFORK | CLONE_VM) -> exit -> waitpid (WNOHANG), which
resulted in aa95a2414).

GLIBC also maintains another implementation at sysdeps/posix/spawni.c,
which should be more platform neutral since it relies only on the
semantics POSIX guarantees (the synchronization is done using a pipe2
instead of CLONE_VM, so a vfork acting as fork shouldn't be a problem).
It is not currently used by any architecture in GLIBC.

However, I am not very inclined to change the internal posix_spawn in
GLIBC on Linux, mainly because it uses slightly fewer resources than the
generic POSIX one (check e83be730910c) and it works as expected on the
Linux kernel.

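To make the pipe-based synchronization concrete, here is a minimal
sketch of the general technique. It is not the code in
sysdeps/posix/spawni.c: spawn_via_pipe and set_cloexec are hypothetical
names, and the sketch uses pipe() plus fcntl(FD_CLOEXEC) rather than
pipe2(O_CLOEXEC) only to avoid needing _GNU_SOURCE. The point is that
the error travels through a file descriptor, so it does not matter
whether parent and child share memory.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: mark a descriptor close-on-exec. */
static int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

/* Hypothetical helper, NOT the glibc implementation.
   Returns 0 and stores the child pid on success, or the child's errno
   if execv failed.  Works whether or not the child shares memory with
   the parent, because the error is sent over a pipe. */
int spawn_via_pipe(const char *path, char *const argv[], pid_t *pid_out)
{
    int fds[2];
    assert(pid_out != NULL);

    if (pipe(fds) != 0)
        return errno;
    if (set_cloexec(fds[0]) != 0 || set_cloexec(fds[1]) != 0) {
        int err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }

    pid_t pid = fork();
    if (pid < 0) {
        int err = errno;
        close(fds[0]);
        close(fds[1]);
        return err;
    }

    if (pid == 0) {
        /* Child: a successful execv closes fds[1] (close-on-exec),
           which the parent observes as end-of-file. */
        close(fds[0]);
        execv(path, argv);
        int err = errno;
        ssize_t w;
        do {
            w = write(fds[1], &err, sizeof err);
        } while (w < 0 && errno == EINTR);
        _exit(127);
    }

    /* Parent: block until the child either execs or reports an error. */
    close(fds[1]);
    int child_err = 0;
    ssize_t n;
    do {
        n = read(fds[0], &child_err, sizeof child_err);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n == (ssize_t) sizeof child_err) {
        /* execv failed: reap the child and return its errno. */
        while (waitpid(pid, NULL, 0) < 0 && errno == EINTR)
            ;
        return child_err;
    }

    /* End-of-file (or a very unlikely read error): treat as success. */
    *pid_out = pid;
    return 0;
}
```

Calling spawn_via_pipe("/nonexistent/spawn-demo", argv, &pid) should
return ENOENT even when fork() gives the child a private copy of memory,
which is the property that makes this approach tolerant of a vfork
implemented as fork.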
[1] https://ewontfix.com/7/
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=14750
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=14749
[4] https://sourceware.org/bugzilla/show_bug.cgi?id=14499
[5] https://sourceware.org/bugzilla/show_bug.cgi?id=10354

>
> Unfortunately I can't really think of a mechanism for implementing
> this in QEMU usermode, because the only tools we have available
> for creating new threads and processes are the ones the host libc
> gives us: so we can spawn new threads with pthread_create() and
> fork the process with fork(), but we don't have a safe way to
> create a new process which shares the memory map and where the
> new process can call the various libc functions which QEMU will
> do as it executes the guest code.

Current GLIBC no longer triggers the assert (the fix was also
backported to the 2.25 and 2.26 branches); however, I am not sure the
posix_spawn semantics will work for all the expected scenarios in qemu
user-mode. Most likely any failure (sched_set{param,scheduler}, setsid,
setpgid, seteuid, any file action, or execve itself) won't be reported
to the main process, since err defaults to 0 and the auxiliary process
signals an issue by writing to memory it expects to be shared. With
fork semantics that write lands in the child's private copy, so the
parent never sees it.

Also, I don't think trying to emulate "CLONE_VM | CLONE_VFORK" with
pthread_create will work as expected without actually synchronizing the
threads. If clone is actually called with CLONE_VFORK, I would expect
the underlying qemu usermode to block the caller thread (using a
condition variable or a barrier) and to resume it only on execve or
exit in the callee. I am not very versed in the qemu code, so I am not
sure how complex that would be.
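
The failure-reporting concern above can be checked from the caller's
side with a small probe. This is only a sketch: spawn_and_report is a
hypothetical helper name and the path used below is made up. On a
native Linux kernel with a current GLIBC I would expect posix_spawn to
return ENOENT for a missing binary. If the shared-memory write is lost
(as I suspect under qemu user-mode), the call would instead return 0
and the failure would only show up later as an exit status of 127.

```c
#include <assert.h>
#include <errno.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: returns the posix_spawn error code.  If the
   spawn is reported as successful, reaps the child and stores its exit
   status (or -1 if it did not exit normally). */
int spawn_and_report(const char *path, char *const argv[], int *exit_status)
{
    pid_t pid;
    int status = 0;

    assert(exit_status != NULL);

    /* No file actions and no spawn attributes: the plain case. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return err;         /* failure reported synchronously */

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return errno;
    }
    *exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return 0;
}
```

Running spawn_and_report("/nonexistent/spawn-demo", argv, &st) natively
and then under qemu user-mode, and comparing the return values, would
show whether exec failures are still reported to the parent.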