Hi Benjamin,

On 5/28/24 6:30 PM, Benjamin Berg wrote:
> On Tue, 2024-05-28 at 18:16 +0800, Tiwei Bie wrote:
>> On 5/28/24 4:54 PM, benja...@sipsolutions.net wrote:
>>> From: Benjamin Berg <benjamin.b...@intel.com>
>>>
>>> Newer glibc versions are enabling rseq support by default. This remains
>>> enabled in the cloned child process, potentially causing the host kernel
>>> to write/read memory in the child.
>>>
>>> It appears that this was purely not an issue because the used memory
>>> area happened to be above TASK_SIZE and remains mapped.
>>
>> I also encountered this issue. In my case, with "Force a static link"
>> (CONFIG_STATIC_LINK) enabled, UML will crash immediately every time
>> it starts up. I worked around this by setting the glibc.pthread.rseq
>> tunable via GLIBC_TUNABLES [1] before launching UML.
>>
>> So another easy way to work around this issue without introducing runtime
>> overhead might be to add the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment
>> variable and exec /proc/self/exe in UML on startup.
>
> I am not really worried about the overhead, but I agree that setting
> GLIBC_TUNABLES is also a reasonable solution to the problem.
>
> Doing the memfd/execveat dance with an embedded static binary would
> still be best in my view, but either this or GLIBC_TUNABLES seem fine
> in the meantime.
>
> Do you want to submit the patch? Should I re-roll the patchset with
> GLIBC_TUNABLES?

Thanks for asking! :) I don't have such a patch at the moment. I worked
around this issue using a script. I saw that you are already doing
execve("/proc/self/exe", ...) in PATCH 4/5. Please feel free to re-roll
the patchset with GLIBC_TUNABLES if you would like to choose this solution.

Regards,
Tiwei

>
> Benjamin
>
>> [1] https://www.gnu.org/software/libc/manual/html_node/Tunables.html
>>
>> Regards,
>> Tiwei
>>
>>>
>>> Note that a better approach would be to exec a small static binary that
>>> does not link with other libraries. Using a memfd and execveat the
>>> binary could be embedded into UML itself and it would result in an
>>> entirely clean execution environment for userspace.
>>>
>>> Signed-off-by: Benjamin Berg <benjamin.b...@intel.com>
>>> ---
>>>  arch/um/os-Linux/skas/process.c | 54 ++++++++++++++++++++++++++++++---
>>>  1 file changed, 50 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
>>> index 41a288dcfc34..ee332a2aeea6 100644
>>> --- a/arch/um/os-Linux/skas/process.c
>>> +++ b/arch/um/os-Linux/skas/process.c
>>> @@ -255,6 +255,31 @@ static int userspace_tramp(void *stack)
>>>  int userspace_pid[NR_CPUS];
>>>  int kill_userspace_mm[NR_CPUS];
>>>
>>> +struct tramp_data {
>>> +	int pid;
>>> +	void *clone_sp;
>>> +	void *stack;
>>> +};
>>> +
>>> +static int userspace_tramp_clone_vm(void *data)
>>> +{
>>> +	struct tramp_data *tramp_data = data;
>>> +
>>> +	/*
>>> +	 * This helper exists to do a double-clone. First with CLONE_VM which
>>> +	 * effectively disables things like rseq, and then the second one to
>>> +	 * get a new memory space.
>>> +	 */
>>> +
>>> +	tramp_data->pid = clone(userspace_tramp, tramp_data->clone_sp,
>>> +				CLONE_PARENT | CLONE_FILES | SIGCHLD,
>>> +				tramp_data->stack);
>>> +	if (tramp_data->pid < 0)
>>> +		tramp_data->pid = -errno;
>>> +
>>> +	exit(0);
>>> +}
>>> +
>>>  /**
>>>   * start_userspace() - prepare a new userspace process
>>>   * @stub_stack: pointer to the stub stack.
>>> @@ -268,9 +293,10 @@ int kill_userspace_mm[NR_CPUS];
>>>   */
>>>  int start_userspace(unsigned long stub_stack)
>>>  {
>>> +	struct tramp_data tramp_data;
>>>  	void *stack;
>>>  	unsigned long sp;
>>> -	int pid, status, n, flags, err;
>>> +	int pid, status, n, err;
>>>
>>>  	/* setup a temporary stack page */
>>>  	stack = mmap(NULL, UM_KERN_PAGE_SIZE,
>>> @@ -286,10 +312,13 @@ int start_userspace(unsigned long stub_stack)
>>>  	/* set stack pointer to the end of the stack page, so it can grow downwards */
>>>  	sp = (unsigned long)stack + UM_KERN_PAGE_SIZE;
>>>
>>> -	flags = CLONE_FILES | SIGCHLD;
>>> +	tramp_data.stack = (void *) stub_stack;
>>> +	tramp_data.clone_sp = (void *) sp;
>>> +	tramp_data.pid = -EINVAL;
>>>
>>>  	/* clone into new userspace process */
>>> -	pid = clone(userspace_tramp, (void *) sp, flags, (void *) stub_stack);
>>> +	pid = clone(userspace_tramp_clone_vm, (void *) sp,
>>> +		    CLONE_VM | CLONE_FILES | SIGCHLD, &tramp_data);
>>>  	if (pid < 0) {
>>>  		err = -errno;
>>>  		printk(UM_KERN_ERR "%s : clone failed, errno = %d\n",
>>> @@ -305,7 +334,24 @@ int start_userspace(unsigned long stub_stack)
>>>  			       __func__, errno);
>>>  			goto out_kill;
>>>  		}
>>> -	} while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
>>> +	} while (!WIFEXITED(status));
>>> +
>>> +	pid = tramp_data.pid;
>>> +	if (pid < 0) {
>>> +		printk(UM_KERN_ERR "%s : second clone failed, errno = %d\n",
>>> +		       __func__, -pid);
>>> +		return pid;
>>> +	}
>>> +
>>> +	do {
>>> +		CATCH_EINTR(n = waitpid(pid, &status, WUNTRACED | __WALL));
>>> +		if (n < 0) {
>>> +			err = -errno;
>>> +			printk(UM_KERN_ERR "%s : wait failed, errno = %d\n",
>>> +			       __func__, errno);
>>> +			goto out_kill;
>>> +		}
>>> +	} while (WIFSTOPPED(status) && (WSTOPSIG(status) == SIGALRM));
>>>
>>>  	if (!WIFSTOPPED(status) || (WSTOPSIG(status) != SIGSTOP)) {
>>>  		err = -EINVAL;
>>
>>
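
For reference, a rough sketch of the GLIBC_TUNABLES idea discussed above
(set glibc.pthread.rseq=0 and exec /proc/self/exe on startup). This is not
taken from the patchset: the helper names tunables_disable_rseq(),
build_tunables() and reexec_without_rseq() are hypothetical, the buffer size
is arbitrary, and the substring check is a simplification.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RSEQ_TUNABLE "glibc.pthread.rseq=0"

/* Return 1 if the tunables string already disables rseq (simple substring check). */
static int tunables_disable_rseq(const char *tunables)
{
	return tunables != NULL && strstr(tunables, RSEQ_TUNABLE) != NULL;
}

/*
 * Build the new GLIBC_TUNABLES value in buf, appending to any existing
 * settings (glibc separates tunables with ':').
 * Returns 0 on success, -1 if buf is too small.
 */
static int build_tunables(const char *old, char *buf, size_t len)
{
	int n;

	if (old != NULL && old[0] != '\0')
		n = snprintf(buf, len, "%s:%s", old, RSEQ_TUNABLE);
	else
		n = snprintf(buf, len, "%s", RSEQ_TUNABLE);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical startup helper: if rseq is not yet disabled via
 * GLIBC_TUNABLES, set the tunable and re-exec the running binary.
 * Returns 0 if rseq is already disabled, -1 on error, and does not
 * return if the exec succeeds.
 */
static int reexec_without_rseq(char **argv)
{
	char buf[4096];
	const char *old = getenv("GLIBC_TUNABLES");

	if (tunables_disable_rseq(old))
		return 0;
	if (build_tunables(old, buf, sizeof(buf)) < 0)
		return -1;
	if (setenv("GLIBC_TUNABLES", buf, 1) < 0)
		return -1;

	/* execv() passes the (now updated) environment to the new image. */
	execv("/proc/self/exe", argv);
	return -1;
}
```

The intended use would be to call reexec_without_rseq(argv) early in main(),
before any threads or clones are created. Since the check happens first, the
re-executed process sees the tunable and continues without a second exec.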