On Fri, 2023-06-23 at 16:34 -0600, Rob Herring wrote:
> > 
> > Either way, the old patchset will give you a good idea about how it all
> > works, the changes are mostly in the details. I am happy to push out a
> > new version sooner rather than later if it might help with any efforts
> > on your side.
> 
> From a quick scan, it looks like there's some cleanups in the series
> which would be helpful without seccomp parts. One of the initial
> issues I've found is UML using older ptrace interfaces which arm64
> doesn't implement. PTRACE_GETREGS for example.
> 

I don't think that completely gets rid of PTRACE_GETREGS though, and if
I remember correctly, we really kind of need that there?

Though then again it's all been a while, and I only faulted the seccomp
mode back in during discussions with Benjamin. Looks like we've found a
potentially nicer way to make it secure than his secret-based approach,
and in fact in a way that should even make it SMP-safe, at least in
theory. Obviously a lot of infrastructure is missing to make it SMP in
the first place.

Currently, UML has a host process per VMA. Obviously, to get SMP you
need multiple host processes, i.e. one per (used) CPU per VMA, with
CLONE_VM.

The problem with the secret-based approach here for SMP is that the
secret will be readable to the other threads running in the VMA (1),
which can then use it to circumvent the protection by jumping into the
stub area and calling host syscalls, see

https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-28-benja...@sipsolutions.net/

and

https://patchwork.ozlabs.org/project/linux-um/patch/20221122100759.208290-24-benja...@sipsolutions.net/

Now the new idea we came up with is this: We can create the per-CPU
processes for a VMA with CLONE_VM but *not* CLONE_FILES. Then, in the
stub, when we need to execute some real host syscalls on behalf of the
child, we

 * send the FD over in a message
 * use the FD for mmap (and also always use mmap instead of mprotect)
 * close the FD

Without CLONE_FILES, another thread cannot "steal" the (real, host) FD,
it's useless in the other thread.

Note that I'm talking about host FDs here; in-UML FDs are just numbers
and it all works differently. I'm just talking about executing host
syscalls inside the VMA, to set up the VMA correctly etc.

The BPF program now allows any of the relevant syscalls (2) inside the
stub area, but since you don't have the FD unless you actually executed
the recvmsg() call at the beginning of the stub, you can't do anything
with that by jumping into the stub.
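
To make the three stub steps above a bit more concrete, here is a
minimal userspace sketch of the FD handoff. It is not the UML stub code:
it uses fork() instead of clone(CLONE_VM) without CLONE_FILES (the FD
tables are separate in both cases), a memfd as a stand-in for the
backing file, and the helper names send_fd(), recv_fd() and
demo_fd_handoff() are made up for illustration. There is also no seccomp
filter involved here at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one FD (see cmsg(3)). */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: pass one FD as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one FD, returns it or -1 on error. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}

/* Returns 0 if the child could map and read the data via the passed FD. */
int demo_fd_handoff(void)
{
	static const char text[] = "hello from the host";
	int sv[2], memfd, status, rc;
	pid_t pid;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	memfd = memfd_create("backing", 0);
	if (memfd < 0 || ftruncate(memfd, 4096) < 0)
		return -1;
	if (pwrite(memfd, text, sizeof(text), 0) != (ssize_t)sizeof(text))
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Stand-in for the stub: forget the inherited copy first. */
		close(memfd);
		close(sv[0]);
		int fd = recv_fd(sv[1]);		/* 1. FD arrives in a message */
		if (fd < 0)
			_exit(1);
		char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0); /* 2. */
		close(fd);				/* 3. FD is gone again */
		if (p == MAP_FAILED)
			_exit(2);
		/* The mapping is still usable after the FD was closed. */
		_exit(memcmp(p, text, sizeof(text)) == 0 ? 0 : 3);
	}

	close(sv[1]);
	rc = send_fd(sv[0], memfd);
	close(sv[0]);
	close(memfd);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (rc == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling demo_fd_handoff() from a main() should return 0 on a reasonably
recent Linux. The point of the sketch is only the ordering: the FD
exists in the receiving process just between recvmsg() and close(), and
the mapping stays valid afterwards.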

There are some other details involved such as having to split the stub
data into a read-only "what to execute" (3) page and a writeable
"results" page, but those are reasonably easy to deal with.

At the end of the stub, of course the FD must be closed. This is fine
though since mappings persist after the FD was closed. On the next page
fault we have to do it all again, but yeah, page faults were always
super expensive in UML ...

johannes

(1) unless you extract it directly out of the BPF program into
    registers via some magic syscalls that get a BPF return, but that's
    kind of icky too

(2) mmap & munmap are the most relevant ones, some others but they're
    not that critical for security; notably mprotect is not allowed but
    must be done with mmap instead

(3) so another thread can't overwrite the instructions of what the
    kernel wants to run inside the VMA process while it's happening

_______________________________________________
linux-um mailing list
linux-um@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-um