Hello,
On Mon, 28 Oct 2024 22:32:43 +0900,
Benjamin Berg wrote:

> > > > - a crash on userspace programs crashes a UML kernel, not signaling
> > > >   with SIGSEGV to the program.
> > > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > > >   a vma structure for our case, which updates the internal procedure
> > > >   of maple_tree subsystem. We're trying to fix issue but still a
> > > >   random process on exit(2) crashes.
> > >
> > > Btw. are you handling FP register save/restore? If it is not there, it
> > > probably would not be too hard to add (XSAVE, etc.), though it might
> > > add a bit of additional overhead. Especially as UML always saves the FP
> > > state rather than optimizing it like the x86 architectures.
> >
> > The patch handles fp register on entry/leave at syscall; [07/13] patch
> > contains this part.
>
> That looks like FS/GS registers which are for thread-local storage. I
> was talking about floating point registers. Maybe you meant another
> patch?

Oh, this is my terrible mistake...
No, the patch doesn't handle FP registers at all.

> > I'm not familiar with that but what kind of optimizations does x86
> > architecture do for fp register handling ?
>
> The kernel does not usually need the FP registers. So it optimizes the
> pretty common case of a userspace -> kernel -> userspace switch that
> happens for a syscall by simply not saving/restoring these registers at
> all.
>
> Obviously, it then still needs to do the work when the task is switched
> or in the rare case that the kernel wants to use floating point itself.

Thanks for the information.

> > > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > > requirement on patching userspace code also seems like a lot.
> > >
> > > To me, it seems much more natural to catch the userspace syscalls using
> > > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > > portable across architectures.
> > > For improved speed one could still do
> > > architecture specific things inside the vDSO or by using zpoline. But
> > > those would then "just" be optimizations and unpatched code would still
> > > work correctly (e.g. JIT).
> >
> > I'm not proposing this patch to replace existing UML implementations;
> > for instance, the patchset cannot run CONFIG_MMU code in the whole
> > kernel tree so, existing ptrace-based implementation still has real
> > usecase. and ptrace based syscall hook is not indeed fast and the
> > improvements with seccomp filter instead clearly has benefits. I
> > think it's independent to this patchset.
>
> Of course. nommu mode is a completely independent feature.
>
> I am still wondering a bit about the users for such a mode. It is not
> interesting for us as we use it for testing. Of course, speed is nice
> but it is not the primary objective.
>
> I understand that it can be an approach for a small "container", but
> then you would need a very strict SECCOMP filter for the kernel itself.

I didn't specifically describe the use cases in the v1 patch, but here
is the list I have in mind:

1) container-like use cases (the original work was proposed with this
   in mind),
2) testing nommu code in the kernel,
3) faster I/O workloads that issue a bunch of syscalls over UML.

I think this list covers most of the reasons to have a !MMU mode
alongside the current MMU-full UML.

Speed may indeed not be the primary objective, but if you have dozens
of test cases that issue a bunch of syscalls (which I think is a
plausible case), this might be helpful.

(snip)

> > > For me, a big argument in favour of such an approach is its simplicity.
> > > I am mostly basing that on the fact that this patchset should properly
> > > handle other signals like SIGFPE and SIGSEGV.
> > > And, once it does that,
> > > you will already have all the infrastructure to do the correct register
> > > save/restore using the host mcontex, which is what is needed in the
> > > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > > as it just needs to catch all syscalls within valid userspace
> > > executable memory[2] ranges.
> >
> > I agree with your observation that the approach is simple.
> > I don't have a good idea on how to handle SIGSEGV, but will try to see
> > with your inputs.
>
> You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
> get/set state for SECCOMP" for getting the registers and also writing
> them back if you want to restore using rt_sigreturn.

Thanks. I'm still testing various attempts to deliver SEGV to
userspace, but no luck so far... I will get back to you once I come up
with something in good shape.

(snip)

> > > [2] I am assuming that userspace executable code is already confined to
> > > a certain address space within the UML process. Obviously, the kernel
> > > itself and loaded modules need to be free to do host syscalls and
> > > should not be affected by the SECCOMP filter.
> >
> > I think our !MMU UML doesn't break this assumption. But did you see
> > something to our patchset ?
>
> I also assume that is fine. One just needs to understand this when
> writing a SECCOMP filter for syscall emulation in nommu mode.

Okay, thanks for the clarification.

-- Hajime
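
For reference, the SECCOMP filter idea discussed in this thread (trap
syscalls issued from the userspace executable range, let the kernel and
its modules make host syscalls freely) could be sketched roughly as
below. This is only an illustration, not code from either patchset: the
function names and FILTER_LEN are hypothetical, and the sketch assumes a
little-endian 64-bit host and a userspace range that does not cross a
4 GiB boundary, so one compare on the upper 32 bits of the instruction
pointer is enough. A real filter would also check seccomp_data.arch.

```c
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/prctl.h>

#define FILTER_LEN 7

/* Assumption: little-endian host, so the low word of the 64-bit
 * instruction pointer comes first in struct seccomp_data. */
#define IP_LO_OFF (offsetof(struct seccomp_data, instruction_pointer))
#define IP_HI_OFF (offsetof(struct seccomp_data, instruction_pointer) + 4)

/*
 * Build a filter that returns SECCOMP_RET_TRAP (delivered as SIGSYS)
 * for syscalls whose instruction pointer lies in [start, end) and
 * SECCOMP_RET_ALLOW for everything else.
 * Returns 0 on success, -1 if the range is empty or crosses a 4 GiB
 * boundary (not handled by this simplified sketch).
 */
int build_user_range_filter(uint64_t start, uint64_t end,
                            struct sock_filter out[FILTER_LEN])
{
    if (start >= end)
        return -1;

    uint64_t last = end - 1;
    if ((start >> 32) != (last >> 32))
        return -1;

    uint32_t hi = (uint32_t)(start >> 32);
    uint32_t start_lo = (uint32_t)start;
    uint32_t last_lo = (uint32_t)last;

    const struct sock_filter prog[FILTER_LEN] = {
        /* 0: A = upper 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI_OFF),
        /* 1: different 4 GiB window -> jump to ALLOW (insn 6) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 4),
        /* 2: A = lower 32 bits of the instruction pointer */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO_OFF),
        /* 3: below start -> ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, start_lo, 0, 2),
        /* 4: above the last byte of the range -> ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, last_lo, 1, 0),
        /* 5: inside the userspace range: raise SIGSYS */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        /* 6: kernel/module code: let the host syscall through */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, prog, sizeof(prog));
    return 0;
}

/*
 * Install the filter on the calling thread. A SIGSYS handler that
 * emulates the syscall has to be registered before this is called.
 */
int install_filter(struct sock_filter *insns, unsigned short len)
{
    struct sock_fprog prog = { .len = len, .filter = insns };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

In the SIGSYS handler, the trapped syscall number and the register
state would then be read from the siginfo and the host mcontext, which
is where the register save/restore discussed above comes in.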