Hello Benjamin,

On Sat, 28 Jun 2025 00:02:05 +0900,
Benjamin Berg wrote:
>
> Hi,
>
> On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > thanks for the comment on the complicated part of the kernel (signal).
>
> This stuff isn't simple.
>
> Actually, I am starting to think that the current MMU UML kernel also
> needs a redesign with regard to signal handling and stack use in that
> case. My current impression is that the design right now only permits
> voluntarily scheduling. More specifically, scheduling in response to an
> interrupt is impossible.
>
> I suppose that works fine, but it also does not seem quite right.

Thanks for the info. It's very useful for understanding what's going on.

(snip)

> > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > +{
> > > > +	mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > +}
> > > > +
> >
> > This is a bit scary code which I tried to handle when SIGSEGV is
> > raised by host for a userspace program running on UML (nommu).
> >
> > # and I should remember my XXX tag is important to fix....
> >
> > let me try to explain what happens and what I tried to solve.
> >
> > The SEGV signal from userspace program is delivered to userspace but
> > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > it will restart from $rip and raise SIGSEGV again.
> >
> > # so, yes, we've already relied on host and um's rt_sigreturn to
> >   restore various things.
> >
> > when a uml userspace crashes with SIGSEGV,
> >
> > - host kernel raises SIGSEGV (at original $rip)
> > - caught by uml process (hard_handler)
> > - raise a signal to uml userspace process (segv_handler)
> > - handler ends (hard_handler)
> > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> >   not (host) rt_sigaction)
> > - return back to the original $rip
> > - (back to top)
> >
> > this is the case where endless loop is happened.
> > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > and the my original attempt (__userspace_relay_signal) is what I tried.
> >
> > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > I'm trying to introduce another routine to jump into userspace and
> > call (um) rt_sigreturn after (host) rt_sigreturn.
> >
> > > And this is really confusing me. The way I am reading it, the code
> > > tries to do:
> > >  1. Rewrite RIP to jump to __userspace_relay_signal
> > >  2. Trigger a getpid syscall (to do "nothing"?)
> > >  3. Let do_syscall_64 fire the signal from interrupt_end
> >
> > correct.
> >
> > > However, then that really confuses me, because:
> > >  * If I am reading it correctly, then this approach will destroy the
> > >    contents of various registers (RIP, RAX and likely more)
> > >  * This would result in an incorrect mcontext in the userspace signal
> > >    handler (which could be relevant if userspace is inspecting it)
> > >  * However, worst, rt_sigreturn will eventually jump back
> > >    into __userspace_relay_signal, which has nothing to return to.
> > >  * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > >    is userspace interrupted immediately in that case?
> >
> > relay_signal shares the same goal of this, indeed.
> > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > I guess.
>
> Well, endless signals only exist as long as you exit to the same
> location. My suggestion was to read the user state from the mcontext
> (as SECCOMP mode does it) and executing the signal right away, i.e.:

Thanks too; below is my understanding.

>  * Fetch the current registers from the mcontext

I guess this is already done in sig_handler_common().

>  * Push the signal context onto the userspace stack

(guess) this is already done in handle_signal() =>
setup_signal_stack_si().

>  * Modify the host mcontext to set registers for the signal handler

This is the part I don't understand well.

- Do you mean the host handler when you say "for the signal handler"
  or the userspace handler?
- If the former (the host one), the mcontext is probably already
  there, so it might not be the one you mentioned.
- If the latter, how does the original handler (the host one,
  hard_handler()) work? Even if we can call the userspace handler
  instead of the host one, we still need to call the host handler
  (and restorer). Do we call both?
- And by "to set registers", which registers do you mean? The
  registers inspected by the userspace signal handler? But if you set
  a host register, for instance RIP, to the fault location, won't it
  return to that RIP after the handler and raise the fault again?

>  * Jump back to userspace by doing a "return"

This is also still unclear to me.

It would be very helpful if you could point to the code (in the
uml/next tree) showing how SECCOMP mode does this. I'm also looking at
it, but it is really hard to map what you described onto the code
(sorry).

All of the above runs within hard_handler() in nommu mode on SIGSEGV.
My best guess is that this is different from what ptrace/seccomp do.

> Said differently, I really prefer deferring as much logic as possible
> to the host. This is both safer and easier to understand. Plus, it also
> has the advantage of making it simpler to port UML to other
> architectures.

Okay.

>
> > > Honestly, I really think we should take a step back and swap the
> > > current syscall entry/exit code. That would likely also simplify
> > > floating point register handling, which I think is currently
> > > insufficient to deal with the odd special cases caused by different
> > > x86_64 hardware extensions.
> > >
> > > Basically, I think nommu mode should use the same general approach as
> > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > userspace and let the host kernel deal with the ugly details of how to
> > > do that.
> >
> > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> >
> > In nommu mode, we don't have external process to catch signals so, the
> > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > programs. While mmu mode calls segv_handler not in a context of
> > signal handler.
> >
> > # correct me if I'm wrong.
> >
> > thus, mmu mode doesn't have this situation.
>
> Yes, it does not have this specific issue. But see the top of the mail
> for other issues that are somewhat related.
>
> > I'm attempting various ways; calling um's rt_sigreturn instead of
> > host's one, which doesn't work as host restore procedures (unblocking
> > masked signals, restoring register states, etc) aren't called.
> >
> > I'll update here if I found a good direction, but would be great if
> > you see how it should be handled.
>
> Can we please discuss possible solutions? We can figure out the details
> once it is clear how the interaction with the host should work.

I wanted to let you know that I'm working on it. Your comments are
always helpful to me. Thanks.

-- Hajime

> I still think that the idea of using the kernel task stack as the
> signal stack is really elegant. Actually, doing that in normal UML may
> be how we can fix the issues mentioned at the top of my mail. And for
> nommu, we can also use the host mcontext to jump back into userspace
> using a simple "return".
>
> Conceptually it seems so simple.
>
> Benjamin
>
> >
> >
> > -- Hajime
> >
> > > I believe that this requires a second "userspace" sigaltstack in
> > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > the two (note that the "userspace" one is also used for IRQs if those
> > > happen while userspace is executing).
> > >
> > > So, in principle I would think something like:
> > >  * to jump into userspace, you would:
> > >     - block all signals
> > >     - set "userspace" sigaltstack
> > >     - setup mcontext for rt_sigreturn
> > >     - setup RSP for rt_sigreturn
> > >     - call rt_sigreturn syscall
> > >  * all signal handlers can (except pure IRQs):
> > >     - check on which stack they are
> > >       -> easy to detect whether we are in kernel mode
> > >     - for IRQs one can probably handle them directly (and return)
> > >     - in user mode:
> > >        + store mcontext location and information needed for rt_sigreturn
> > >        + jump back into kernel task stack
> > >  * kernel task handler to continue would:
> > >     - set sigaltstack to IRQ stack
> > >     - fetch register from mcontext
> > >     - unblock all signals
> > >     - handle syscall/signal in whatever way needed
> > >
> > > Now that I wrote about it, I am thinking that it might be possible to
> > > just use the kernel task stack for the signal stack. One would probably
> > > need to increase the kernel stack size a bit, but it would also mean
> > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > would remain the same.
> > >
> > > Thoughts?
> > >
> > > Benjamin
> > >
> > > > [SNIP]
> > >
> > >
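
A note on the "check on which stack they are" step in the quoted list
above: the following is only a minimal userspace sketch, not UML code,
of how a handler might tell whether it is running on the alternate
signal stack by comparing the address of a local variable against the
stack range. The names probe_handler, probe_signal_stack, alt_base and
alt_size are made up for this illustration, and the 64 KiB stack size
is an arbitrary assumption.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names, for illustration only (not taken from UML). */
static char *alt_base;                       /* start of the alternate stack */
static size_t alt_size;                      /* its size in bytes */
static volatile sig_atomic_t ran_on_alt = -1;

static void probe_handler(int sig)
{
	char marker;                         /* lives on the current stack */
	uintptr_t sp = (uintptr_t)&marker;

	(void)sig;
	/* Range check: is the current frame inside the alternate stack? */
	ran_on_alt = (sp >= (uintptr_t)alt_base &&
		      sp < (uintptr_t)alt_base + alt_size);
}

/*
 * Install probe_handler for SIGUSR1, with or without SA_ONSTACK, raise
 * the signal, and report where the handler ran.
 * Returns 1 (alternate stack), 0 (normal stack) or -1 (setup error).
 */
int probe_signal_stack(int use_altstack)
{
	struct sigaction sa;
	stack_t ss;

	if (!alt_base) {
		alt_size = 64 * 1024;        /* assumed size, well above the minimum */
		alt_base = malloc(alt_size);
		if (!alt_base)
			return -1;
	}

	ss.ss_sp = alt_base;
	ss.ss_size = alt_size;
	ss.ss_flags = 0;
	if (sigaltstack(&ss, NULL) != 0)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = probe_handler;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = use_altstack ? SA_ONSTACK : 0;
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	ran_on_alt = -1;
	if (raise(SIGUSR1) != 0)             /* delivered before raise() returns */
		return -1;

	return ran_on_alt;
}
```

In a single-threaded program, probe_signal_stack(1) should report that
the handler ran on the alternate stack and probe_signal_stack(0) that
it did not. A similar range check against the "userspace" and "IRQ"
stacks might be one way to implement the detection described in the
list, though that is an assumption on my side.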