Hi Hajime,

On Mon, 2025-06-30 at 10:04 +0900, Hajime Tazaki wrote:
>
> Hello Benjamin,
>
> On Sat, 28 Jun 2025 00:02:05 +0900,
> Benjamin Berg wrote:
> >
> > Hi,
> >
> > On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > > thanks for the comment on the complicated part of the kernel (signal).
> >
> > This stuff isn't simple.
> >
> > Actually, I am starting to think that the current MMU UML kernel also
> > needs a redesign with regard to signal handling and stack use in that
> > case. My current impression is that the design right now only permits
> > voluntarily scheduling. More specifically, scheduling in response to an
> > interrupt is impossible.
> >
> > I suppose that works fine, but it also does not seem quite right.
>
> thanks for the info. it's very useful to understand what's going on.
>
> (snip)
>
> > > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > > +{
> > > > > +	mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > > +}
> > > > > +
> > >
> > > This is a bit scary code which I tried to handle when SIGSEGV is
> > > raised by host for a userspace program running on UML (nommu).
> > >
> > > # and I should remember my XXX tag is important to fix....
> > >
> > > let me try to explain what happens and what I tried to solve.
> > >
> > > The SEGV signal from userspace program is delivered to userspace but
> > > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > > it will restart from $rip and raise SIGSEGV again.
> > >
> > > # so, yes, we've already relied on host and um's rt_sigreturn to
> > > restore various things.
> > >
> > > when a uml userspace crashes with SIGSEGV,
> > >
> > > - host kernel raises SIGSEGV (at original $rip)
> > > - caught by uml process (hard_handler)
> > > - raise a signal to uml userspace process (segv_handler)
> > > - handler ends (hard_handler)
> > > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > >   not (host) rt_sigaction)
> > > - return back to the original $rip
> > > - (back to top)
> > >
> > > this is the case where endless loop is happened.
> > > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > > and the my original attempt (__userspace_relay_signal) is what I tried.
> > >
> > > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > > I'm trying to introduce another routine to jump into userspace and
> > > call (um) rt_sigreturn after (host) rt_sigreturn.
> > >
> > > > And this is really confusing me. The way I am reading it, the code
> > > > tries to do:
> > > >  1. Rewrite RIP to jump to __userspace_relay_signal
> > > >  2. Trigger a getpid syscall (to do "nothing"?)
> > > >  3. Let do_syscall_64 fire the signal from interrupt_end
> > >
> > > correct.
> > >
> > > > However, then that really confuses me, because:
> > > >  * If I am reading it correctly, then this approach will destroy the
> > > >    contents of various registers (RIP, RAX and likely more)
> > > >  * This would result in an incorrect mcontext in the userspace signal
> > > >    handler (which could be relevant if userspace is inspecting it)
> > > >  * However, worst, rt_sigreturn will eventually jump back
> > > >    into __userspace_relay_signal, which has nothing to return to.
> > > >  * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > >    is userspace interrupted immediately in that case?
> > >
> > > relay_signal shares the same goal of this, indeed.
> > > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > > I guess.
> >
> > Well, endless signals only exist as long as you exit to the same
> > location. My suggestion was to read the user state from the mcontext
> > (as SECCOMP mode does it) and executing the signal right away, i.e.:
>
> thanks too; below is my understanding.
>
> >  * Fetch the current registers from the mcontext
>
> I guess this is already done in sig_handler_common().

Well, not really? It does seem to fetch the general purpose registers.
But the code pretty much assumes we will return to the same location and
only stores them on the stack for the signal handler itself.

Also, remember that it might be userspace or kernel space in your case.
The kernel task registers are in "switch_buf" while the userspace
registers are in "regs" of "struct task_struct" (effectively "struct
uml_pt_regs").

> >  * Push the signal context onto the userspace stack
>
> (guess) this is already done on handle_signal() => setup_signal_stack_si().
>
> >  * Modify the host mcontext to set registers for the signal handler
>
> this is something which I'm not well understanding.
> - do you mean the host handler when you say "for the signal handler" ?
>   or the userspace handler ?

Both in a way ;-)

I mean modify the registers in the host mcontext so that the UML
userspace will continue executing inside its signal handler.

> - if former (the host one), maybe mcontext is already there so, it
>   might not be the one you mentioned.
> - if the latter, how the original handler (the host one,
>   hard_handler()) works ? even if we can call userspace handler
>   instead of the host one, we need to call the host handler (and
>   restorer). do we call both ?
> - and by "to set registers", what register do you mean ? for the
>   registers inspected by userspace signal handler ? but if you set a
>   register, for instance RIP, as the fault location to the host
>   register, it will return to RIP after handler and restart the fault
>   again ?

I am confused, why would the fault handler be restarted? If you modify
RIP, then the host kernel will not return to the faulting location. You
were using that already to jump into __userspace_relay_signal.

All I am arguing is that instead of jumping to __userspace_relay_signal
you can prepare everything and directly jump into the user's signal
handler.

> >  * Jump back to userspace by doing a "return"
>
> this is still also unclear to me.
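
For readers following along, here is a minimal standalone sketch of the RIP rewrite discussed above, assuming x86_64 Linux with glibc. It is not UML code: landing(), redirect_after_fault() and resume_point are hypothetical names, and the longjmp() is only there so the demo can get back to its caller. The handler edits the host mcontext and returns normally; the host's rt_sigreturn then resumes at the new RIP instead of at the faulting store, so the fault is not retried.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>

/* REG_RIP/REG_RSP are only visible when _GNU_SOURCE is defined before the
 * first libc header; fall back to the x86_64 gregs[] indices otherwise. */
#ifndef REG_RIP
#define REG_RSP 15
#define REG_RIP 16
#endif

static jmp_buf resume_point;
static volatile sig_atomic_t landed;

/* Hypothetical stand-in for the place execution should continue. It has no
 * valid return address, so it must never return. */
static void landing(void)
{
    landed = 1;
    longjmp(resume_point, 1);
}

static void segv_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    greg_t sp = uc->uc_mcontext.gregs[REG_RSP];

    (void)sig;
    (void)info;

    sp -= 128;          /* stay clear of the interrupted code's red zone */
    sp &= ~(greg_t)15;  /* 16-byte alignment required by the ABI */
    sp -= 8;            /* look like a "call" pushed a return address */

    /* On return from this handler, rt_sigreturn loads these values. */
    uc->uc_mcontext.gregs[REG_RSP] = sp;
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)landing;
}

/* Returns 1 if execution continued in landing() after the fault,
 * 0 if it did not, -1 on setup failure. */
int redirect_after_fault(void)
{
    struct sigaction sa, old;
    void *page;

    /* An inaccessible page gives a guaranteed SIGSEGV on write. */
    page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(page, 4096);
        return -1;
    }

    landed = 0;
    if (setjmp(resume_point) == 0)
        *(volatile char *)page = 1;  /* faults; handler redirects RIP */

    sigaction(SIGSEGV, &old, NULL);
    munmap(page, 4096);
    return landed;
}
```

Calling redirect_after_fault() from a small main() is enough to try it. In the scheme discussed in this thread the new RIP would be the user's signal handler, with the signal frame already pushed, rather than a helper that longjmps away.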
>
> it would be very helpful if you point the location of the code (at
> uml/next tree) on how SECCOMP mode does. I'm also looking at but
> really hard to map what you described and the code (sorry).

"stub_signal_interrupt" simply returns, which means it jumps into the
restorer "stub_signal_restorer" which does the rt_sigreturn syscall.
This means the host kernel restores the userspace state from the
mcontext. As the mcontext resides in shared memory, the UML kernel can
update it to specify where the process should continue running (thread
switching, signals, syscall return value, …).

Benjamin

> all of above runs within hard_handler() in nommu mode on SIGSEGV.
> my best guess is this is different from what ptrace/seccomp do.
>
> > Said differently, I really prefer deferring as much logic as possible
> > to the host. This is both safer and easier to understand. Plus, it also
> > has the advantage of making it simpler to port UML to other
> > architectures.
>
> okay.
>
> >
> > > > Honestly, I really think we should take a step back and swap the
> > > > current syscall entry/exit code. That would likely also simplify
> > > > floating point register handling, which I think is currently
> > > > insufficient do deal with the odd special cases caused by different
> > > > x86_64 hardware extensions.
> > > >
> > > > Basically, I think nommu mode should use the same general approach as
> > > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > > userspace and let the host kernel deal with the ugly details of how to
> > > > do that.
> > >
> > > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> > >
> > > In nommu mode, we don't have external process to catch signals so, the
> > > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > > programs. While mmu mode calls segv_handler not in a context of
> > > signal handler.
> > >
> > > # correct me if I'm wrong.
> > >
> > > thus, mmu mode doesn't have this situation.
> >
> > Yes, it does not have this specific issue. But see the top of the mail
> > for other issues that are somewhat related.
> >
> > > I'm attempting various ways; calling um's rt_sigreturn instead of
> > > host's one, which doesn't work as host restore procedures (unblocking
> > > masked signals, restoring register states, etc) aren't called.
> > >
> > > I'll update here if I found a good direction, but would be great if
> > > you see how it should be handled.
> >
> > Can we please discuss possible solutions? We can figure out the details
> > once it is clear how the interaction with the host should work.
>
> I was wishing to update to you that I'm working on it. So, your
> comments are always helpful to me. Thanks.
>
> -- Hajime
>
> > I still think that the idea of using the kernel task stack as the
> > signal stack is really elegant. Actually, doing that in normal UML may
> > be how we can fix the issues mentioned at the top of my mail. And for
> > nommu, we can also use the host mcontext to jump back into userspace
> > using a simple "return".
> >
> > Conceptually it seems so simple.
> >
> > Benjamin
> >
> >
> > >
> > > -- Hajime
> > >
> > > > I believe that this requires a second "userspace" sigaltstack in
> > > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > > the two (note that the "userspace" one is also used for IRQs if those
> > > > happen while userspace is executing).
> > > >
> > > > So, in principle I would think something like:
> > > >  * to jump into userspace, you would:
> > > >     - block all signals
> > > >     - set "userspace" sigaltstack
> > > >     - setup mcontext for rt_sigreturn
> > > >     - setup RSP for rt_sigreturn
> > > >     - call rt_sigreturn syscall
> > > >  * all signal handlers can (except pure IRQs):
> > > >     - check on which stack they are
> > > >       -> easy to detect whether we are in kernel mode
> > > >     - for IRQs one can probably handle them directly (and return)
> > > >     - in user mode:
> > > >        + store mcontext location and information needed for rt_sigreturn
> > > >        + jump back into kernel task stack
> > > >  * kernel task handler to continue would:
> > > >     - set sigaltstack to IRQ stack
> > > >     - fetch register from mcontext
> > > >     - unblock all signals
> > > >     - handle syscall/signal in whatever way needed
> > > >
> > > > Now that I wrote about it, I am thinking that it might be possible to
> > > > just use the kernel task stack for the signal stack. One would probably
> > > > need to increase the kernel stack size a bit, but it would also mean
> > > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > > would remain the same.
> > > >
> > > > Thoughts?
> > > >
> > > > Benjamin
> > > >
> > > > > [SNIP]
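
The "check on which stack they are" step quoted above can be sketched in plain userspace C. This is only an illustration under assumptions, not UML code: irq_stack, user_stack and probe_stack() are hypothetical names, and the two buffers merely stand in for the "IRQ" and "userspace" sigaltstacks from the list. The handler runs with SA_ONSTACK and compares the address of one of its locals against the two ranges.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum which_stack { STACK_OTHER = 0, STACK_IRQ = 1, STACK_USER = 2 };

#define ALT_STACK_SIZE (64 * 1024)

/* Hypothetical stand-ins for the two alternate signal stacks. */
static char *irq_stack;
static char *user_stack;
static volatile sig_atomic_t seen_stack;

static int on_stack(const char *base, const void *addr)
{
    uintptr_t a = (uintptr_t)addr;
    uintptr_t b = (uintptr_t)base;

    return a >= b && a < b + ALT_STACK_SIZE;
}

static void handler(int sig)
{
    char marker;  /* lives on whichever stack the handler runs on */

    (void)sig;
    if (on_stack(user_stack, &marker))
        seen_stack = STACK_USER;   /* signal arrived in "user mode" */
    else if (on_stack(irq_stack, &marker))
        seen_stack = STACK_IRQ;    /* signal arrived in "kernel mode" */
    else
        seen_stack = STACK_OTHER;
}

static int use_alt_stack(char *base)
{
    stack_t ss;

    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = base;
    ss.ss_size = ALT_STACK_SIZE;
    return sigaltstack(&ss, NULL);
}

/* Select one of the two alternate stacks, deliver SIGUSR1 to ourselves,
 * and report which stack the handler found itself on (-1 on error). */
int probe_stack(int want_user)
{
    struct sigaction sa;

    if (!irq_stack)
        irq_stack = malloc(ALT_STACK_SIZE);
    if (!user_stack)
        user_stack = malloc(ALT_STACK_SIZE);
    if (!irq_stack || !user_stack)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;  /* run the handler on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    if (use_alt_stack(want_user ? user_stack : irq_stack) != 0)
        return -1;

    seen_stack = STACK_OTHER;
    raise(SIGUSR1);  /* the handler has run by the time raise() returns */
    return seen_stack;
}
```

Switching the registered sigaltstack before entering and after leaving "userspace" is what makes the stack address a usable mode indicator here; with a single stack the handler would need some other flag.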