Hello Benjamin,

On Tue, 01 Jul 2025 21:03:36 +0900,
Benjamin Berg wrote:
> 
> Hi Hajime,
> 
> On Mon, 2025-06-30 at 10:04 +0900, Hajime Tazaki wrote:
> > 
> > Hello Benjamin,
> > 
> > On Sat, 28 Jun 2025 00:02:05 +0900,
> > Benjamin Berg wrote:
> > > 
> > > Hi,
> > > 
> > > On Fri, 2025-06-27 at 22:50 +0900, Hajime Tazaki wrote:
> > > > thanks for the comment on the complicated part of the kernel (signal).
> > > 
> > > This stuff isn't simple.
> > > 
> > > Actually, I am starting to think that the current MMU UML kernel also
> > > needs a redesign with regard to signal handling and stack use in that
> > > > case. My current impression is that the design right now only permits
> > > > voluntary scheduling. More specifically, scheduling in response to an
> > > interrupt is impossible.
> > > 
> > > I suppose that works fine, but it also does not seem quite right.
> > 
> > thanks for the info.  it's very useful to understand what's going on.
> > 
> > (snip)
> > 
> > > > > > +void set_mc_userspace_relay_signal(mcontext_t *mc)
> > > > > > +{
> > > > > > + mc->gregs[REG_RIP] = (unsigned long) __userspace_relay_signal;
> > > > > > +}
> > > > > > +
> > > > 
> > > > This is a bit scary code which I tried to handle when SIGSEGV is
> > > > raised by host for a userspace program running on UML (nommu).
> > > > 
> > > > # and I should remember my XXX tag is important to fix....
> > > > 
> > > > let me try to explain what happens and what I tried to solve.
> > > > 
> > > > The SEGV signal from userspace program is delivered to userspace but
> > > > if we don't fix the code raising the signal, after (um) rt_sigreturn,
> > > > it will restart from $rip and raise SIGSEGV again.
> > > > 
> > > > # so, yes, we've already relied on host and um's rt_sigreturn to
> > > >   restore various things.
> > > > 
> > > > when a uml userspace crashes with SIGSEGV,
> > > > 
> > > > - host kernel raises SIGSEGV (at original $rip)
> > > > - caught by uml process (hard_handler)
> > > > - raise a signal to uml userspace process (segv_handler)
> > > > - handler ends (hard_handler)
> > > > - (host) run restorer (rt_sigreturn, registered by (libc)sigaction,
> > > >   not (host) rt_sigaction)
> > > > - return back to the original $rip
> > > > - (back to top)
> > > > 
> > > > > this is the case where the endless loop happens.
> > > > > um's sa_handler isn't called as rt_sigreturn (um) isn't called.
> > > > > __userspace_relay_signal was my original attempt to solve this.
> > > > 
> > > > I agree that it is lazy to call a dummy syscall (indeed, getpid).
> > > > I'm trying to introduce another routine to jump into userspace and
> > > > call (um) rt_sigreturn after (host) rt_sigreturn.
> > > > 
> > > > > And this is really confusing me. The way I am reading it, the code
> > > > > tries to do:
> > > > >    1. Rewrite RIP to jump to __userspace_relay_signal
> > > > >    2. Trigger a getpid syscall (to do "nothing"?)
> > > > >    3. Let do_syscall_64 fire the signal from interrupt_end
> > > > 
> > > > correct.
> > > > 
> > > > > However, then that really confuses me, because:
> > > > >  * If I am reading it correctly, then this approach will destroy the
> > > > >    contents of various registers (RIP, RAX and likely more)
> > > > >  * This would result in an incorrect mcontext in the userspace signal
> > > > >    handler (which could be relevant if userspace is inspecting it)
> > > > > >  * However, worst of all, rt_sigreturn will eventually jump back
> > > > > >    into __userspace_relay_signal, which has nothing to return to.
> > > > >  * Also, relay_signal doesn't use this? What happens for a SIGFPE, how
> > > > >    is userspace interrupted immediately in that case?
> > > > 
> > > > relay_signal shares the same goal of this, indeed.
> > > > but the issue with `mc->gregs[REG_RIP]` (endless signals) still exists
> > > > I guess.
> > > 
> > > Well, endless signals only exist as long as you exit to the same
> > > location. My suggestion was to read the user state from the mcontext
> > > (as SECCOMP mode does it) and executing the signal right away, i.e.:
> > 
> > thanks too;  below is my understanding.
> > 
> > >  * Fetch the current registers from the mcontext
> > 
> > I guess this is already done in sig_handler_common().
> 
> Well, not really?
> 
> It does seem to fetch the general purpose registers. But the code
> pretty much assumes we will return to the same location and only stores
> them on the stack for the signal handler itself. Also, remember that it
> might be userspace or kernel space in your case. The kernel task
> registers are in "switch_buf" while the userspace registers are in
> "regs" of "struct task_struct" (effectively "struct uml_pt_regs").

indeed, the handler returns to the same location.
here is what the current patchset does for signal handling.

# sorry, I may be repeating myself, but I hope this helps us
  understand and discuss what it should be.

receive signal (from host)
-> call host sa_handler (hard_handler)
 -> sig_handler_common => get_regs_from_mc (fetch host mcontext to um)
 -> set TIF_SIGPENDING (um kernel)
 -> set host mcontext[RIP] to __userspace_relay_signal
(host sa_handler ends)
-> call host sa_restorer => return to mcontext[RIP]
 -> call __userspace_relay_signal from mcontext[RIP]
 -> call interrupt_end()
 -> do_signal => handle_signal => setup_signal_stack_si
     (because TIF_SIGPENDING is on above)
 -> call userspace sa_handler
 -> call userspace sa_restorer

instead of setting mcontext[RIP] to the userspace sa_handler, it uses
__userspace_relay_signal, which sets up the stack and mcontext (via
interrupt_end, setup_signal_stack_si, etc.) and then calls the userspace
sa_handler/restorer.

this way, the program runs the userspace sa_handler outside the host
sa_handler context.  I guess this means we don't have to set up the
host registers/mcontext with the userspace ones?

I agree that the current __userspace_relay_signal can be shrunk so that
it no longer calls __kernel_vsyscall and only deals with interrupt_end
and stack preparation.
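
To make the flow above concrete, here is a minimal standalone sketch of
the host-side mechanism only.  It is not the patchset code: relay_stub(),
host_handler() and demo_redirect_after_fault() are hypothetical names,
the fault comes from a PROT_NONE page, and the stub leaves via
siglongjmp() where the real __userspace_relay_signal would go on to
interrupt_end().  It assumes x86_64 Linux with glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes REG_RIP / REG_RSP in glibc */
#endif
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>

/* Fallback x86_64 glibc gregs[] indexes if the headers did not expose them. */
#ifndef REG_RIP
#define REG_RSP 15
#define REG_RIP 16
#endif

static sigjmp_buf resume_point;
static volatile sig_atomic_t fault_count;
static volatile sig_atomic_t relay_ran;

/* Stand-in for __userspace_relay_signal (hypothetical): it is entered by
 * a jump, not a call, so it has no return address and must never return. */
static void relay_stub(void)
{
    relay_ran = 1;
    siglongjmp(resume_point, 1);
}

/* Host sa_handler: only edits the saved mcontext.  The host restorer
 * (rt_sigreturn) then resumes at whatever RIP/RSP is stored there. */
static void host_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;
    uintptr_t sp = (uintptr_t)uc->uc_mcontext.gregs[REG_RSP];

    (void)sig;
    (void)info;
    fault_count++;
    /* Skip the red zone, align, and leave a fake return slot so the stub
     * sees the stack layout a normal call would have given it. */
    sp = ((sp - 128) & ~(uintptr_t)0xf) - 8;
    uc->uc_mcontext.gregs[REG_RSP] = (greg_t)sp;
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)relay_stub;
}

/* Returns 1 if the relay stub ran after exactly one fault, 0 otherwise. */
int demo_redirect_after_fault(void)
{
    struct sigaction sa, old;
    volatile char *page;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = host_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return 0;

    page = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == (volatile char *)MAP_FAILED) {
        sigaction(SIGSEGV, &old, NULL);
        return 0;
    }

    fault_count = 0;
    relay_ran = 0;
    if (sigsetjmp(resume_point, 0) == 0)
        page[0] = 1;    /* faults; control comes back through relay_stub() */

    munmap((void *)page, 4096);
    sigaction(SIGSEGV, &old, NULL);
    return relay_ran == 1 && fault_count == 1;
}
```

the point of the sketch is that the faulting store is never retried:
the host rt_sigreturn resumes in relay_stub() because the handler
rewrote RIP (and RSP) in the saved mcontext.  note that the stub cannot
simply return, which is the same concern raised about
__userspace_relay_signal having nothing to return to.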

> > >  * Push the signal context onto the userspace stack
> > 
> > (guess) this is already done on handle_signal() => setup_signal_stack_si().
> > 
> > >  * Modify the host mcontext to set registers for the signal handler
> > 
> > this is something which I'm not well understanding.
> > - do you mean the host handler when you say "for the signal handler" ?
> >   or the userspace handler ?
> 
> Both in a way ;-)
> 
> I mean modify the registers in the host mcontext so that the UML
> userspace will continue executing inside its signal handler.
>
> > - if former (the host one), maybe mcontext is already there so, it
> >   might not be the one you mentioned.
> > - if the latter, how the original handler (the host one,
> >   hard_handler()) works ? even if we can call userspace handler
> >   instead of the host one, we need to call the host handler (and
> >   restorer).  do we call both ?
> > - and by "to set registers", what register do you mean ? for the
> >   registers inspected by userspace signal handler ?  but if you set a
> >   register, for instance RIP, as the fault location to the host
> >   register, it will return to RIP after handler and restart the fault
> >   again ?
> 
> I am confused, why would the fault handler be restarted? If you modify
> RIP, then the host kernel will not return to the faulting location. You
> were using that already to jump into __userspace_relay_signal. All I am
> arguing that instead of jumping to __userspace_relay_signal you can
> prepare everything and directly jump into the users signal handler.

what I meant in that example is: if we set the host mcontext[RIP] to
the fault location (as the userspace information), it will lead to the
same fault again.  But that leaves RIP unchanged before and after, so I
guess it wasn't a good example.
Sorry for the confusion.

> > >  * Jump back to userspace by doing a "return"
> > 
> > this is still also unclear to me.
> > 
> > it would be very helpful if you could point me to the code (in the
> > uml/next tree) showing how SECCOMP mode does this.  I'm also looking
> > at it, but it is really hard to map what you described to the code (sorry).
> 
> "stub_signal_interrupt" simply returns, which means it jumps into the
> restorer "stub_signal_restorer" which does the rt_sigreturn syscall.
> This means the host kernel restores the userspace state from the
> mcontext. As the mcontext resides in shared memory, the UML kernel can
> update it to specify where the process should continue running (thread
> switching, signals, syscall return value, …).

thanks !

so, stub_signal_interrupt runs in a different host process.
nommu mode tries to reuse the existing host sa_handler (hard_handler) to
do the job (handle SEGV etc.).

If hard_handler and co. in nommu mode are missing something that
userspace_tramp does in seccomp mode, I'll try to update them.

-- Hajime

> 
> Benjamin
> 
> > all of above runs within hard_handler() in nommu mode on SIGSEGV.
> > my best guess is this is different from what ptrace/seccomp do.
> > 
> > > Said differently, I really prefer deferring as much logic as possible
> > > to the host. This is both safer and easier to understand. Plus, it also
> > > has the advantage of making it simpler to port UML to other
> > > architectures.
> > 
> > okay.
> > 
> > > 
> > > > > Honestly, I really think we should take a step back and swap the
> > > > > current syscall entry/exit code. That would likely also simplify
> > > > > floating point register handling, which I think is currently
> > > > > > insufficient to deal with the odd special cases caused by different
> > > > > x86_64 hardware extensions.
> > > > > 
> > > > > Basically, I think nommu mode should use the same general approach as
> > > > > the current SECCOMP mode. Which is to use rt_sigreturn to jump into
> > > > > userspace and let the host kernel deal with the ugly details of how to
> > > > > do that.
> > > > 
> > > > I looked at how MMU mode (ptrace/seccomp) does handle this case.
> > > > 
> > > > In nommu mode, we don't have external process to catch signals so, the
> > > > nommu mode uses hard_handler() to catch SEGV/FPE of userspace
> > > > programs.  While mmu mode calls segv_handler not in a context of
> > > > signal handler.
> > > > 
> > > > # correct me if I'm wrong.
> > > > 
> > > > thus, mmu mode doesn't have this situation.
> > > 
> > > Yes, it does not have this specific issue. But see the top of the mail
> > > for other issues that are somewhat related.
> > > 
> > > > I'm attempting various ways; calling um's rt_sigreturn instead of
> > > > host's one, which doesn't work as host restore procedures (unblocking
> > > > masked signals, restoring register states, etc) aren't called.
> > > > 
> > > > I'll update here if I found a good direction, but would be great if
> > > > you see how it should be handled.
> > > 
> > > Can we please discuss possible solutions? We can figure out the details
> > > once it is clear how the interaction with the host should work.
> > 
> > I was wishing to update to you that I'm working on it.  So, your
> > comments are always helpful to me.  Thanks.
> > 
> > -- Hajime
> > 
> > > I still think that the idea of using the kernel task stack as the
> > > signal stack is really elegant. Actually, doing that in normal UML may
> > > be how we can fix the issues mentioned at the top of my mail. And for
> > > nommu, we can also use the host mcontext to jump back into userspace
> > > using a simple "return".
> > > 
> > > Conceptually it seems so simple.
> > > 
> > > Benjamin
> > > 
> > > 
> > > > 
> > > > -- Hajime
> > > > 
> > > > > I believe that this requires a second "userspace" sigaltstack in
> > > > > addition to the current "IRQ" sigaltstack. Then switching in between
> > > > > the two (note that the "userspace" one is also used for IRQs if those
> > > > > happen while userspace is executing).
> > > > > 
> > > > > So, in principle I would think something like:
> > > > >  * to jump into userspace, you would:
> > > > >     - block all signals
> > > > >     - set "userspace" sigaltstack
> > > > >     - setup mcontext for rt_sigreturn
> > > > >     - setup RSP for rt_sigreturn
> > > > >     - call rt_sigreturn syscall
> > > > >  * all signal handlers can (except pure IRQs):
> > > > >     - check on which stack they are
> > > > >       -> easy to detect whether we are in kernel mode
> > > > >     - for IRQs one can probably handle them directly (and return)
> > > > >     - in user mode:
> > > > >        + store mcontext location and information needed for 
> > > > > rt_sigreturn
> > > > >        + jump back into kernel task stack
> > > > >  * kernel task handler to continue would:
> > > > >     - set sigaltstack to IRQ stack
> > > > >     - fetch register from mcontext
> > > > >     - unblock all signals
> > > > >     - handle syscall/signal in whatever way needed
> > > > > 
> > > > > Now that I wrote about it, I am thinking that it might be possible to
> > > > > > just use the kernel task stack for the signal stack. One would probably
> > > > > > need to increase the kernel stack size a bit, but it would also mean
> > > > > that no special code is needed for "rt_sigreturn" handling. The rest
> > > > > would remain the same.
> > > > > 
> > > > > Thoughts?
> > > > > 
> > > > > Benjamin
> > > > > 
> > > > > > [SNIP]
> > > > > 
> > > > 
> > > 
> > 
> 
