On January 11, 2019 11:03:30 AM PST, Linus Torvalds <torva...@linux-foundation.org> wrote: >On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoim...@redhat.com> >wrote: >> >> > >> > Now, in the int3 handler can you take the faulting RIP and search >for it in >> > the “static-calls” table, writing the RIP+5 (offset) into R10 >(return >> > address) and the target into R11. You make the int3 handler to >divert the >> > code execution by changing pt_regs->rip to point to a new function >that does: >> > >> > push R10 >> > jmp __x86_indirect_thunk_r11 >> > >> > And then you are done. No? >> >> IIUC, that sounds pretty much like what Steven proposed: >> >> >https://lkml.kernel.org/r/20181129122000.7fb4f...@gandalf.local.home >> >> I liked the idea, BUT, how would it work for callee-saved PV ops? In >> that case there's only one clobbered register to work with (rax). > >Actually, there's a much simpler model now that I think about it. > >The BP fixup just fixes up %rip to to point to "bp_int3_handler". > >And that's just a random text address set up by "text_poke_bp()". > >So how about the static call rewriting simply do this: > > - for each static call: > > 1) create a fixup code stub that does > > push $returnaddressforTHIScall > jmp targetforTHIScall > > 2) do > > on_each_cpu(do_sync_core, NULL, 1); > > to make sure all CPU's see this generated code > > 3) do > > text_poke_bp(addr, newcode, newlen, generatedcode); > >Ta-daa! Done. > >In fact, it turns out that even the extra "do_sync_core()" in #2 is >unnecessary, because taking the BP will be serializing on the CPU that >takes it, so we can skip it. > >End result: the text_poke_bp() function will do the two do_sync_core >IPI's that guarantee that by the time it returns, no other CPU is >using the generated code any more, so it can be re-used for the next >static call fixup. > >Notice? No odd emulation, no need to adjust the stack in the BP >handler, just the regular "return to a different IP". > >Now, there is a nasty special case with that stub, though. > >So nasty thing with the whole "generate a stub for each call" case: >because it's dynamic and because of the re-use of the stub, you could >be in the situation where: > > CPU1 CPU2 > ---- ---- > > generate a stub > on_each_cpu(do_sync_core..) > text_poke_bp() > ... > > rewrite to BP > trigger the BP > return to the stub > fun the first instruction of the stub > *INTERRUPT causes rescheduling* > > on_each_cpu(do_sync_core..) > rewrite to good instruction > on_each_cpu(do_sync_core..) > > free or re-generate the stub > > !! The stub is still in use !! > >So that simple "just generate the stub dynamically" isn't so simple >after all. > >But it turns out that that is really simple to handle too. How do we do >that? > >We do that by giving the BP handler *two* code sequences, and we make >the BP handler pick one depending on whether it is returning to a >"interrupts disabled" or "interrupts enabled" case. > >So the BP handler does this: > > - if we're returning with interrupts disabled, pick the simple stub > > - if we're returning with interrupts enabled, clkear IF in the return >%rflags, and pick a *slightly* more complex stub: > > push $returnaddressforTHIScall > sti > jmp targetforTHIScall > >and now the STI shadow will mean that this sequence is uninterruptible. > >So we'd not do complex emulation of the call instruction at BP time, >but we'd do that *trivial* change at BP time. > >This seems simple, doesn't need any temporary registers at all, and >doesn't need any extra stack magic. It literally needs just a trivial >sequence in poke_int3_handler(). > >The we'd change the end of poke_int3_handler() to do something like >this instead: > > void *newip = bp_int3_handler; > .. > if (new == magic_static_call_bp_int3_handler) { > if (regs->flags &X86_FLAGS_IF) { > newip = magic_static_call_bp_int3_handler_sti; > regs->flags &= ~X86_FLAGS_IF; > } > regs->ip = (unsigned long) newip; > return 1; > >AAND now we're *really* done. > >Does anybody see any issues in this? > > Linus
I still don't see why can't simply spin in the #BP handler until the patch is complete. We can't have the #BP handler do any additional patching, as previously discussed, but spinning should be perfectly safe. The simplest way to spin it to just IRET; that both serializes and will re-take the exception if the patch is still in progress. It requires exactly *no* awareness in the #BP handler, allows for the call to be replaced with inline code or a simple NOP if desired (or vice versa, as long as it is a single instruction.) If I'm missing something, then please educate me or point me to previous discussion; I would greatly appreciate it. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.