> On Jan 11, 2019, at 11:23 AM, h...@zytor.com wrote: > > On January 11, 2019 11:03:30 AM PST, Linus Torvalds > <torva...@linux-foundation.org> wrote: >> On Fri, Jan 11, 2019 at 7:15 AM Josh Poimboeuf <jpoim...@redhat.com> >> wrote: >>>> Now, in the int3 handler can you take the faulting RIP and search >> for it in >>>> the “static-calls” table, writing the RIP+5 (offset) into R10 >> (return >>>> address) and the target into R11. You make the int3 handler to >> divert the >>>> code execution by changing pt_regs->rip to point to a new function >> that does: >>>> push R10 >>>> jmp __x86_indirect_thunk_r11 >>>> >>>> And then you are done. No? >>> >>> IIUC, that sounds pretty much like what Steven proposed: >> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181129122000.7fb4fb04%40gandalf.local.home&data=02%7C01%7Cnamit%40vmware.com%7Ca665a074940b4630e3fc08d677fa753b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636828314949874019&sdata=fs2uo%2BjL%2FV3rpzIHJ%2B3QoyHg4KhV%2B%2FUPmUOLpy8S8p8%3D&reserved=0 >>> I liked the idea, BUT, how would it work for callee-saved PV ops? In >>> that case there's only one clobbered register to work with (rax). >> >> Actually, there's a much simpler model now that I think about it. >> >> The BP fixup just fixes up %rip to to point to "bp_int3_handler". >> >> And that's just a random text address set up by "text_poke_bp()". >> >> So how about the static call rewriting simply do this: >> >> - for each static call: >> >> 1) create a fixup code stub that does >> >> push $returnaddressforTHIScall >> jmp targetforTHIScall >> >> 2) do >> >> on_each_cpu(do_sync_core, NULL, 1); >> >> to make sure all CPU's see this generated code >> >> 3) do >> >> text_poke_bp(addr, newcode, newlen, generatedcode); >> >> Ta-daa! Done. >> >> In fact, it turns out that even the extra "do_sync_core()" in #2 is >> unnecessary, because taking the BP will be serializing on the CPU that >> takes it, so we can skip it. >> >> End result: the text_poke_bp() function will do the two do_sync_core >> IPI's that guarantee that by the time it returns, no other CPU is >> using the generated code any more, so it can be re-used for the next >> static call fixup. >> >> Notice? No odd emulation, no need to adjust the stack in the BP >> handler, just the regular "return to a different IP". >> >> Now, there is a nasty special case with that stub, though. >> >> So nasty thing with the whole "generate a stub for each call" case: >> because it's dynamic and because of the re-use of the stub, you could >> be in the situation where: >> >> CPU1 CPU2 >> ---- ---- >> >> generate a stub >> on_each_cpu(do_sync_core..) >> text_poke_bp() >> ... >> >> rewrite to BP >> trigger the BP >> return to the stub >> fun the first instruction of the stub >> *INTERRUPT causes rescheduling* >> >> on_each_cpu(do_sync_core..) >> rewrite to good instruction >> on_each_cpu(do_sync_core..) >> >> free or re-generate the stub >> >> !! The stub is still in use !! >> >> So that simple "just generate the stub dynamically" isn't so simple >> after all. >> >> But it turns out that that is really simple to handle too. How do we do >> that? >> >> We do that by giving the BP handler *two* code sequences, and we make >> the BP handler pick one depending on whether it is returning to a >> "interrupts disabled" or "interrupts enabled" case. >> >> So the BP handler does this: >> >> - if we're returning with interrupts disabled, pick the simple stub >> >> - if we're returning with interrupts enabled, clkear IF in the return >> %rflags, and pick a *slightly* more complex stub: >> >> push $returnaddressforTHIScall >> sti >> jmp targetforTHIScall >> >> and now the STI shadow will mean that this sequence is uninterruptible. >> >> So we'd not do complex emulation of the call instruction at BP time, >> but we'd do that *trivial* change at BP time. >> >> This seems simple, doesn't need any temporary registers at all, and >> doesn't need any extra stack magic. It literally needs just a trivial >> sequence in poke_int3_handler(). >> >> The we'd change the end of poke_int3_handler() to do something like >> this instead: >> >> void *newip = bp_int3_handler; >> .. >> if (new == magic_static_call_bp_int3_handler) { >> if (regs->flags &X86_FLAGS_IF) { >> newip = magic_static_call_bp_int3_handler_sti; >> regs->flags &= ~X86_FLAGS_IF; >> } >> regs->ip = (unsigned long) newip; >> return 1; >> >> AAND now we're *really* done. >> >> Does anybody see any issues in this? >> >> Linus > > I still don't see why can't simply spin in the #BP handler until the patch is > complete. > > We can't have the #BP handler do any additional patching, as previously > discussed, but spinning should be perfectly safe. The simplest way to spin it > to just IRET; that both serializes and will re-take the exception if the > patch is still in progress. > > It requires exactly *no* awareness in the #BP handler, allows for the call to > be replaced with inline code or a simple NOP if desired (or vice versa, as > long as it is a single instruction.) > > If I'm missing something, then please educate me or point me to previous > discussion; I would greatly appreciate it.
One thing that comes to mind is that text_poke_bp() runs after patching the int3 and before patching in the instruction: on_each_cpu(do_sync_core, NULL, 1); If IRQs are disabled when the BP is hit, spinning can cause the system to hang.