On Sat, Feb 23, 2019 at 12:30 AM Nadav Amit <na...@vmware.com> wrote:
>
> > On Feb 22, 2019, at 3:59 PM, Andy Lutomirski <l...@kernel.org> wrote:
> >
> > On Fri, Feb 22, 2019 at 3:02 PM Jann Horn <ja...@google.com> wrote:
> >> On Fri, Feb 22, 2019 at 11:39 PM Nadav Amit <na...@vmware.com> wrote:
> >>>> On Feb 22, 2019, at 2:21 PM, Nadav Amit <na...@vmware.com> wrote:
> >>>>
> >>>>> On Feb 22, 2019, at 2:17 PM, Jann Horn <ja...@google.com> wrote:
> >>>>>
> >>>>> On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <na...@vmware.com> wrote:
> >>>>>>> On Feb 22, 2019, at 1:43 PM, Jann Horn <ja...@google.com> wrote:
> >>>>>>>
> >>>>>>> (adding some people from the text_poke series to the thread, removing
> >>>>>>> stable@)
> >>>>>>>
> >>>>>>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <l...@amacapital.net>
> >>>>>>> wrote:
> >>>>>>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov
> >>>>>>>>> <alexei.starovoi...@gmail.com> wrote:
> >>>>>>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote:
> >>>>>>>>>> On Fri, 22 Feb 2019 11:27:05 -0800
> >>>>>>>>>> Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not
> >>>>>>>>>>>> allow user accesses. The easiest way to do that is actually likely to
> >>>>>>>>>>>> use the "unsafe_get_user()" functions *without* doing a
> >>>>>>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault
> >>>>>>>>>>>> on a kernel access to user space.
> >>>>>>>>>>>
> >>>>>>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read()
> >>>>>>>>>>> and users pass both user and kernel addresses into it and expect
> >>>>>>>>>>> that the helper will actually try to read from that address.
> >>>>>>>>>>>
> >>>>>>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses
> >>>>>>>>>>> it will break the expectations.
> >>>>>>>>>>> How do we solve it in bpf_probe_read?
> >>>>>>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user byte-by-byte
> >>>>>>>>>>> in the loop?
> >>>>>>>>>>> That's doable, but people already complain that bpf_probe_read() is slow
> >>>>>>>>>>> and shows up in their perf report.
> >>>>>>>>>>
> >>>>>>>>>> We're changing kprobes to add a specific flag to say that we want to
> >>>>>>>>>> differentiate between kernel or user reads. Can this be done with
> >>>>>>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single
> >>>>>>>>>
> >>>>>>>>> so you're saying you will break existing kprobe scripts?
> >>>>>>>>> I don't think it's a good idea.
> >>>>>>>>> It's not acceptable to break bpf_probe_read uapi.
> >>>>>>>>
> >>>>>>>> If so, the uapi is wrong: a long-sized number does not reliably
> >>>>>>>> identify an address if you don’t separately know whether it’s a user
> >>>>>>>> or kernel address. s390x and 4G:4G x86_32 are the notable
> >>>>>>>> exceptions. I have lobbied for RISC-V and future x86_64 to join the
> >>>>>>>> crowd. I don’t know whether I’ll win this fight, but the uapi will
> >>>>>>>> probably have to change for at least s390x.
> >>>>>>>>
> >>>>>>>> What to do about existing scripts is a different question.
> >>>>>>>
> >>>>>>> This lack of logical separation between user and kernel addresses
> >>>>>>> might interact interestingly with the text_poke series, specifically
> >>>>>>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for patching"
> >>>>>>> (https://lore.kernel.org/lkml/20190221234451.17632-6-rick.p.edgecombe@intel.com/)
> >>>>>>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text poking"
> >>>>>>> (https://lore.kernel.org/lkml/20190221234451.17632-7-rick.p.edgecombe@intel.com/),
> >>>>>>> right? If someone manages to get a tracing BPF program to trigger in a
> >>>>>>> task that has switched to the patching mm, could they use
> >>>>>>> bpf_probe_write_user() - which uses probe_kernel_write() after
> >>>>>>> checking that KERNEL_DS isn't active and that access_ok() passes - to
> >>>>>>> overwrite kernel text that is mapped writable in the patching mm?
> >>>>>>
> >>>>>> Yes, this is a good point. I guess text_poke() should be defined with
> >>>>>> “__kprobes” and open-code memcpy.
> >>>>>>
> >>>>>> Does it sound reasonable?
> >>>>>
> >>>>> Doesn't __text_poke() as implemented in the proposed patch use a
> >>>>> couple other kernel functions, too? Like switch_mm_irqs_off() and
> >>>>> pte_clear() (which can be a call into a separate function on paravirt
> >>>>> kernels)?
> >>>>
> >>>> I will move the pte_clear() to be done after the poking mm was unloaded.
> >>>> Give me a few minutes to send a sketch of what I think should be done.
> >>>
> >>> Err.. You are right, I don’t see an easy way of preventing a kprobe from
> >>> being set on switch_mm_irqs_off(), and open-coding this monster is too
> >>> ugly.
> >>>
> >>> The reasonable solution seems to me as taking all the relevant pieces of
> >>> code (and data) that might be used during text-poking and encapsulating
> >>> them, so they will be set in a memory area which cannot be kprobe'd.
> >>> This can also be useful to write-protect data structures of code that
> >>> calls text_poke(), e.g., static-keys. It can also protect data on that
> >>> stack that is used during text_poke() from being overwritten from
> >>> another core.
> >>>
> >>> This solution is somewhat similar to Igor Stoppa’s idea of using
> >>> “enclaves” when doing write-rarely operations.
> >>>
> >>> Right now, I think that text_poke() will keep being susceptible to such
> >>> an attack, unless you have a better suggestion.
> >>
> >> A relatively simple approach might be to teach BPF not to run kprobe
> >> programs and such in contexts where current->mm isn't the active mm?
> >> Maybe using nmi_uaccess_okay(), or something like that? It looks like
> >> perf_callchain_user() also already uses that. Except that a lot of
> >> this code is x86-specific...
> >
> > This sounds like exactly the right solution. If you're running from
> > some unknown context (like NMI or tracing), then you should check
> > nmi_uaccess_okay(). I think we should just promote that to be a
> > non-arch-specific function (that returns true by default) and check it
> > the relevant bpf_probe_xyz() functions.
>
> I can do that, but notice that switch_mm_irqs_off() writes to
> cpu_tlbstate.loaded_mm before it actually writes to CR3. So there are still
> a couple of instructions (and the load_new_mm_cr3()) in between that a
> kprobe can be set on, no?

But you can't mark them as no-nmi :)  See the comment in
nmi_uaccess_okay() -- the code is intended to work correctly during
this window.
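
To make the window concrete, here is a rough userspace toy model, not kernel code. It assumes that the switch path publishes a "switching" marker in loaded_mm before it touches CR3 and only publishes the new mm afterwards, and that the check simply compares loaded_mm against current's mm. Every toy_* name below is hypothetical and exists only for this illustration; the real structures and helpers look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
struct toy_mm { int id; };

struct toy_cpu {
	const struct toy_mm *loaded_mm;  /* models cpu_tlbstate.loaded_mm */
	const struct toy_mm *cr3_mm;     /* models the mm that CR3 points at */
	const struct toy_mm *current_mm; /* models current->mm */
};

/* Hypothetical sentinel meaning "CR3 is being changed right now". */
static const struct toy_mm toy_mm_switching = { -1 };

/* Models the check: user access is okay only if loaded_mm is current's mm. */
static bool toy_uaccess_okay(const struct toy_cpu *cpu)
{
	if (cpu->loaded_mm != cpu->current_mm)
		return false;
	/* Outside the switch window, CR3 must agree with loaded_mm. */
	assert(cpu->cr3_mm == cpu->current_mm);
	return true;
}

/* Models a probe helper that refuses to write when the check fails. */
static int toy_probe_write_user(const struct toy_cpu *cpu)
{
	return toy_uaccess_okay(cpu) ? 0 : -1; /* -1 stands in for an errno */
}

/*
 * Models the mm switch as three separate steps and lets a "probe" fire
 * after each one. Returns how many of those three probes were refused.
 */
static int toy_switch_and_count(struct toy_cpu *cpu, const struct toy_mm *next)
{
	int refused = 0;

	cpu->loaded_mm = &toy_mm_switching;          /* step 1: mark the window */
	refused += toy_probe_write_user(cpu) != 0;
	cpu->cr3_mm = next;                          /* step 2: "write CR3" */
	refused += toy_probe_write_user(cpu) != 0;
	cpu->loaded_mm = next;                       /* step 3: publish new mm */
	refused += toy_probe_write_user(cpu) != 0;
	return refused;
}

/* Switch task mm -> patching mm -> task mm and count refused probes. */
static int toy_demo(void)
{
	static const struct toy_mm task_mm = { 1 }, poking_mm = { 2 };
	struct toy_cpu cpu = {
		.loaded_mm = &task_mm, .cr3_mm = &task_mm, .current_mm = &task_mm,
	};
	int refused = toy_probe_write_user(&cpu) != 0; /* before any switch */

	refused += toy_switch_and_count(&cpu, &poking_mm);
	refused += toy_switch_and_count(&cpu, &task_mm);
	return refused;
}
```

In this model a probe that fires anywhere between step 1 and step 3 is refused, and so is one that fires while the patching mm is loaded, because current's mm is still the task's mm. Only the probes before the first switch and after the switch back are allowed.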