> On Feb 22, 2019, at 2:17 PM, Jann Horn <ja...@google.com> wrote: > > On Fri, Feb 22, 2019 at 11:08 PM Nadav Amit <na...@vmware.com> wrote: >>> On Feb 22, 2019, at 1:43 PM, Jann Horn <ja...@google.com> wrote: >>> >>> (adding some people from the text_poke series to the thread, removing >>> stable@) >>> >>> On Fri, Feb 22, 2019 at 8:55 PM Andy Lutomirski <l...@amacapital.net> wrote: >>>>> On Feb 22, 2019, at 11:34 AM, Alexei Starovoitov >>>>> <alexei.starovoi...@gmail.com> wrote: >>>>>> On Fri, Feb 22, 2019 at 02:30:26PM -0500, Steven Rostedt wrote: >>>>>> On Fri, 22 Feb 2019 11:27:05 -0800 >>>>>> Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: >>>>>> >>>>>>>> On Fri, Feb 22, 2019 at 09:43:14AM -0800, Linus Torvalds wrote: >>>>>>>> >>>>>>>> Then we should still probably fix up "__probe_kernel_read()" to not >>>>>>>> allow user accesses. The easiest way to do that is actually likely to >>>>>>>> use the "unsafe_get_user()" functions *without* doing a >>>>>>>> uaccess_begin(), which will mean that modern CPU's will simply fault >>>>>>>> on a kernel access to user space. >>>>>>> >>>>>>> On bpf side the bpf_probe_read() helper just calls probe_kernel_read() >>>>>>> and users pass both user and kernel addresses into it and expect >>>>>>> that the helper will actually try to read from that address. >>>>>>> >>>>>>> If __probe_kernel_read will suddenly start failing on all user addresses >>>>>>> it will break the expectations. >>>>>>> How do we solve it in bpf_probe_read? >>>>>>> Call probe_kernel_read and if that fails call unsafe_get_user >>>>>>> byte-by-byte >>>>>>> in the loop? >>>>>>> That's doable, but people already complain that bpf_probe_read() is slow >>>>>>> and shows up in their perf report. >>>>>> >>>>>> We're changing kprobes to add a specific flag to say that we want to >>>>>> differentiate between kernel or user reads. Can this be done with >>>>>> bpf_probe_read()? If it's showing up in perf report, I doubt a single >>>>> >>>>> so you're saying you will break existing kprobe scripts? >>>>> I don't think it's a good idea. >>>>> It's not acceptable to break bpf_probe_read uapi. >>>> >>>> If so, the uapi is wrong: a long-sized number does not reliably identify >>>> an address if you don’t separately know whether it’s a user or kernel >>>> address. s390x and 4G:4G x86_32 are the notable exceptions. I have lobbied >>>> for RISC-V and future x86_64 to join the crowd. I don’t know whether I’ll >>>> win this fight, but the uapi will probably have to change for at least >>>> s390x. >>>> >>>> What to do about existing scripts is a different question. >>> >>> This lack of logical separation between user and kernel addresses >>> might interact interestingly with the text_poke series, specifically >>> "[PATCH v3 05/20] x86/alternative: Initialize temporary mm for >>> patching" >>> (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-6-rick.p.edgecombe%40intel.com%2F&data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&sdata=HAbnDcrBne64JyPuVUMKmM7nQk67F%2BFvjuXEn8TmHeo%3D&reserved=0) >>> and "[PATCH v3 06/20] x86/alternative: Use temporary mm for text >>> poking" >>> (https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20190221234451.17632-7-rick.p.edgecombe%40intel.com%2F&data=02%7C01%7Cnamit%40vmware.com%7Cf2513009ef734ecd6b0d08d69913a5ae%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636864707020821793&sdata=vNRIMKtFDy%2F3z5FlTwDiJY6VGEV%2FMHgQPTdFSFtCo4s%3D&reserved=0), >>> right? If someone manages to get a tracing BPF program to trigger in a >>> task that has switched to the patching mm, could they use >>> bpf_probe_write_user() - which uses probe_kernel_write() after >>> checking that KERNEL_DS isn't active and that access_ok() passes - to >>> overwrite kernel text that is mapped writable in the patching mm? >> >> Yes, this is a good point. I guess text_poke() should be defined with >> “__kprobes” and open-code memcpy. >> >> Does it sound reasonable? > > Doesn't __text_poke() as implemented in the proposed patch use a > couple other kernel functions, too? Like switch_mm_irqs_off() and > pte_clear() (which can be a call into a separate function on paravirt > kernels)?
I will move the pte_clear() to be done after the poking mm was unloaded. Give me a few minutes to send a sketch of what I think should be done.