On Wed, Dec 23, 2020 at 03:29:51PM -0700, Yu Zhao wrote: > I was hesitant to suggest the following because it isn't that straight > forward. But since you seem to be less concerned with the complexity, > I'll just bring it on the table -- it would take care of both ufd and > clear_refs_write, wouldn't it?
It certainly would since this is basically declaring "leaving stale TLB entries past mmap_read_lock" is now permitted as long as the pending flush counter is elevated under mmap_sem for reading. Anything that prevents uffd-wp to take mmap_write_lock looks great to me, anything, the below included, as long as it looks like easy to enforce and understand. And the below certainly is. My view is that the below is at the very least an improvement in terms of total complexity, compared to v5.10. At least it'll be documented. So what would be not ok to me is to depend on undocumented not guaranteed behavior in do_wp_page like the page_mapcount, which is what we had until now in clear_refs_write and in uffd-wp (but only if wrprotect raced against un-wrprotect, a tiny window if compared to clear_refs_write). Documenting that clearing pte_write and deferring the flush is allowed if mm_tlb_flush_pending was elevated before taking the PT lock is less complex and very well defined rule, if compared to what we had before in the page_mapcount dependency of clear_refs_write which was prone to break, and in fact it just did. We'll need a commentary in both userfaultfd_writeprotect and clear_refs_write that links to the below snippet. If we abstract it in a header we can hide there also a #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY=y && CONFIG_HAVE_ARCH_USERFAULTFD_WP=y && CONFIG_USERFAULTFD=y to make it even more explicit. However it may be simpler to keep it unconditional, I don't mind either ways. If it was up to me I'd write it to those 3 config options to be sure I remember where it comes from and to force any other user to register to be explicit they depend on that. > > diff --git a/mm/memory.c b/mm/memory.c > index 5e9ca612d7d7..af38c5ee327e 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault > *vmf) > goto unlock; > } > if (vmf->flags & FAULT_FLAG_WRITE) { > - if (!pte_write(entry)) > + if (!pte_write(entry)) { > + if (mm_tlb_flush_pending(vmf->vma->vm_mm)) > + flush_tlb_page(vmf->vma, vmf->address); > return do_wp_page(vmf); > + } > entry = pte_mkdirty(entry); > } > entry = pte_mkyoung(entry); >