> On Jun 11, 2019, at 2:20 PM, Thomas Hellström (VMware)
> <thellst...@vmwopensource.org> wrote:
>
> On 6/11/19 9:10 PM, Nadav Amit wrote:
>>> On Jun 11, 2019, at 11:26 AM, Thomas Hellström (VMware)
>>> <thellst...@vmwopensource.org> wrote:
>>>
>>> Hi, Nadav,
>>>
>>> On 6/11/19 7:21 PM, Nadav Amit wrote:
>>>>> On Jun 11, 2019, at 5:24 AM, Thomas Hellström (VMware)
>>>>> <thellst...@vmwopensource.org> wrote:
>>>>>
>>>>> From: Thomas Hellstrom <thellst...@vmware.com>
>>>> [ snip ]
>>>>
>>>>> +/**
>>>>> + * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
>>>>> + * @pte: Pointer to the pte
>>>>> + * @token: Page table token, see apply_to_pfn_range()
>>>>> + * @addr: The virtual page address
>>>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>>>> + * struct apply_as
>>>>> + *
>>>>> + * The function write-protects a pte and records the range in
>>>>> + * virtual address space of touched ptes for efficient range TLB flushes.
>>>>> + *
>>>>> + * Return: Always zero.
>>>>> + */
>>>>> +static int apply_pt_wrprotect(pte_t *pte, pgtable_t token,
>>>>> +			      unsigned long addr,
>>>>> +			      struct pfn_range_apply *closure)
>>>>> +{
>>>>> +	struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>>>> +	pte_t ptent = *pte;
>>>>> +
>>>>> +	if (pte_write(ptent)) {
>>>>> +		pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>>>> +
>>>>> +		ptent = pte_wrprotect(old_pte);
>>>>> +		ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>>>> +		aas->total++;
>>>>> +		aas->start = min(aas->start, addr);
>>>>> +		aas->end = max(aas->end, addr + PAGE_SIZE);
>>>>> +	}
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * struct apply_as_clean - Closure structure for apply_as_clean
>>>>> + * @base: struct apply_as we derive from
>>>>> + * @bitmap_pgoff: Address_space Page offset of the first bit in @bitmap
>>>>> + * @bitmap: Bitmap with one bit for each page offset in the address_space
>>>>> + * range covered.
>>>>> + * @start: Address_space page offset of first modified pte relative
>>>>> + * to @bitmap_pgoff
>>>>> + * @end: Address_space page offset of last modified pte relative
>>>>> + * to @bitmap_pgoff
>>>>> + */
>>>>> +struct apply_as_clean {
>>>>> +	struct apply_as base;
>>>>> +	pgoff_t bitmap_pgoff;
>>>>> +	unsigned long *bitmap;
>>>>> +	pgoff_t start;
>>>>> +	pgoff_t end;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * apply_pt_clean - Leaf pte callback to clean a pte
>>>>> + * @pte: Pointer to the pte
>>>>> + * @token: Page table token, see apply_to_pfn_range()
>>>>> + * @addr: The virtual page address
>>>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>>>> + * struct apply_as_clean
>>>>> + *
>>>>> + * The function cleans a pte and records the range in
>>>>> + * virtual address space of touched ptes for efficient TLB flushes.
>>>>> + * It also records dirty ptes in a bitmap representing page offsets
>>>>> + * in the address_space, as well as the first and last of the bits
>>>>> + * touched.
>>>>> + *
>>>>> + * Return: Always zero.
>>>>> + */
>>>>> +static int apply_pt_clean(pte_t *pte, pgtable_t token,
>>>>> +			  unsigned long addr,
>>>>> +			  struct pfn_range_apply *closure)
>>>>> +{
>>>>> +	struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>>>> +	struct apply_as_clean *clean = container_of(aas, typeof(*clean), base);
>>>>> +	pte_t ptent = *pte;
>>>>> +
>>>>> +	if (pte_dirty(ptent)) {
>>>>> +		pgoff_t pgoff = ((addr - aas->vma->vm_start) >> PAGE_SHIFT) +
>>>>> +			aas->vma->vm_pgoff - clean->bitmap_pgoff;
>>>>> +		pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>>>> +
>>>>> +		ptent = pte_mkclean(old_pte);
>>>>> +		ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>>>> +
>>>>> +		aas->total++;
>>>>> +		aas->start = min(aas->start, addr);
>>>>> +		aas->end = max(aas->end, addr + PAGE_SIZE);
>>>>> +
>>>>> +		__set_bit(pgoff, clean->bitmap);
>>>>> +		clean->start = min(clean->start, pgoff);
>>>>> +		clean->end = max(clean->end, pgoff + 1);
>>>>> +	}
>>>>> +
>>>>> +	return 0;
>>>> Usually, when a PTE is write-protected, or when a dirty-bit is cleared, the
>>>> TLB flush must be done while the page-table lock for that specific table is
>>>> taken (i.e., within apply_pt_clean() and apply_pt_wrprotect() in this
>>>> case).
>>>>
>>>> Otherwise, in the case of apply_pt_clean() for example, another core might
>>>> shortly after (before the TLB flush) write to the same page whose PTE was
>>>> changed. The dirty-bit in such case might not be set, and the change might
>>>> get lost.
>>> Hmm. Let's assume that was the case; we have two possible situations:
>>>
>>> A: pt_clean
>>>
>>> 1. That core's TLB entry is invalid. It will set the PTE dirty bit and
>>> continue. The dirty bit will probably remain set after the TLB flush.
>> I guess you mean the PTE is not cached in the TLB.
> Yes.
>>> 2. That core's TLB entry is valid. It will just continue. The dirty bit
>>> will remain clear after the TLB flush.
>>>
>>> But I fail to see how having the TLB flush within the page table lock would
>>> help in this case, since the writing core will never attempt to take it? In
>>> any case, if such a race occurs, the corresponding bit in the bitmap would
>>> have been set and we've recorded that the page is dirty.
>> I don’t understand. What do you mean “recorded that the page is dirty”?
>> IIUC, the PTE is clear in this case - you mean PG_dirty is set?
>
> All PTEs we touch and clean are noted in the bitmap.
>
>> To clarify, this code actually may work correctly on Intel CPUs, based on a
>> recent discussion with Dave Hansen. Apparently, most Intel CPUs set the
>> dirty bit in memory atomically when a page is first written.
>>
>> But this is generic code and not arch-specific. My concern is that a
>> certain page might be written to, but would not be marked as dirty in either
>> the bitmap or the PTE.
>
> Regardless of arch, we have four cases:
> 1. Writes occurring before we scan (and possibly modify) the PTE. Should be
> handled correctly.
> 2. Writes occurring after the TLB flush. Should be handled correctly, unless
> after a TLB flush the TLB-cached entry and the PTE differ on the dirty bit.
> Then we could in theory go on writing without marking the PTE dirty. But that
> would probably be an arch code bug: I mean, anything using tlb_gather_mmu()
> would flush the TLB outside of the page table lock, and if, for example,
> unmap_mapping_range() left the TLB entries and the PTEs in an inconsistent
> state, that wouldn't be good.
> 3. Writes occurring after the PTE scan, but before the TLB flush, without us
> modifying the PTE: That would be the same as a spurious TLB flush. It should
> be harmless. The write will not be picked up in the bitmap, but the PTE dirty
> bit will be set.
> 4. Writes occurring after we clear the dirty bit and before the TLB flush:
> We will detect the write, since the bitmap bit is already set. If the write
> continues after the TLB flush, we go to 2.

Thanks for the detailed explanation. It does sound reasonable.
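Just so we are talking about the same ordering, here is roughly how I picture
the flow (illustrative sketch only; struct clean_walk, clean_one_pte() and
clean_flush() below are simplified stand-ins I made up, not the interfaces in
the actual patch): the pte is changed with the page-table lock held, only the
touched virtual range is recorded, and a single ranged flush is issued after
the walk.

	/* Simplified stand-in for struct apply_as in the patch. */
	struct clean_walk {
		struct vm_area_struct *vma;
		unsigned long start;	/* lowest touched virtual address */
		unsigned long end;	/* one past the highest touched address */
	};

	/* Called for each pte with the page-table lock held. */
	static void clean_one_pte(struct clean_walk *cw, pte_t *pte,
				  unsigned long addr)
	{
		pte_t old_pte, ptent;

		if (!pte_dirty(*pte))
			return;	/* case 3 above: pte left alone, a later write
				 * still sets its dirty bit */

		old_pte = ptep_modify_prot_start(cw->vma, addr, pte);
		ptent = pte_mkclean(old_pte);
		ptep_modify_prot_commit(cw->vma, addr, pte, old_pte, ptent);

		/* In the real patch the page offset is also noted in the
		 * bitmap at this point, which is what makes case 4 above
		 * safe while a stale TLB entry may still allow writes. */
		cw->start = min(cw->start, addr);
		cw->end = max(cw->end, addr + PAGE_SIZE);
	}

	/* After the whole range has been walked, page-table lock dropped. */
	static void clean_flush(struct clean_walk *cw)
	{
		if (cw->end > cw->start)
			flush_tlb_range(cw->vma, cw->start, cw->end);
	}

The window between ptep_modify_prot_commit() and flush_tlb_range() is exactly
where cases 3 and 4 above apply.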
> Note, for archs doing software PTE_DIRTY, that would be very similar to
> softdirty, which is also doing batched TLB flushes…

Somewhat similar. But AFAIK, soft-dirty allows you to do only one of two
things:

- Clear the refs (using /proc/[pid]/clear_refs); or
- Read the refs (using /proc/[pid]/pagemap)

This interface does not provide any atomicity like the one you try to obtain.
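For reference, the userspace side of that interface looks roughly like the
sketch below (error handling omitted; a 4 KiB page size and /proc/self are
assumed; the "4" written to clear_refs and pagemap bit 55 are documented in
Documentation/admin-guide/mm/soft-dirty.rst):

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	#define PAGE_SZ 4096UL			/* assumed page size */

	/* Writing "4" to clear_refs clears the soft-dirty bits of all ptes. */
	static void clear_soft_dirty(void)
	{
		int fd = open("/proc/self/clear_refs", O_WRONLY);

		write(fd, "4", 1);
		close(fd);
	}

	/* Each pagemap entry is 64 bits; bit 55 means "pte is soft-dirty". */
	static int page_is_soft_dirty(const void *addr)
	{
		uint64_t entry = 0;
		int fd = open("/proc/self/pagemap", O_RDONLY);

		pread(fd, &entry, sizeof(entry),
		      (off_t)((uintptr_t)addr / PAGE_SZ) * sizeof(entry));
		close(fd);

		return (entry >> 55) & 1;
	}

So the clear and the read are two separate operations; there is no single
operation that atomically samples and clears the dirty state the way the
bitmap in apply_pt_clean() does.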