> On Jun 11, 2019, at 2:20 PM, Thomas Hellström (VMware) 
> <thellst...@vmwopensource.org> wrote:
> 
> On 6/11/19 9:10 PM, Nadav Amit wrote:
>>> On Jun 11, 2019, at 11:26 AM, Thomas Hellström (VMware) 
>>> <thellst...@vmwopensource.org> wrote:
>>> 
>>> Hi, Nadav,
>>> 
>>> On 6/11/19 7:21 PM, Nadav Amit wrote:
>>>>> On Jun 11, 2019, at 5:24 AM, Thomas Hellström (VMware) 
>>>>> <thellst...@vmwopensource.org> wrote:
>>>>> 
>>>>> From: Thomas Hellstrom <thellst...@vmware.com>
>>>> [ snip ]
>>>> 
>>>>> +/**
>>>>> + * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
>>>>> + * @pte: Pointer to the pte
>>>>> + * @token: Page table token, see apply_to_pfn_range()
>>>>> + * @addr: The virtual page address
>>>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>>>> + * struct apply_as
>>>>> + *
>>>>> + * The function write-protects a pte and records the range in
>>>>> + * virtual address space of touched ptes for efficient range TLB flushes.
>>>>> + *
>>>>> + * Return: Always zero.
>>>>> + */
>>>>> +static int apply_pt_wrprotect(pte_t *pte, pgtable_t token,
>>>>> +                       unsigned long addr,
>>>>> +                       struct pfn_range_apply *closure)
>>>>> +{
>>>>> +        struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>>>> +        pte_t ptent = *pte;
>>>>> +
>>>>> +        if (pte_write(ptent)) {
>>>>> +                pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>>>> +
>>>>> +                ptent = pte_wrprotect(old_pte);
>>>>> +                ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>>>> +                aas->total++;
>>>>> +                aas->start = min(aas->start, addr);
>>>>> +                aas->end = max(aas->end, addr + PAGE_SIZE);
>>>>> +        }
>>>>> +
>>>>> +        return 0;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * struct apply_as_clean - Closure structure for apply_as_clean
>>>>> + * @base: struct apply_as we derive from
>>>>> + * @bitmap_pgoff: Address_space page offset of the first bit in @bitmap
>>>>> + * @bitmap: Bitmap with one bit for each page offset in the
>>>>> + * address_space range covered.
>>>>> + * @start: Address_space page offset of first modified pte relative
>>>>> + * to @bitmap_pgoff
>>>>> + * @end: Address_space page offset of last modified pte relative
>>>>> + * to @bitmap_pgoff
>>>>> + */
>>>>> +struct apply_as_clean {
>>>>> +        struct apply_as base;
>>>>> +        pgoff_t bitmap_pgoff;
>>>>> +        unsigned long *bitmap;
>>>>> +        pgoff_t start;
>>>>> +        pgoff_t end;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * apply_pt_clean - Leaf pte callback to clean a pte
>>>>> + * @pte: Pointer to the pte
>>>>> + * @token: Page table token, see apply_to_pfn_range()
>>>>> + * @addr: The virtual page address
>>>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>>>> + * struct apply_as_clean
>>>>> + *
>>>>> + * The function cleans a pte and records the range in
>>>>> + * virtual address space of touched ptes for efficient TLB flushes.
>>>>> + * It also records dirty ptes in a bitmap representing page offsets
>>>>> + * in the address_space, as well as the first and last of the bits
>>>>> + * touched.
>>>>> + *
>>>>> + * Return: Always zero.
>>>>> + */
>>>>> +static int apply_pt_clean(pte_t *pte, pgtable_t token,
>>>>> +                   unsigned long addr,
>>>>> +                   struct pfn_range_apply *closure)
>>>>> +{
>>>>> +        struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>>>> +        struct apply_as_clean *clean = container_of(aas, typeof(*clean), base);
>>>>> +        pte_t ptent = *pte;
>>>>> +
>>>>> +        if (pte_dirty(ptent)) {
>>>>> +                pgoff_t pgoff = ((addr - aas->vma->vm_start) >> PAGE_SHIFT) +
>>>>> +                        aas->vma->vm_pgoff - clean->bitmap_pgoff;
>>>>> +                pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>>>> +
>>>>> +                ptent = pte_mkclean(old_pte);
>>>>> +                ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>>>> +
>>>>> +                aas->total++;
>>>>> +                aas->start = min(aas->start, addr);
>>>>> +                aas->end = max(aas->end, addr + PAGE_SIZE);
>>>>> +
>>>>> +                __set_bit(pgoff, clean->bitmap);
>>>>> +                clean->start = min(clean->start, pgoff);
>>>>> +                clean->end = max(clean->end, pgoff + 1);
>>>>> +        }
>>>>> +
>>>>> +        return 0;
>>>> Usually, when a PTE is write-protected, or when a dirty-bit is cleared, the
>>>> TLB flush must be done while the page-table lock for that specific table is
>>>> taken (i.e., within apply_pt_clean() and apply_pt_wrprotect() in this 
>>>> case).
>>>> 
>>>> Otherwise, in the case of apply_pt_clean() for example, another core might
>>>> shortly after (before the TLB flush) write to the same page whose PTE was
>>>> changed. The dirty bit in such a case might not be set, and the change would
>>>> get lost.
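
To make the ordering I was describing concrete - a minimal sketch, not
code from this patch, that cleans one pte and flushes while the
page-table lock is still held (the helper name and calling context are
hypothetical):

#include <linux/mm.h>
#include <asm/tlbflush.h>

/* Hypothetical helper: clean one pte, flushing under the ptl. */
static void clean_pte_flush_locked(struct vm_area_struct *vma,
                                   pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

        if (pte_dirty(*pte)) {
                pte_t old_pte = ptep_modify_prot_start(vma, addr, pte);

                ptep_modify_prot_commit(vma, addr, pte, old_pte,
                                        pte_mkclean(old_pte));
                /* Flush before dropping the lock, closing the window. */
                flush_tlb_page(vma, addr);
        }
        pte_unmap_unlock(pte, ptl);
}
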
>>> Hmm. Let's assume that is the case; then we have two possible situations:
>>> 
>>> A: pt_clean
>>> 
>>> 1. That core's TLB entry is invalid. It will set the PTE dirty bit and 
>>> continue. The dirty bit will probably remain set after the TLB flush.
>> I guess you mean the PTE is not cached in the TLB.
> Yes.
>>> 2. That core's TLB entry is valid. It will just continue. The dirty bit 
>>> will remain clear after the TLB flush.
>>> 
>>> But I fail to see how having the TLB flush within the page table lock would 
>>> help in this case, since the writing core will never attempt to take it. In 
>>> any case, if such a race occurs, the corresponding bit in the bitmap would 
>>> have been set, and we've recorded that the page is dirty.
>> I don’t understand. What do you mean by “recorded that the page is dirty”?
>> IIUC, the PTE dirty bit is clear in this case - do you mean PG_dirty is set?
> 
> All PTEs we touch and clean are noted in the bitmap.
> 
>> To clarify, this code may actually work correctly on Intel CPUs, based on a
>> recent discussion with Dave Hansen. Apparently, most Intel CPUs set the
>> dirty bit in memory atomically when a page is first written.
>> 
>> But this is generic code, not arch-specific. My concern is that a
>> certain page might be written to, but would not be marked as dirty in either
>> the bitmap or the PTE.
> 
> Regardless of arch, we have four cases:
> 1. Writes occurring before we scan (and possibly modify) the PTE. Should be 
> handled correctly.
> 2. Writes occurring after the TLB flush. Should be handled correctly, unless 
> after a TLB flush the TLB-cached entry and the PTE differ on the dirty bit. 
> Then we could in theory go on writing without marking the PTE dirty. But that 
> would probably be an arch code bug: anything using tlb_gather_mmu() flushes 
> the TLB outside of the page table lock, and if, for example, 
> unmap_mapping_range() left the TLB entries and the PTEs in an inconsistent 
> state, that wouldn't be good.
> 3. Writes occurring after the PTE scan, but before the TLB flush, without us 
> modifying the PTE: That would be the same as a spurious TLB flush. It should 
> be harmless. The write will not be picked up in the bitmap, but the PTE dirty 
> bit will be set.
> 4. Writes occurring after we clear the dirty bit and before the TLB flush: 
> We will detect the write, since the bitmap bit is already set. If the write 
> continues after the TLB flush, we go to case 2.

Thanks for the detailed explanation. It does sound reasonable.
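
To restate case 4 as a timeline (illustrative only - the ranged flush at
the end stands in for whatever flush the caller of apply_pt_clean()
eventually issues):

/*
 * CPU 0: apply_pt_clean()                 CPU 1: writer
 * -----------------------                 -------------
 * sees pte_dirty(ptent)
 * ptep_modify_prot_start()
 * ptep_modify_prot_commit()  (pte clean)
 * __set_bit(pgoff, clean->bitmap)
 *                                         write through stale,
 *                                         still-writable TLB entry
 * flush_tlb_range(vma, start, end)
 *
 * The write is reported either way, since the bitmap bit for pgoff is
 * set regardless of when the write lands. A write after the flush sets
 * the pte dirty bit again, which is case 2.
 */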

> Note that for archs doing software PTE_DIRTY, this would be very similar to 
> soft-dirty, which also does batched TLB flushes…

Somewhat similar. But AFAIK, soft-dirty only allows you to do one of two
things:

- Clear the refs (using /proc/[pid]/clear_refs); or
- Read the refs (using /proc/[pid]/pagemap).

This interface does not provide the kind of atomicity that you are trying
to obtain.
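
For reference, the soft-dirty flow I have in mind, as a userspace sketch
(error handling and non-4K page sizes are omitted for brevity):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Writing "4" to clear_refs clears the soft-dirty bits. */
static void clear_soft_dirty(void)
{
        int fd = open("/proc/self/clear_refs", O_WRONLY);

        write(fd, "4", 1);
        close(fd);
}

/* Bit 55 of a pagemap entry is the soft-dirty bit. */
static int page_soft_dirty(uintptr_t addr)
{
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);

        pread(fd, &entry, sizeof(entry), (addr >> 12) * sizeof(entry));
        close(fd);
        return !!(entry & (1ULL << 55));
}

The clear and the read are two separate operations here, so there is no
per-page test-and-clear, which as I understand it is exactly what the
bitmap in this patch provides.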
