On 01/02/2019 03:51, Peter Xu wrote: > On Thu, Jan 31, 2019 at 12:25:40PM +0000, Jean-Philippe Brucker wrote: >> On 31/01/2019 07:59, Peter Xu wrote: >>> On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote: >>>> Hi Peter, >>> >>> Hi, Jean, >>> >>>> >>>> On 30/01/2019 05:57, Peter Xu wrote: >>>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing >>>>> but that's actually already covered by invalidate_range(). Remove the >>>>> extra notifier and the chunks. >>>> >>>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If >>>> I understood correctly, when doing reclaim the kernel clears the young >>>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that >>>> new accesses from devices will go through the IOMMU, set the young bit >>>> again and the kernel can accurately track page use. I didn't see >>>> invalidate_range() being called by rmap or vmscan in this case, but >>>> might just be missing it. >>>> >>>> Two MMU notifiers are used for the young bit, clear_flush_young() and >>>> clear_young(). I believe the former should invalidate ATCs so that DMA >>>> accesses participate in PTE aging. Otherwise the kernel can't know that >>>> the device is still using entries marked 'old'. The latter, >>>> clear_young() is exempted from invalidating the secondary TLBs by >>>> mmu_notifier.h and the IOMMU driver doesn't need to implement it. >>> >>> Yes I agree most of your analysis, but IMHO the problem is that the >>> AMD IOMMU is not really participating in the PTE aging after all. >>> Please have a look at mn_clear_flush_young() below at [1] - it's >>> always returning zero, no matter whether the page has been accessed by >>> device or not. A real user of it could be someone like KVM (please >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the >>> shadow PTEs and do test-and-clear on that, then return the result to >>> the core mm. That's why I think currently the AMD driver was only >>> trying to use that as a way to do an extra flush however IMHO it's >>> redundant. >> >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the >> flush, since the level-1 page table is shared with the CPU. But removing >> the flush gets you in the following situation: >> >> (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE >> young and the entry is cached in the ATC. >> >> (2) The CPU does ptep_clear_flush_young_notify(), clears young, >> notices that the page is being used. >> >> (3) Device accesses $addr again. Given that we didn't invalidate the >> ATC in (2) it accesses the page directly, without going through the IOMMU. >> >> (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't >> have the young bit, which means the page isn't being used and can be >> reclaimed. >> >> That's not accurate since the page is being used by the device. At step >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and >> marks it young. >> >> I can agree that the clear_flush_young() notifier is too brutal for this >> purpose, since we send spurious ATC invalidation even when the PTE >> wasn't young (and ATC inv is expensive). There should be another MMU >> notifier "flush_young()" that is called by >> ptep_clear_flush_young_notify() only when the page was actually young. >> But for the moment it's the best we have to avoid the situation above. >> >> I don't know enough about mm to understand exactly how the >> page_referenced() information is used, but I believe the problem is only >> about accuracy and not correctness. So applying this patch might not >> break anything (after all, intel-svm.c never implemented the notifier) >> but I think I'll keep the notifier in my SVA rework [1]. > > I see your point. I'm not an expert of mm either, but I'd say AFAIU > this happens in CPU TLB too. Please have a look at > ptep_clear_flush_young(): > > int ptep_clear_flush_young(struct vm_area_struct *vma, > unsigned long address, pte_t *ptep) > { > /* > * On x86 CPUs, clearing the accessed bit without a TLB flush > * doesn't cause data corruption. [ It could cause incorrect > * page aging and the (mistaken) reclaim of hot pages, but the > * chance of that should be relatively low. ] > * > * So as a performance optimization don't flush the TLB when > * clearing the accessed bit, it will eventually be flushed by > * a context switch or a VM operation anyway. [ In the rare > * event of it not getting flushed for a long time the delay > * shouldn't really matter because there's no real memory > * pressure for swapout to react to. ] > */ > return ptep_test_and_clear_young(vma, address, ptep); > }
Aha I see. The arm64 version of ptep_clear_flush_young() does invalidate the TLB if the PTE was young (perhaps because we don't invalidate the TLB on context switch). For SVA I would have liked to simply invalidate the ATC whenever the CPU invalidates its TLB, but that falls apart if archs are doing different things... > So maybe it is a tradeoff between performance and accuracy. IMHO the > IOMMU cache flushing might affect the performance even more than CPU > TLB flushing if the invalidation command takes a long time to run > (e.g., amd_iommu_flush_page is far slower than a TLB flush > instruction, locks to take, queue commands, explicit wait for the > invalidation to happen, etc.) so it can potentially even drag down the > mm young bit access as a whole not to mention the future cache misses > from the device side. > > Even if you really want to make the young bit accurate for the SVA > work, IMHO you may still want to implement the lightweight version of > clear_young() too otherwise it might be inaccurate again in idle page > tracking. Yes there is another conversation about this on the new idle_pages proposal [1], which would never send any ATC invalidation. I'm not sure what we should do about it - making clear_young() flush the IOTLB seems way too expensive there as well, we'll probably want something more selective. > Another thing to mention is that disregarding the discussion about > young bit - IMO you should probably don't need the change_pte() which > I'm more confident with. I haven't dug too much into change_pte() yet, but I'll keep that in mind, thanks! Jean [1] https://patchwork.kernel.org/cover/10743133/#22446425 _______________________________________________ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu